Lucene - Core
  1. Lucene - Core
  2. LUCENE-5408

SerializedDVStrategy -- match geometries in DocValues

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've started work on a new SpatialStrategy implementation I'm tentatively calling SerializedDVStrategy. It's similar to the JtsGeoStrategy in Spatial-Solr-Sandbox but a little different in the details – certainly faster. Using Spatial4j 0.4's BinaryCodec, it'll serialize the shape to bytes (for polygons this in internally WKB format) and the strategy will put it in a BinaryDocValuesField. In practice the shape is likely a polygon but it needn't be. Then I'll implement a Filter that returns a DocIdSetIterator that evaluates a given document passed via advance(docid)) to see if the query shape matches a shape in DocValues. It's improper usage for it to be used in a situation where it will evaluate every document id via nextDoc(). And in practice the DocValues format chosen should be a disk resident one since each value tends to be kind of big.

      This spatial strategy in and of itself has no index; it's O(N) where N is the number of documents that get passed thru it. So it should be placed last in the query/filter tree so that the other queries limit the documents it needs to see. At a minimum, another query/filter to use in conjunction is another SpatialStrategy like RecursivePrefixTreeStrategy.

      Eventually once the PrefixTree grid encoding has a little bit more metadata, it will be possible to further combine the grid & this strategy in such a way that many documents won't need to be checked against the serialized geometry.

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          This is intermediate progress; it needs to be tested. And I hope to possible share a Bits based DocIdSet with Michael McCandless in LUCENE-5418. The sentiment in that issue about how to handle super-slow Filters is a problem here too.

          I had an epiphany last night that the current Spatial RPT grid and algorithm doesn't need to be modified to be able to differentiate the matching docs into confirmed & un-confirmed matches for common scenarios. As such, to prevent mis-use of the expensive Filter returned from this GeometryStrategy, I might force it to be paired with RecursivePrefixTreeStrategy. And then leave an expert method exposed to grab Bits or a Filter purely based on the Geometry DocValues check. ElasticSearch and Solr wouldn't use that but someone coding directly to Lucene would have the ability to wire things together in ways more flexible than are possible in ES or Solr. The most ideal way is to compute a fast pre-filter bitset separate from the slow post-filter, with user keyword queries and other filters in the middle. But the slow post-filter to operate best needs a side-artifact bitset computed when the pre-filter bitset is generated. I'll eventually be more clear in javadocs.

          Show
          David Smiley added a comment - This is intermediate progress; it needs to be tested. And I hope to possible share a Bits based DocIdSet with Michael McCandless in LUCENE-5418 . The sentiment in that issue about how to handle super-slow Filters is a problem here too. I had an epiphany last night that the current Spatial RPT grid and algorithm doesn't need to be modified to be able to differentiate the matching docs into confirmed & un-confirmed matches for common scenarios. As such, to prevent mis-use of the expensive Filter returned from this GeometryStrategy, I might force it to be paired with RecursivePrefixTreeStrategy. And then leave an expert method exposed to grab Bits or a Filter purely based on the Geometry DocValues check. ElasticSearch and Solr wouldn't use that but someone coding directly to Lucene would have the ability to wire things together in ways more flexible than are possible in ES or Solr. The most ideal way is to compute a fast pre-filter bitset separate from the slow post-filter, with user keyword queries and other filters in the middle. But the slow post-filter to operate best needs a side-artifact bitset computed when the pre-filter bitset is generated. I'll eventually be more clear in javadocs.
          Hide
          David Smiley added a comment -

          Technically this only requires LUCENE-4978 when the indexed data is a point, but you wouldn't have a need for this strategy if you were doing that.

          Show
          David Smiley added a comment - Technically this only requires LUCENE-4978 when the indexed data is a point, but you wouldn't have a need for this strategy if you were doing that.
          Hide
          David Smiley added a comment -

          Attached is a working patch with tests. I plan to commit it in 2 days.

          I renamed it from GeometryStrategy to SerializedDVStrategy which is more meaningful. If you call makeQuery you get an UnsupportedOperationException, and if you call makeFilter and then try to grab the iterator, you'll get an UnsupportedOperationException too. The main way the functionality is exposed is via a makeShapeValueSource() that returns Shape instances from FunctionValues.objectVal(doc). From this, I implemented a DistanceToShapeValueSource (from the center of the shape), and a ShapePredicateValueSource that's a boolean.

          Exposing this via Solr will come in another issue. That's not as trivial as the other strategies because this particular one needs to be used in a certain way in Solr – a PostFilter.

          Show
          David Smiley added a comment - Attached is a working patch with tests. I plan to commit it in 2 days. I renamed it from GeometryStrategy to SerializedDVStrategy which is more meaningful. If you call makeQuery you get an UnsupportedOperationException, and if you call makeFilter and then try to grab the iterator, you'll get an UnsupportedOperationException too. The main way the functionality is exposed is via a makeShapeValueSource() that returns Shape instances from FunctionValues.objectVal(doc). From this, I implemented a DistanceToShapeValueSource (from the center of the shape), and a ShapePredicateValueSource that's a boolean. Exposing this via Solr will come in another issue. That's not as trivial as the other strategies because this particular one needs to be used in a certain way in Solr – a PostFilter.
          Hide
          ASF subversion and git services added a comment -

          Commit 1568807 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1568807 ]

          LUCENE-5408: Spatial SerializedDVStrategy

          Show
          ASF subversion and git services added a comment - Commit 1568807 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1568807 ] LUCENE-5408 : Spatial SerializedDVStrategy
          Hide
          ASF subversion and git services added a comment -

          Commit 1568817 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1568817 ]

          LUCENE-5408: fixed tests; some strategies require DocValues

          Show
          ASF subversion and git services added a comment - Commit 1568817 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1568817 ] LUCENE-5408 : fixed tests; some strategies require DocValues
          Hide
          ASF subversion and git services added a comment -

          Commit 1568818 from David Smiley in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1568818 ]

          LUCENE-5408: Spatial SerializedDVStrategy

          Show
          ASF subversion and git services added a comment - Commit 1568818 from David Smiley in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1568818 ] LUCENE-5408 : Spatial SerializedDVStrategy

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development