Lucene - Core
  1. Lucene - Core
  2. LUCENE-4770

GeoShape intersects filter omitted matching docs

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1
    • Fix Version/s: 4.2, 6.0
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      SpatialPrefixTree#recursiveGetNodes uses an optimization that prevents
      recursion into the deepest tree level if a parent node in the penultimate
      level covers all its children. This produces a bug if the optimization
      happens both at indexing and at query/filter time.

      Original post

        Activity

        Hide
        Florian Schilling added a comment -

        This patch fixes the SpatialPrefixTree#recursiveGetNodes method and adds two tests for the edge cases. The first test is included at SpatialPrefixTreeTest#testEdgeCases and tests an rectangular query on a rectangular shaped area. The second test in located at JtsPolygonTest#testEdgeCases and tests an rectangular query on a polygonal shaped area.

        Show
        Florian Schilling added a comment - This patch fixes the SpatialPrefixTree#recursiveGetNodes method and adds two tests for the edge cases. The first test is included at SpatialPrefixTreeTest#testEdgeCases and tests an rectangular query on a rectangular shaped area. The second test in located at JtsPolygonTest#testEdgeCases and tests an rectangular query on a polygonal shaped area.
        Hide
        David Smiley added a comment -

        The irony! I was literally in the middle of correcting this problem.

        Show
        David Smiley added a comment - The irony! I was literally in the middle of correcting this problem.
        Hide
        David Smiley added a comment -

        The fix I was working on was to remove the optimization altogether. With it removed, recursiveGetNodes() can be rewritten to be remarkably simpler and combined with the Point specialization of this method into one. I've already done this locally and the code is easier to read and debug. However the optimization is probably a noticeable benefit for the TermQuery strategy. Perhaps this implementation could move to TermQueryPrefixTreeStrategy. Alternatively it would be cool if the nodes could be pruned after the fact (in the strategy) but that would be hard since nodes are flattened into a list. I'm leaning towards moving this implementation to the strategy where I can see ways it could even be optimized more by bubbling this pruning up from now only the bottom level but up additional levels. Thoughts?

        p.s. the condition subCells.size() == node.getSubCellsSize() && !inclParents is not enough, it must also be true that node.getLevel() != 0 because the 0 node is not indexable (no empty string).

        Show
        David Smiley added a comment - The fix I was working on was to remove the optimization altogether. With it removed, recursiveGetNodes() can be rewritten to be remarkably simpler and combined with the Point specialization of this method into one. I've already done this locally and the code is easier to read and debug. However the optimization is probably a noticeable benefit for the TermQuery strategy. Perhaps this implementation could move to TermQueryPrefixTreeStrategy. Alternatively it would be cool if the nodes could be pruned after the fact (in the strategy) but that would be hard since nodes are flattened into a list. I'm leaning towards moving this implementation to the strategy where I can see ways it could even be optimized more by bubbling this pruning up from now only the bottom level but up additional levels. Thoughts? p.s. the condition subCells.size() == node.getSubCellsSize() && !inclParents is not enough, it must also be true that node.getLevel() != 0 because the 0 node is not indexable (no empty string).
        Hide
        Florian Schilling added a comment -

        The bug fix I worked on was simply a lucene version of the suggestion in the original post. The test cases I implemented should also work with your implementation of optimizations independently. Also moving the implementation to TermQueryPrefixTreeStrategy is a reasonable idea, since handling the data should be part of the strategy level.

        Show
        Florian Schilling added a comment - The bug fix I worked on was simply a lucene version of the suggestion in the original post. The test cases I implemented should also work with your implementation of optimizations independently. Also moving the implementation to TermQueryPrefixTreeStrategy is a reasonable idea, since handling the data should be part of the strategy level.
        Hide
        David Smiley added a comment -

        The attached patch:

        • Adds a "simplify" boolean argument to SpatialPrefixTree.getNodes(). If enabled, getNodes() will perform the aforementioned optimization. FYI it results in ~20-25% reduction in cells.
        • Enhanced the simplification algorithm to bubble up to higher levels, pruning away more cells. It becomes exponentially less likely to happen as you move up but it can and will happen.
        • The TermQuery strategy won't simplify at index time (this is a change) and will simplify at search time. Note indexed non-point shapes will use ~20% more disk.
        • The RecursivePrefixTree strategy will continue to simplify indexed shapes. It has a search-time algorithm that doesn't use PrefixTree.getNodes().
        • Simplified the API subclassing contract for SpatialPrefixTree, and thus removed an overriding method in both Geohash & Quad implementations.

        p.s. Thanks so much for developing tests, Florian!

        If there are no problems discussed with this patch then I plan to commit it in ~32 hours.

        Show
        David Smiley added a comment - The attached patch: Adds a "simplify" boolean argument to SpatialPrefixTree.getNodes(). If enabled, getNodes() will perform the aforementioned optimization. FYI it results in ~20-25% reduction in cells. Enhanced the simplification algorithm to bubble up to higher levels, pruning away more cells. It becomes exponentially less likely to happen as you move up but it can and will happen. The TermQuery strategy won't simplify at index time (this is a change) and will simplify at search time. Note indexed non-point shapes will use ~20% more disk. The RecursivePrefixTree strategy will continue to simplify indexed shapes. It has a search-time algorithm that doesn't use PrefixTree.getNodes(). Simplified the API subclassing contract for SpatialPrefixTree, and thus removed an overriding method in both Geohash & Quad implementations. p.s. Thanks so much for developing tests, Florian! If there are no problems discussed with this patch then I plan to commit it in ~32 hours.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] David Wayne Smiley
        http://svn.apache.org/viewvc?view=revision&revision=1446452

        LUCENE-4770: Optional SpatialPrefixTree getNodes simplification

        Show
        Commit Tag Bot added a comment - [trunk commit] David Wayne Smiley http://svn.apache.org/viewvc?view=revision&revision=1446452 LUCENE-4770 : Optional SpatialPrefixTree getNodes simplification
        Hide
        David Smiley added a comment -

        Committed to 4.2 & 5.0 with this CHANGES.txt entry in the bugs area (although it is also applicable as an optimization):

        • LUCENE-4770: If spatial's TermQueryPrefixTreeStrategy was used to search
          indexed non-point shapes, then there was an edge case where a query should
          find a shape but it didn't. The fix is the removal of an optimization that
          simplifies some leaf cells into a parent. The index data for such a field is
          now ~20% larger. This optimization is still done for the query shape, and for
          indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization
          is enhanced to roll up beyond the bottom cell level. (David Smiley,
          Florian Schilling)

        Thanks again for your help with the tests, Florian.

        Show
        David Smiley added a comment - Committed to 4.2 & 5.0 with this CHANGES.txt entry in the bugs area (although it is also applicable as an optimization): LUCENE-4770 : If spatial's TermQueryPrefixTreeStrategy was used to search indexed non-point shapes, then there was an edge case where a query should find a shape but it didn't. The fix is the removal of an optimization that simplifies some leaf cells into a parent. The index data for such a field is now ~20% larger. This optimization is still done for the query shape, and for indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization is enhanced to roll up beyond the bottom cell level. (David Smiley, Florian Schilling) Thanks again for your help with the tests, Florian.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] David Wayne Smiley
        http://svn.apache.org/viewvc?view=revision&revision=1446453

        LUCENE-4770: Optional SpatialPrefixTree getNodes simplification

        Show
        Commit Tag Bot added a comment - [branch_4x commit] David Wayne Smiley http://svn.apache.org/viewvc?view=revision&revision=1446453 LUCENE-4770 : Optional SpatialPrefixTree getNodes simplification
        Hide
        Simon Willnauer added a comment - - edited

        @david is this change backwards compatible? I mean if I used the unpatched version to index will I be able to search correctly with the new version?

        Show
        Simon Willnauer added a comment - - edited @david is this change backwards compatible? I mean if I used the unpatched version to index will I be able to search correctly with the new version?
        Hide
        David Smiley added a comment -

        The problem this bug fixes will still be present if you simply use the new code without reindexing. It should otherwise work fine.

        Show
        David Smiley added a comment - The problem this bug fixes will still be present if you simply use the new code without reindexing. It should otherwise work fine.
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            David Smiley
            Reporter:
            Florian Schilling
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development