Lucene - Core
  1. Lucene - Core
  2. LUCENE-4942

Indexed non-point shapes index excessive terms

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Indexed non-point shapes are comprised of a set of terms that represent grid cells. Cells completely within the shape or cells on the intersecting edge that are at the maximum detail depth being indexed for the shape are denoted as "leaf" cells. Such cells have a trailing '+' at the end. Such tokens are actually indexed twice, one with the leaf byte and one without.

      The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' cells and so the tokens with '+' are completely redundant.

      The Recursive [algorithm] based PrefixTree Strategy better supports correct search of indexed non-point shapes than TermQuery does and the distinction is relevant. However, the foundational search algorithms used by this strategy (Intersects & Contains; the other 2 are based on these) could each be upgraded to deal with this correctly. Not trivial but very doable.

      In the end, spatial non-point indexes can probably be trimmed my ~40% by doing this.

      1. LUCENE-4942_non-point_excessive_terms.patch
        68 kB
        David Smiley
      2. LUCENE-4942_non-point_excessive_terms.patch
        61 kB
        David Smiley
      3. LUCENE-4942-clone.diff
        3 kB
        Ryan McKinley
      4. spatial.alg
        4 kB
        David Smiley

        Issue Links

          Activity

          Hide
          Ryan McKinley added a comment -

          Without the + (or equivalent) how do you know that everything below that is covered by the shape?

          Show
          Ryan McKinley added a comment - Without the + (or equivalent) how do you know that everything below that is covered by the shape?
          Hide
          David Smiley added a comment -

          You don't This is why I believe TermQueryStrategy is fundamentally flawed for indexing non-point shapes. Yet AFAIK it's the choice ElasticSearch wants to use (or at least wanted). In ES if you indexed a country and your search box is something small in the middle of that country, you won't match that country.

          To be clear I'm recommending two things:

          • Have TermQueryStrategy not index its leaves with the '+' – it doesn't use them.
          • Have RecursivePrefixTreeStrategy only index the leaf versions of those leaf cells, not a redundant non-leaf version. Some non-trivial code needs to change in a few of the search algorithms.

          In both cases, the semantics are the same; no new or fewer documents match. But the spatial index is ~40% smaller I figure, faster indexing as well. It's possible some of the search algorithms for RecursivePrefixTreeStrategy will be slightly slower since sometimes they'll need to visit an additional token at certain parts of the algorithms to check for both leaf and non-leaf indexed cells but I think it'll be quite negligible.

          Show
          David Smiley added a comment - You don't This is why I believe TermQueryStrategy is fundamentally flawed for indexing non-point shapes. Yet AFAIK it's the choice ElasticSearch wants to use (or at least wanted). In ES if you indexed a country and your search box is something small in the middle of that country, you won't match that country. To be clear I'm recommending two things: Have TermQueryStrategy not index its leaves with the '+' – it doesn't use them. Have RecursivePrefixTreeStrategy only index the leaf versions of those leaf cells, not a redundant non-leaf version. Some non-trivial code needs to change in a few of the search algorithms. In both cases, the semantics are the same; no new or fewer documents match. But the spatial index is ~40% smaller I figure, faster indexing as well. It's possible some of the search algorithms for RecursivePrefixTreeStrategy will be slightly slower since sometimes they'll need to visit an additional token at certain parts of the algorithms to check for both leaf and non-leaf indexed cells but I think it'll be quite negligible.
          Hide
          Ryan McKinley added a comment -

          I see – so only index the leaves and traverse the terms for each query rather then a pile of term queries.

          Sounds good, but it seems like benchmarking is the only way to know if it is a reasonable tradeoff!

          Show
          Ryan McKinley added a comment - I see – so only index the leaves and traverse the terms for each query rather then a pile of term queries. Sounds good, but it seems like benchmarking is the only way to know if it is a reasonable tradeoff!
          Hide
          David Smiley added a comment -

          There definitely needs to be benchmarking for spatial; but I feel confident in this case that that it'll be well worth it for RPT; I'm quite familiar with the algorithms in there. It's an unquestionable win-win for TermQueryStrategy.

          Show
          David Smiley added a comment - There definitely needs to be benchmarking for spatial; but I feel confident in this case that that it'll be well worth it for RPT; I'm quite familiar with the algorithms in there. It's an unquestionable win-win for TermQueryStrategy.
          Hide
          David Smiley added a comment -

          Somewhat related to this is my newfound realization that indexed non-point shapes will result in IntersectsPrefixTreeFilter (technically it's actually VisitorTemplate) scanning over these smallest grid cells / terms twice and thus calculate intersection twice – once with the leaf flag, once without. This is likely a major performance bug. It would be awkward to fix that right now, but it would be easy once there simply wasn't this redundant indexing of terms – hence this issue.

          Show
          David Smiley added a comment - Somewhat related to this is my newfound realization that indexed non-point shapes will result in IntersectsPrefixTreeFilter (technically it's actually VisitorTemplate) scanning over these smallest grid cells / terms twice and thus calculate intersection twice – once with the leaf flag, once without. This is likely a major performance bug. It would be awkward to fix that right now, but it would be easy once there simply wasn't this redundant indexing of terms – hence this issue.
          Hide
          David Smiley added a comment - - edited

          I'd like to solve this issue (excessive terms) and also address differentiating between fully-contained leaves vs approximated leaves (for LUCENE-5776) in one go tracked on this issue to avoid dealing with back-compat more than once. That is, just once we change how PrefixTree derivative strategies encode the term data, instead of doing over more than one issue. And I'm thinking on trunk wouldn't worry about the back-compat (it is trunk after all), and then the port to 5x would have to consider it – the down-side being some spatial code on trunk vs 5x may vary a bit. Perhaps the back-compat detection in 5x would work via a check for Version similar to Analyzer's having a version property that can optionally be set.

          I'm not sure how contentious it may be to simply forgo back-compat. Just re-index. And you're not affected if all you have is point data, which seems to be at least 80% of the users using spatial. And you're furthermore not affected if your pre-existing indexes have non-point data but the only predicate you use is Intersects (no Contains, no Within, no heatmaps). Again I'd guess that lobs off another 80% of users since Intersects is so common.

          Show
          David Smiley added a comment - - edited I'd like to solve this issue (excessive terms) and also address differentiating between fully-contained leaves vs approximated leaves (for LUCENE-5776 ) in one go tracked on this issue to avoid dealing with back-compat more than once. That is, just once we change how PrefixTree derivative strategies encode the term data, instead of doing over more than one issue. And I'm thinking on trunk wouldn't worry about the back-compat (it is trunk after all), and then the port to 5x would have to consider it – the down-side being some spatial code on trunk vs 5x may vary a bit. Perhaps the back-compat detection in 5x would work via a check for Version similar to Analyzer's having a version property that can optionally be set. I'm not sure how contentious it may be to simply forgo back-compat. Just re-index. And you're not affected if all you have is point data, which seems to be at least 80% of the users using spatial. And you're furthermore not affected if your pre-existing indexes have non-point data but the only predicate you use is Intersects (no Contains, no Within, no heatmaps). Again I'd guess that lobs off another 80% of users since Intersects is so common.
          Hide
          Ryan McKinley added a comment - - edited

          +1

          I'm not sure how contentious it may be to simply forgo back-compat

          I expect anyone would would be affected would rejoice at the tradeoff. As is, the people who would be affected either have very few documents or ginormous indexes.

          We could take a poll on solr-user to see if anyone is using RPT for non-points and a query is not Intersects (no need to worry about heatmaps... it has not been released yet!)

          Show
          Ryan McKinley added a comment - - edited +1 I'm not sure how contentious it may be to simply forgo back-compat I expect anyone would would be affected would rejoice at the tradeoff. As is, the people who would be affected either have very few documents or ginormous indexes. We could take a poll on solr-user to see if anyone is using RPT for non-points and a query is not Intersects (no need to worry about heatmaps... it has not been released yet!)
          Hide
          David Smiley added a comment -

          The attached patch does not have the "+" / "*" (approximated leaf vs contained leaf) leaf type differentiation; that can wait.

          Summary of patch changes:

          • CellTokenStream: removed the dual/redundant indexing it was doing for leaf cells. This simplified it, and I further simplified it to the point that CTS is now really a generic TokenStream for a BytesRefIterator you give it. I have a nocommit to rename CellTokenStream to BytesRefIteratorTokenStream.
          • Related to the CellTokenStream change, I refactored PrefixTreeStrategy a little to now have a protected createCellIteratorToIndex() and protected newCellToBytesRefIterator(), and added a CellToBytesRefIterator class. The particular arrangement paves the way for TokenStream re-use — LUCENE-5776 although leaves the actual re-use to occur later in a future patch on that issue.
          • TermQueryPrefixTreeStrategy overrides newCellToBytesRefIterator to return a CTBRI subclass that does not have the leaf byte (since this strategy doesn’t query for them).
          • Primary search-time changes were in AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps), WithinPrefixTreeFilter, and ContainsPrefixTreeFilter.
          • ContainsPrefixTreeFilter now does more leap-frogging than it used to; it’s probably a bit faster as a result.
          • Enhanced the toString()’s in the Filters to include the query shape.
          • (Refactoring) Cell.isLeaf() should always return true if it’s level == maxLevels, and I clarified that when cell.isLeaf is false then this means this cell is a “prefix” (effectively the opposite of a leaf) which means there are cells at further resolutions (greater levels). For Quad & Geohash PrefixTree’s, it’s an implementation detail that it doesn’t append the ‘+’ because doing so is redundant/implied.
          • (Refactoring) AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps) no longer has a hasIndexedLeaves boolean flag to supposedly make it faster for the all-points case. The checks where it might be relevant are very cheap so I’d rather keep this class simpler.

          Tests pass; I'll try precommit later. I've yet to try lucene/benchmark and examine the index size change.

          Show
          David Smiley added a comment - The attached patch does not have the "+" / "*" (approximated leaf vs contained leaf) leaf type differentiation; that can wait. Summary of patch changes: CellTokenStream: removed the dual/redundant indexing it was doing for leaf cells. This simplified it, and I further simplified it to the point that CTS is now really a generic TokenStream for a BytesRefIterator you give it. I have a nocommit to rename CellTokenStream to BytesRefIteratorTokenStream. Related to the CellTokenStream change, I refactored PrefixTreeStrategy a little to now have a protected createCellIteratorToIndex() and protected newCellToBytesRefIterator(), and added a CellToBytesRefIterator class. The particular arrangement paves the way for TokenStream re-use — LUCENE-5776 although leaves the actual re-use to occur later in a future patch on that issue. TermQueryPrefixTreeStrategy overrides newCellToBytesRefIterator to return a CTBRI subclass that does not have the leaf byte (since this strategy doesn’t query for them). Primary search-time changes were in AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps), WithinPrefixTreeFilter, and ContainsPrefixTreeFilter. ContainsPrefixTreeFilter now does more leap-frogging than it used to; it’s probably a bit faster as a result. Enhanced the toString()’s in the Filters to include the query shape. (Refactoring) Cell.isLeaf() should always return true if it’s level == maxLevels, and I clarified that when cell.isLeaf is false then this means this cell is a “prefix” (effectively the opposite of a leaf) which means there are cells at further resolutions (greater levels). For Quad & Geohash PrefixTree’s, it’s an implementation detail that it doesn’t append the ‘+’ because doing so is redundant/implied. (Refactoring) AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps) no longer has a hasIndexedLeaves boolean flag to supposedly make it faster for the all-points case. The checks where it might be relevant are very cheap so I’d rather keep this class simpler. Tests pass; I'll try precommit later. I've yet to try lucene/benchmark and examine the index size change.
          Hide
          David Smiley added a comment -

          Boo ya! It turns out the changes work with the previous index scheme with a trivial change. I tested this by temporarily subclassing RPT to return a CellToBytesRefIterator that has the redundant prefixes on leaves. I'll now create this as another test (subclassing the regular one) so that this stays tested.

          Show
          David Smiley added a comment - Boo ya! It turns out the changes work with the previous index scheme with a trivial change. I tested this by temporarily subclassing RPT to return a CellToBytesRefIterator that has the redundant prefixes on leaves. I'll now create this as another test (subclassing the regular one) so that this stays tested.
          Hide
          David Smiley added a comment -

          I ran a benchmark comparison. The improvements are sweet all around!
          33% smaller index
          35% less memory while indexing
          68% faster indexing (wow)
          44% faster searching (wow)

          I used a benchmark '.alg' file with quad tree, with reproducibly random circles to index and to search. Search was "Intersects". I attached the spatial.alg should someone want to see more parameters.

          Show
          David Smiley added a comment - I ran a benchmark comparison. The improvements are sweet all around! 33% smaller index 35% less memory while indexing 68% faster indexing (wow) 44% faster searching (wow) I used a benchmark '.alg' file with quad tree, with reproducibly random circles to index and to search. Search was "Intersects". I attached the spatial.alg should someone want to see more parameters.
          Hide
          Ryan McKinley added a comment -

          nice!

          Show
          Ryan McKinley added a comment - nice!
          Hide
          David Smiley added a comment -

          Here's a slightly updated patch, doing the CellTokenStream rename, and addressed the nocommits. ant precommit passes. I think this is ready to commit and get Jenkins beating on it.

          I think the Index tweaks planned for LUCENE-5579 need not bleed into this issue after all since I think back-compat will also come for free (no special code) and besides the leaf cell type differentiation should be set with a flag (do you want the differentiation or not?).

          Show
          David Smiley added a comment - Here's a slightly updated patch, doing the CellTokenStream rename, and addressed the nocommits. ant precommit passes. I think this is ready to commit and get Jenkins beating on it. I think the Index tweaks planned for LUCENE-5579 need not bleed into this issue after all since I think back-compat will also come for free (no special code) and besides the leaf cell type differentiation should be set with a flag (do you want the differentiation or not?).
          Hide
          Ryan McKinley added a comment -

          +1 – i did not run the benchmark, but was able to merge and have everything work

          great work!

          Show
          Ryan McKinley added a comment - +1 – i did not run the benchmark, but was able to merge and have everything work great work!
          Hide
          ASF subversion and git services added a comment -

          Commit 1665656 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1665656 ]

          LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data

          • Redundant prefix terms on leaf cells are no longer emitted;
          • Cell.isLeaf should always return true at maxLevels
          Show
          ASF subversion and git services added a comment - Commit 1665656 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1665656 ] LUCENE-4942 : Optimize spatial PrefixTreeStrategy indexes for non-point data Redundant prefix terms on leaf cells are no longer emitted; Cell.isLeaf should always return true at maxLevels
          Hide
          ASF subversion and git services added a comment -

          Commit 1665674 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665674 ]

          LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data

          • Redundant prefix terms on leaf cells are no longer emitted;
          • Cell.isLeaf should always return true at maxLevels
          Show
          ASF subversion and git services added a comment - Commit 1665674 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1665674 ] LUCENE-4942 : Optimize spatial PrefixTreeStrategy indexes for non-point data Redundant prefix terms on leaf cells are no longer emitted; Cell.isLeaf should always return true at maxLevels
          Hide
          Ryan McKinley added a comment -

          when fixing a failing test for SOLR-7228, it looks like the issue is caused by changes here.

          Without this patch, the admin UI shows all tokens having the same value. I mucked with the cone() function and now they show the right values. I don't think cone() is called in normal operation – only for the debug properties in the admin UI

          David Smiley can you take a look?

          Show
          Ryan McKinley added a comment - when fixing a failing test for SOLR-7228 , it looks like the issue is caused by changes here. Without this patch, the admin UI shows all tokens having the same value. I mucked with the cone() function and now they show the right values. I don't think cone() is called in normal operation – only for the debug properties in the admin UI David Smiley can you take a look?
          Hide
          ASF subversion and git services added a comment -

          Commit 1665949 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1665949 ]

          LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.

          Show
          ASF subversion and git services added a comment - Commit 1665949 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1665949 ] LUCENE-4942 : Fix BytesRefIteratorTokenStream's attribute clone method.
          Hide
          ASF subversion and git services added a comment -

          Commit 1665951 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665951 ]

          LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.

          Show
          ASF subversion and git services added a comment - Commit 1665951 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1665951 ] LUCENE-4942 : Fix BytesRefIteratorTokenStream's attribute clone method.
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development