Lucene - Core
  1. Lucene - Core
  2. LUCENE-6712

GeoPointField should cut over to DocValues for boundary filtering

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently GeoPointField queries only use the Terms Dictionary for ranges that fall within and on the boundary of the query shape. For boundary ranges the full precision terms are iterated, for within ranges the postings list is used.

      Instead of iterating full precision terms for boundary ranges, this enhancement cuts over to DocValues for post-filtering boundary terms. This allows us to increase precisionStep for GeoPointField thereby reducing the number of terms and the size of the index. This enhancement should also provide a boost in query performance since visiting more docs and fewer terms should be more efficient than visiting fewer docs and more terms.

      1. LUCENE-6712.patch
        31 kB
        Nicholas Knize
      2. LUCENE-6712.patch
        19 kB
        Nicholas Knize
      3. LUCENE-6712.patch
        15 kB
        Nicholas Knize

        Activity

        Hide
        Nicholas Knize added a comment - - edited

        Initial patch with the following changes:

        • adds a GeoPointTermQueryConstantScoreWrapper for cutting over to doc values on boundary terms
        • adds a GeoPointQueryPostFilter interface for Query specific filtering
        • reduces precision_step from 6 to 8 (8 terms per point instead of 11)

        Initial rough performance benchmarks indicate ~20% boost in query performance and ~42% smaller index (still need to run some due diligence on these numbers).

        Show
        Nicholas Knize added a comment - - edited Initial patch with the following changes: adds a GeoPointTermQueryConstantScoreWrapper for cutting over to doc values on boundary terms adds a GeoPointQueryPostFilter interface for Query specific filtering reduces precision_step from 6 to 8 (8 terms per point instead of 11) Initial rough performance benchmarks indicate ~20% boost in query performance and ~42% smaller index (still need to run some due diligence on these numbers).
        Hide
        Nicholas Knize added a comment -

        Updated patch includes the following:

        • added GeoPointTermQueryConstantScoreWrapper accidentally left out of last patch (forgot to add the file)
        • reduces precision_step to 9 to further reduce range recursion and index size
        • removed some duplicate diffs from LUCENE-6704

        Due Diligence Benchmarks (on 60M point test data):

        index time: 454.891 ms from 633.196 ms (28% performance gain)
        index size: 3.9G from 4.7G (17% reduction)
        search time: 0.018 ms/query from 0.028 ms/query (28% performance gain)

        Show
        Nicholas Knize added a comment - Updated patch includes the following: added GeoPointTermQueryConstantScoreWrapper accidentally left out of last patch (forgot to add the file) reduces precision_step to 9 to further reduce range recursion and index size removed some duplicate diffs from LUCENE-6704 Due Diligence Benchmarks (on 60M point test data): index time: 454.891 ms from 633.196 ms ( 28% performance gain ) index size: 3.9G from 4.7G ( 17% reduction ) search time: 0.018 ms/query from 0.028 ms/query ( 28% performance gain )
        Hide
        Michael McCandless added a comment -

        This looks great! The performance improvements are awesome, though,
        this is mixing up separate changes I think? One change is cutover to
        doc values for the point filtering of each lat/lon, and the other is
        changing the lower detail level and higher prec step?

        You are using SortedNumeric DVs, which allows for multiple values per
        doc, but at search time it seems like you only look at the first
        value? Shouldn't you iterate through all values and accept the docs
        if any of them were in-bounds? Can you add a test case that exposes
        this? And I guess update javadocs to make it clear multi-valued is
        allowed? Hmm and we should remove mention of LUCENE-6481
        from the GeoPointField javadocs (it's fixed!).

        Can we rename GeoPointPostFilter.filter to .accept? Just to make it
        clear that a true value returned means you keep it...

        Why even have a DefaultGeoPointQueryPostFilter (bbox shape)? Seems
        like it should be abstract and each query impl should provide it? And
        we should add a BBoxGeoPointTermsEnum?

        Actually why make a separate GeoPointPostFilter interface? Couldn't
        GeoPointTermsEnum just have an abstract acceptLatLon method?

        It looks like you continue using full precision terms (to the detail
        level) to approximate the shape's boundary right? But, with this
        cutover to DVs, you now have the freedom to use lower precision terms
        even on the boundaries... not sure it's necessary/helpful, but it
        would reduce max number of terms in the adversary cases, but mean more
        post-filtering.

        Show
        Michael McCandless added a comment - This looks great! The performance improvements are awesome, though, this is mixing up separate changes I think? One change is cutover to doc values for the point filtering of each lat/lon, and the other is changing the lower detail level and higher prec step? You are using SortedNumeric DVs, which allows for multiple values per doc, but at search time it seems like you only look at the first value? Shouldn't you iterate through all values and accept the docs if any of them were in-bounds? Can you add a test case that exposes this? And I guess update javadocs to make it clear multi-valued is allowed? Hmm and we should remove mention of LUCENE-6481 from the GeoPointField javadocs (it's fixed!). Can we rename GeoPointPostFilter.filter to .accept? Just to make it clear that a true value returned means you keep it... Why even have a DefaultGeoPointQueryPostFilter (bbox shape)? Seems like it should be abstract and each query impl should provide it? And we should add a BBoxGeoPointTermsEnum? Actually why make a separate GeoPointPostFilter interface? Couldn't GeoPointTermsEnum just have an abstract acceptLatLon method? It looks like you continue using full precision terms (to the detail level) to approximate the shape's boundary right? But, with this cutover to DVs, you now have the freedom to use lower precision terms even on the boundaries... not sure it's necessary/helpful, but it would reduce max number of terms in the adversary cases, but mean more post-filtering.
        Hide
        Nicholas Knize added a comment -

        Awesome! Thanks for the review Mike! Updated patch to address comments is attached.

        this is mixing up separate changes I think? One change is cutover to doc values for the point filtering of each lat/lon, and the other is changing the lower detail level and higher prec step?

        Indeed. The former gives search performance improvements, the latter gives indexing performance improvements. I can split these into 2 patches if desired? That way we can separately investigate the impact of changing the precision value?

        Shouldn't you iterate through all values and accept the docs if any of them were in-bounds? Can you add a test case that exposes this?

        ++ Thanks for pointing that out! I had intended to change that. Fixed in the attached patch - I also added explicit multi-valued documents and testing to cover this. Random multi-valued documents would be nice, though I don't think it blocks the patch?

        Couldn't GeoPointTermsEnum just have an abstract acceptLatLon method?

        ++ I had gone back and forth about this a couple times. With DV post filtering it makes more sense to now have GeoPointTermsEnum be abstract with an abstract postFilter method. Before, most of the logic was shared, only crosses and within were fully overridden in Poly and Distance query classes. I went ahead and made the change in the attached patch.

        It looks like you continue using full precision terms to approximate the shape's boundary right?

        No, the Range instances are now using lower precision terms for the boundaries (up to PRECISION_STEP * MAX_SHIFT - which works out to no higher than level 18). GPTQConstantScoreWrapper iterates the docIds in the postings list. So full precision terms (32 > level >18) are never used (really just wasting space in the index). I suppose I could modify GeoPointField to only index up to a shift of PRECISION_STEP * MAX_SHIFT and further reduce the index size?

        Show
        Nicholas Knize added a comment - Awesome! Thanks for the review Mike! Updated patch to address comments is attached. this is mixing up separate changes I think? One change is cutover to doc values for the point filtering of each lat/lon, and the other is changing the lower detail level and higher prec step? Indeed. The former gives search performance improvements, the latter gives indexing performance improvements. I can split these into 2 patches if desired? That way we can separately investigate the impact of changing the precision value? Shouldn't you iterate through all values and accept the docs if any of them were in-bounds? Can you add a test case that exposes this? ++ Thanks for pointing that out! I had intended to change that. Fixed in the attached patch - I also added explicit multi-valued documents and testing to cover this. Random multi-valued documents would be nice, though I don't think it blocks the patch? Couldn't GeoPointTermsEnum just have an abstract acceptLatLon method? ++ I had gone back and forth about this a couple times. With DV post filtering it makes more sense to now have GeoPointTermsEnum be abstract with an abstract postFilter method. Before, most of the logic was shared, only crosses and within were fully overridden in Poly and Distance query classes. I went ahead and made the change in the attached patch. It looks like you continue using full precision terms to approximate the shape's boundary right? No, the Range instances are now using lower precision terms for the boundaries (up to PRECISION_STEP * MAX_SHIFT - which works out to no higher than level 18). GPTQConstantScoreWrapper iterates the docIds in the postings list. So full precision terms (32 > level >18) are never used (really just wasting space in the index). I suppose I could modify GeoPointField to only index up to a shift of PRECISION_STEP * MAX_SHIFT and further reduce the index size?
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize! I'll look more closely at the patch, but quick responses:

        The former gives search performance improvements, the latter gives indexing performance improvements. I can split these into 2 patches if desired? That way we can separately investigate the impact of changing the precision value?

        I don't think we need to separate ...

        Random multi-valued documents would be nice, though I don't think it blocks the patch?

        I agree.

        No, the Range instances are now using lower precision terms for the boundaries

        Oh I see, good! Yeah seems like we shouldn't index any higher precision terms if we never use them in the queries? Hmm, though this may be tricky with NumericType (it always indexes the full precision term, and then shifts of precStep off that).

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ! I'll look more closely at the patch, but quick responses: The former gives search performance improvements, the latter gives indexing performance improvements. I can split these into 2 patches if desired? That way we can separately investigate the impact of changing the precision value? I don't think we need to separate ... Random multi-valued documents would be nice, though I don't think it blocks the patch? I agree. No, the Range instances are now using lower precision terms for the boundaries Oh I see, good! Yeah seems like we shouldn't index any higher precision terms if we never use them in the queries? Hmm, though this may be tricky with NumericType (it always indexes the full precision term, and then shifts of precStep off that).
        Hide
        Michael McCandless added a comment -

        The latest patch looks great, I'll commit soon...

        Show
        Michael McCandless added a comment - The latest patch looks great, I'll commit soon...
        Hide
        ASF subversion and git services added a comment -

        Commit 1694156 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1694156 ]

        LUCENE-6712: use doc-values for GeoPointField boundary cell post filtering

        Show
        ASF subversion and git services added a comment - Commit 1694156 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1694156 ] LUCENE-6712 : use doc-values for GeoPointField boundary cell post filtering
        Hide
        Michael McCandless added a comment -
        Show
        Michael McCandless added a comment - Thanks Nicholas Knize !
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Nicholas Knize
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development