Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7656

Implement geo box and distance queries using doc values.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Having geo box and distance queries available as both point and doc-values-based queries means we could use them with IndexOrDocValuesQuery.

      1. LUCENE-7656.patch
        37 kB
        Adrien Grand
      2. LUCENE-7656.patch
        37 kB
        Adrien Grand
      3. LUCENE-7656.patch
        35 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch. I did not want the doc-values distance query to perform distance computations all the time so I ended up working on a way to split the bounding box around the distance circle into 4096 smaller boxes, and pre-computing how they relate to the distance query (I reused the logic that we already had in LatLonPointDistanceQuery for the inner nodes). I also switched LatLonPointDistanceQuery to it (for the leaf nodes that return Relation.CELL_CROSSES_QUERY), and interestingly it yielded a 8% speedup when tested with IndexAndSearchOpenStreetMaps.

        I would appreciate if someone who is more comfortable with geo could review as maybe the way LatLonPointDistanceQuery computes relations between a box and a circle relies on assumptions that are not met in this new code?

        Show
        jpountz Adrien Grand added a comment - Here is a patch. I did not want the doc-values distance query to perform distance computations all the time so I ended up working on a way to split the bounding box around the distance circle into 4096 smaller boxes, and pre-computing how they relate to the distance query (I reused the logic that we already had in LatLonPointDistanceQuery for the inner nodes). I also switched LatLonPointDistanceQuery to it (for the leaf nodes that return Relation.CELL_CROSSES_QUERY), and interestingly it yielded a 8% speedup when tested with IndexAndSearchOpenStreetMaps . I would appreciate if someone who is more comfortable with geo could review as maybe the way LatLonPointDistanceQuery computes relations between a box and a circle relies on assumptions that are not met in this new code?
        Hide
        jpountz Adrien Grand added a comment -

        For the record, I also made sure that the factory methods for these new dv-based queries refer to IndexOrDocValuesQuery.

        Show
        jpountz Adrien Grand added a comment - For the record, I also made sure that the factory methods for these new dv-based queries refer to IndexOrDocValuesQuery .
        Hide
        jpountz Adrien Grand added a comment -

        Same patch as previously, but making LatLonPointDistanceQuery implement the scorerSupplier API so that IndexOrDocValuesQuery would work with it.

        Show
        jpountz Adrien Grand added a comment - Same patch as previously, but making LatLonPointDistanceQuery implement the scorerSupplier API so that IndexOrDocValuesQuery would work with it.
        Hide
        mikemccand Michael McCandless added a comment -

        Very cool! I will have a look at the patch ...

        Show
        mikemccand Michael McCandless added a comment - Very cool! I will have a look at the patch ...
        Hide
        mikemccand Michael McCandless added a comment -

        I like this change! It's nice you see a perf gain on the OSM benchmarks. I suppose it would help "big" distance queries more and maybe hurt "tiny" distance queries, since it does the up front work (the DistancePredicate, but that's the right tradeoff.

        It's a bit annoying that, if you use the IndexOrDocValuesQuery, all the same up front work is done twice, and one of them won't be used; maybe we could make it lazy? But that can wait, it's just an opto.

        Since you use bit shifting, it looks like the number of effective cells may be anywhere between 1024 and 4096 right? Do you think two straight integer divisions instead, which could get us usually to 4096 cells, is too costly per hit?

        maybe the way LatLonPointDistanceQuery computes relations between a box and a circle relies on assumptions that are not met in this new code

        I believe you are using it in essentially the same way as before, just different sized cells, so this should be fine.

        Show
        mikemccand Michael McCandless added a comment - I like this change! It's nice you see a perf gain on the OSM benchmarks. I suppose it would help "big" distance queries more and maybe hurt "tiny" distance queries, since it does the up front work (the DistancePredicate , but that's the right tradeoff. It's a bit annoying that, if you use the IndexOrDocValuesQuery , all the same up front work is done twice, and one of them won't be used; maybe we could make it lazy? But that can wait, it's just an opto. Since you use bit shifting, it looks like the number of effective cells may be anywhere between 1024 and 4096 right? Do you think two straight integer divisions instead, which could get us usually to 4096 cells, is too costly per hit? maybe the way LatLonPointDistanceQuery computes relations between a box and a circle relies on assumptions that are not met in this new code I believe you are using it in essentially the same way as before, just different sized cells, so this should be fine.
        Hide
        jpountz Adrien Grand added a comment -

        I suppose it would help "big" distance queries more and maybe hurt "tiny" distance queries, since it does the up front work

        I think the only scenario that gets worse is when the distance is so tiny that the distance range is always contained in a single BKD cell. As soon as you start having crossing cells, that cost is quickly amortized. For instance, say your index has 30 segments with one crossing cell each (which is still a best-case scenario), we already need to perform 30*1024~=30k distance computations. On the other hand, this change needs to do 4096*4~=16k up-front distance computations (regardless of the number of segments since it is computed for a whole query) so if it allows to save 1/2 distance computations, its cost is already amortized.

        the same up front work is done twice, and one of them won't be used

        True, this should be easy to fix!

        Since you use bit shifting, it looks like the number of effective cells may be anywhere between 1024 and 4096 right? Do you think two straight integer divisions instead, which could get us usually to 4096 cells, is too costly per hit?

        You are right about the fact that there are lost cells. Avoiding integer divisions was one reason in favor of bit shifting, but there was another one, which is that they do not create boxes that cross the dateline.

        That said, you make a good point that we should not have to both store and compute relations for those lost cells, let me look into fixing that.

        Show
        jpountz Adrien Grand added a comment - I suppose it would help "big" distance queries more and maybe hurt "tiny" distance queries, since it does the up front work I think the only scenario that gets worse is when the distance is so tiny that the distance range is always contained in a single BKD cell. As soon as you start having crossing cells, that cost is quickly amortized. For instance, say your index has 30 segments with one crossing cell each (which is still a best-case scenario), we already need to perform 30*1024~=30k distance computations. On the other hand, this change needs to do 4096*4~=16k up-front distance computations (regardless of the number of segments since it is computed for a whole query) so if it allows to save 1/2 distance computations, its cost is already amortized. the same up front work is done twice, and one of them won't be used True, this should be easy to fix! Since you use bit shifting, it looks like the number of effective cells may be anywhere between 1024 and 4096 right? Do you think two straight integer divisions instead, which could get us usually to 4096 cells, is too costly per hit? You are right about the fact that there are lost cells. Avoiding integer divisions was one reason in favor of bit shifting, but there was another one, which is that they do not create boxes that cross the dateline. That said, you make a good point that we should not have to both store and compute relations for those lost cells, let me look into fixing that.
        Hide
        jpountz Adrien Grand added a comment -

        Here is an updated patch that should address your comments:

        • It still uses shifting, but there are no lost cells anymore. As a side-effect, this also means we do not compute geo distances for those lost cells at initialization time so it initialization might have got a bit faster.
        • I wanted to look into initializing the distance predicate lazily but remembered that IndexSearcher might call Weight.scorer from multiple threads so this has some complexity that I'd like to delay to another issue. I did more benchmarking to check how much of an issue that is, and initialization constantly runs in less than half a millisecond on my machine, so it is not that annoying to do it twice I think.
        • I moved the distance predicate from Query to Weight. It has some memory footprint, so it is better if queries do not hold a reference to it (eg. if they are used as cache keys).
        • I replaced the Relation[] array with a byte[] that stores ordinals of the enum constants. This looked like a safe win as this 4x smaller array could be more friendly with the cpu cache.
        Show
        jpountz Adrien Grand added a comment - Here is an updated patch that should address your comments: It still uses shifting, but there are no lost cells anymore. As a side-effect, this also means we do not compute geo distances for those lost cells at initialization time so it initialization might have got a bit faster. I wanted to look into initializing the distance predicate lazily but remembered that IndexSearcher might call Weight.scorer from multiple threads so this has some complexity that I'd like to delay to another issue. I did more benchmarking to check how much of an issue that is, and initialization constantly runs in less than half a millisecond on my machine, so it is not that annoying to do it twice I think. I moved the distance predicate from Query to Weight. It has some memory footprint, so it is better if queries do not hold a reference to it (eg. if they are used as cache keys). I replaced the Relation[] array with a byte[] that stores ordinals of the enum constants. This looked like a safe win as this 4x smaller array could be more friendly with the cpu cache.
        Hide
        mikemccand Michael McCandless added a comment -

        I wanted to look into initializing the distance predicate lazily but remembered that IndexSearcher might call Weight.scorer from multiple threads so this has some complexity that I'd like to delay to another issue.

        Ahh, yes, hairy ... that's fine to postpone! The amortized cost of that initialization approaches zero as the index grows ...

        +1, patch looks great!

        Show
        mikemccand Michael McCandless added a comment - I wanted to look into initializing the distance predicate lazily but remembered that IndexSearcher might call Weight.scorer from multiple threads so this has some complexity that I'd like to delay to another issue. Ahh, yes, hairy ... that's fine to postpone! The amortized cost of that initialization approaches zero as the index grows ... +1, patch looks great!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit cd1be78e2cb9a9bc2e65d5adcc7cecca997330b4 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cd1be78 ]

        LUCENE-7656: Implement geo box/distance queries using doc values.

        Show
        jira-bot ASF subversion and git services added a comment - Commit cd1be78e2cb9a9bc2e65d5adcc7cecca997330b4 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cd1be78 ] LUCENE-7656 : Implement geo box/distance queries using doc values.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit cf943c545478e01a2c76013f1c31b96786cdd165 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf943c5 ]

        LUCENE-7656: Implement geo box/distance queries using doc values.

        Show
        jira-bot ASF subversion and git services added a comment - Commit cf943c545478e01a2c76013f1c31b96786cdd165 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf943c5 ] LUCENE-7656 : Implement geo box/distance queries using doc values.
        Hide
        jpountz Adrien Grand added a comment -

        For the record, the nightly benchmarks confirm the speedup. http://people.apache.org/~mikemccand/geobench.html#search-distance

        Show
        jpountz Adrien Grand added a comment - For the record, the nightly benchmarks confirm the speedup. http://people.apache.org/~mikemccand/geobench.html#search-distance
        Hide
        mikemccand Michael McCandless added a comment -

        For the record, the nightly benchmarks confirm the speedup.

        Nice!

        Show
        mikemccand Michael McCandless added a comment - For the record, the nightly benchmarks confirm the speedup. Nice!

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development