Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7069

Add LatLonPoint.nearest to find closest indexed point to a given query point

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1, master (7.0)
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      KD trees (used by Lucene's new dimensional points) excel at finding "nearest neighbors" to a given query point ... I think we should add this to Lucene's sandbox as:

        public static Document nearest(IndexReader r, String field, double lat, double lon) throws IOException
      

      I only implemented the 1 nearest neighbor for starters ... I think we can easily generalize this in the future to K nearest.

      It could also be generalized to more than 2 dimensions, but for now I'm making the class package private in sandbox for just the geo2d (lat/lon) use case.

      I don't think this should go into 6.0.0, but should go into 6.1: it's a new feature, and we need to wrap up and ship 6.0.0 already

      1. LUCENE-7069.patch
        32 kB
        Michael McCandless
      2. LUCENE-7069.patch
        32 kB
        Michael McCandless
      3. LUCENE-7069.patch
        17 kB
        Michael McCandless
      4. LUCENE-7069.patch
        16 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Initial patch, but it fails its randomized test for some seeds. It's an interesting failure: it happens when the query point is fully outside the shrink wrap bounding box of all indexed points ... I need to mull how to cleanly fix that.

        Show
        mikemccand Michael McCandless added a comment - Initial patch, but it fails its randomized test for some seeds. It's an interesting failure: it happens when the query point is fully outside the shrink wrap bounding box of all indexed points ... I need to mull how to cleanly fix that.
        Hide
        mikemccand Michael McCandless added a comment -

        Here's another iteration, fixing the "global shrink wrap" bug in a simplistic yet effective way so the test now passes beasting

        However I realized there is a nasty adversary across segments ... if the first segment indexed only points really far away from the query point, and then the next segment has tons of points near the query point ... I'll work out a fix ...

        Show
        mikemccand Michael McCandless added a comment - Here's another iteration, fixing the "global shrink wrap" bug in a simplistic yet effective way so the test now passes beasting However I realized there is a nasty adversary across segments ... if the first segment indexed only points really far away from the query point, and then the next segment has tons of points near the query point ... I'll work out a fix ...
        Hide
        mikemccand Michael McCandless added a comment -

        Here's another patch ... I think it's nearly ready.

        I added "nearest neighbor" to the luceneutil London, UK geo benchmark, and getting top 10 nearest hits from the center of each of the 225 boxes that benchmark tests is extremely fast ... less than 0.5 msec for each call ... KD trees are very good for this

        This adds LatLonPoint.nearest, and it now takes an int topN (vs. single nearest point in previous patches).

        I think it solves the adversary issues of the previous patches, at least the ones that are solvable. Some are not! Imagine the index only contains many points indexed from a circle, and then you ask for nearest from its center ... no matter your algorithm, that will amount to a linear scan.

        There's one wrinkle though: the search works directly with BKDReader, because it's a best first search across all segments, picking which cell (across all segments) is the most "promising" to descend into next. This means it requires that you use Lucene60PointsFormat (the default codec), and it gets the underlying BKDReader from that. I think making this work only through the codec APIs is ... tricky.

        There are further optimizations we could do like using haversinSortKey, or maybe fixing the approxBestDistance to be a true best distance (I need heavy geo math help for that!), and using Lucene's priority queue not the JDK's, but these all can come later.

        Show
        mikemccand Michael McCandless added a comment - Here's another patch ... I think it's nearly ready. I added "nearest neighbor" to the luceneutil London, UK geo benchmark, and getting top 10 nearest hits from the center of each of the 225 boxes that benchmark tests is extremely fast ... less than 0.5 msec for each call ... KD trees are very good for this This adds LatLonPoint.nearest , and it now takes an int topN (vs. single nearest point in previous patches). I think it solves the adversary issues of the previous patches, at least the ones that are solvable. Some are not! Imagine the index only contains many points indexed from a circle, and then you ask for nearest from its center ... no matter your algorithm, that will amount to a linear scan. There's one wrinkle though: the search works directly with BKDReader , because it's a best first search across all segments, picking which cell (across all segments) is the most "promising" to descend into next. This means it requires that you use Lucene60PointsFormat (the default codec), and it gets the underlying BKDReader from that. I think making this work only through the codec APIs is ... tricky. There are further optimizations we could do like using haversinSortKey , or maybe fixing the approxBestDistance to be a true best distance (I need heavy geo math help for that!), and using Lucene's priority queue not the JDK's, but these all can come later.
        Hide
        rcmuir Robert Muir added a comment -

        Do we really need NearestHit? I dont think we should add this class.

        We created the FieldDoc nightmare, we should live with it? Distance sort already has to live with its crappiness.

        We should just fix that separately so its less horrible, but i dont think inventing extra classes for different use cases is a better solution?

        Show
        rcmuir Robert Muir added a comment - Do we really need NearestHit? I dont think we should add this class. We created the FieldDoc nightmare, we should live with it? Distance sort already has to live with its crappiness. We should just fix that separately so its less horrible, but i dont think inventing extra classes for different use cases is a better solution?
        Hide
        rcmuir Robert Muir added a comment -

        Otherwise, lets get it in. I just have those concerns about the public part. The rest of it we can improve later.

        its the sandbox.

        Show
        rcmuir Robert Muir added a comment - Otherwise, lets get it in. I just have those concerns about the public part. The rest of it we can improve later. its the sandbox.
        Hide
        rcmuir Robert Muir added a comment -

        Instead of:

        NearestHit[] nearest(IndexReader r, String field, double latitude, double longitude, int topN)
        

        If we have:

        TopDocs nearest(IndexSearcher searcher, String field, double latitude, double longitude, int n)
        

        Then it can work even if BKD is not present too. We could do a slower sort. I am not sure we should but it makes the api more consistent with others.

        Show
        rcmuir Robert Muir added a comment - Instead of: NearestHit[] nearest(IndexReader r, String field, double latitude, double longitude, int topN) If we have: TopDocs nearest(IndexSearcher searcher, String field, double latitude, double longitude, int n) Then it can work even if BKD is not present too. We could do a slower sort. I am not sure we should but it makes the api more consistent with others.
        Hide
        mikemccand Michael McCandless added a comment -

        Good ideas Robert Muir, here's a new patch cutting over to TopFieldDocs.

        Show
        mikemccand Michael McCandless added a comment - Good ideas Robert Muir , here's a new patch cutting over to TopFieldDocs .
        Hide
        rcmuir Robert Muir added a comment -

        Thanks for fixing the api!

        One nit, i did suggest int n for a reason, that is to be consistent with IndexSearcher.search(). The patch currently has topN. I see no reason why they should be different Parameter names are important esp when there are no @param to explain them (this can be fixed separately, lets just get the thing in).

        Show
        rcmuir Robert Muir added a comment - Thanks for fixing the api! One nit, i did suggest int n for a reason, that is to be consistent with IndexSearcher.search(). The patch currently has topN. I see no reason why they should be different Parameter names are important esp when there are no @param to explain them (this can be fixed separately, lets just get the thing in).
        Hide
        mikemccand Michael McCandless added a comment -

        Woops sorry I agree we should name the params consistently ... I'll change to n before pushing.

        Show
        mikemccand Michael McCandless added a comment - Woops sorry I agree we should name the params consistently ... I'll change to n before pushing.
        Hide
        dsmiley David Smiley added a comment -

        This is awesome! Nice work Mike. I very much like Rob's suggestion of using TopDocs instead of creating NearestHit.

        It would be even better, a future issue, to further filter the topN selection by those docs in a Bits doc set. Of course that complicates the task quite a bit but it's much more useful to use cases that already have other filters in-play. I've got some code for that against an RPT index somewhere.

        Show
        dsmiley David Smiley added a comment - This is awesome! Nice work Mike. I very much like Rob's suggestion of using TopDocs instead of creating NearestHit. It would be even better, a future issue, to further filter the topN selection by those docs in a Bits doc set. Of course that complicates the task quite a bit but it's much more useful to use cases that already have other filters in-play. I've got some code for that against an RPT index somewhere.
        Hide
        rcmuir Robert Muir added a comment -

        It would be even better, a future issue, to further filter the topN selection by those docs in a Bits doc set. Of course that complicates the task quite a bit but it's much more useful to use cases that already have other filters in-play. I've got some code for that against an RPT index somewhere.

        I don't think its about complication: i think this would not scale? This easily degrades to linear scan.

        Show
        rcmuir Robert Muir added a comment - It would be even better, a future issue, to further filter the topN selection by those docs in a Bits doc set. Of course that complicates the task quite a bit but it's much more useful to use cases that already have other filters in-play. I've got some code for that against an RPT index somewhere. I don't think its about complication: i think this would not scale? This easily degrades to linear scan.
        Hide
        rcmuir Robert Muir added a comment -

        And if you really want to have complex logic like filters, you can always just do a search and sort by distance.

        Show
        rcmuir Robert Muir added a comment - And if you really want to have complex logic like filters, you can always just do a search and sort by distance.
        Hide
        dsmiley David Smiley added a comment -

        If the Bits (doc set) filter is sparsely populated, then I agree – you're better off just doing a typical sort. If I recall (it's been years now), my implementation did that. But if it's heavily populated, then I picked a fairly large grid cell (possibly including some adjacent cells if the query point was close to an edge) and did a distance sort with that rectangle a filter. If I not only met the topN budget but the furthest distance in that topN was close enough such that, if I inscribed a circle from the query point to that last point, that the circle would be completely within the cells making the filter, then I'm done. If unlucky, I had to iterate up to larger cells/filters or give up and fall back to a standard distance sort. I recall this made a world of difference to an app that previously always distance sorted. To be clear, to the extent that this used the grid cells, it was only as an approximate filter. the points were always in DocValues and the distance was computed from that.

        Show
        dsmiley David Smiley added a comment - If the Bits (doc set) filter is sparsely populated, then I agree – you're better off just doing a typical sort. If I recall (it's been years now), my implementation did that. But if it's heavily populated, then I picked a fairly large grid cell (possibly including some adjacent cells if the query point was close to an edge) and did a distance sort with that rectangle a filter. If I not only met the topN budget but the furthest distance in that topN was close enough such that, if I inscribed a circle from the query point to that last point, that the circle would be completely within the cells making the filter, then I'm done. If unlucky, I had to iterate up to larger cells/filters or give up and fall back to a standard distance sort. I recall this made a world of difference to an app that previously always distance sorted. To be clear, to the extent that this used the grid cells, it was only as an approximate filter. the points were always in DocValues and the distance was computed from that.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f6c7fc7a26584c92a81b3a6cbca179ca232808a9 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6c7fc7 ]

        LUCENE-7069: add LatLonPoint.nearest to find N nearest points

        Show
        jira-bot ASF subversion and git services added a comment - Commit f6c7fc7a26584c92a81b3a6cbca179ca232808a9 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6c7fc7 ] LUCENE-7069 : add LatLonPoint.nearest to find N nearest points
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5bbb5e7748f1191d1a507adc7e69bb25178a46dd in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5bbb5e7 ]

        LUCENE-7069: add LatLonPoint.nearest to find N nearest points

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5bbb5e7748f1191d1a507adc7e69bb25178a46dd in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5bbb5e7 ] LUCENE-7069 : add LatLonPoint.nearest to find N nearest points
        Hide
        rcmuir Robert Muir added a comment -

        like i say, worst case linear scan

        I just don't think we should add trappy/slow operations when there are still so many traps in basic functionality. And using a distance sort is fine for use cases like that (where unlike deleted docs, filtering is not "bound" by merging or some other process, its arbitrary). It is already designed to handle that situation, prevent any trappy/crazy worst-cases, etc.

        Show
        rcmuir Robert Muir added a comment - like i say, worst case linear scan I just don't think we should add trappy/slow operations when there are still so many traps in basic functionality. And using a distance sort is fine for use cases like that (where unlike deleted docs, filtering is not "bound" by merging or some other process, its arbitrary). It is already designed to handle that situation, prevent any trappy/crazy worst-cases, etc.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5238de937a84c4de387f0036830811cb3b7d734f in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5238de9 ]

        LUCENE-7069: woops, approxBestDistance was way too approximate when the point was inside the cell

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5238de937a84c4de387f0036830811cb3b7d734f in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5238de9 ] LUCENE-7069 : woops, approxBestDistance was way too approximate when the point was inside the cell
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 41da63eee3c54ff8f1c9de09d4c03ec1ffb83f3f in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=41da63e ]

        LUCENE-7069: woops, approxBestDistance was way too approximate when the point was inside the cell

        Show
        jira-bot ASF subversion and git services added a comment - Commit 41da63eee3c54ff8f1c9de09d4c03ec1ffb83f3f in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=41da63e ] LUCENE-7069 : woops, approxBestDistance was way too approximate when the point was inside the cell
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f6c7fc7a26584c92a81b3a6cbca179ca232808a9 in lucene-solr's branch refs/heads/jira/SOLR-8908 from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6c7fc7 ]

        LUCENE-7069: add LatLonPoint.nearest to find N nearest points

        Show
        jira-bot ASF subversion and git services added a comment - Commit f6c7fc7a26584c92a81b3a6cbca179ca232808a9 in lucene-solr's branch refs/heads/jira/ SOLR-8908 from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6c7fc7 ] LUCENE-7069 : add LatLonPoint.nearest to find N nearest points
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5238de937a84c4de387f0036830811cb3b7d734f in lucene-solr's branch refs/heads/jira/SOLR-8908 from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5238de9 ]

        LUCENE-7069: woops, approxBestDistance was way too approximate when the point was inside the cell

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5238de937a84c4de387f0036830811cb3b7d734f in lucene-solr's branch refs/heads/jira/ SOLR-8908 from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5238de9 ] LUCENE-7069 : woops, approxBestDistance was way too approximate when the point was inside the cell
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 68e9efc7ac0b00ec8bcc03b52ade73bb3b74d707 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=68e9efc ]

        LUCENE-7069: can't wrap with exotic readers when making IndexSearcher

        Show
        jira-bot ASF subversion and git services added a comment - Commit 68e9efc7ac0b00ec8bcc03b52ade73bb3b74d707 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=68e9efc ] LUCENE-7069 : can't wrap with exotic readers when making IndexSearcher
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit e2c451e6f2c07b128968a95682c626aa61d3a461 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e2c451e ]

        LUCENE-7069: can't wrap with exotic readers when making IndexSearcher

        Show
        jira-bot ASF subversion and git services added a comment - Commit e2c451e6f2c07b128968a95682c626aa61d3a461 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e2c451e ] LUCENE-7069 : can't wrap with exotic readers when making IndexSearcher
        Hide
        hossman Hoss Man added a comment - - edited

        Manually correcting fixVersion per Step #S5 of LUCENE-7271

        Show
        hossman Hoss Man added a comment - - edited Manually correcting fixVersion per Step #S5 of LUCENE-7271

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development