Lucene - Core
  1. Lucene - Core
  2. LUCENE-6547

Add dateline crossing support to GeoPointInBBox and GeoPointDistance Queries

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current GeoPointInBBoxQuery only supports bounding boxes that are within the standard -180:180 longitudinal bounds. While its perfectly fine to require users to split dateline crossing bounding boxes in two, GeoPointDistanceQuery should support distance queries that cross the dateline. Since morton encoding doesn't support unwinding this issue will add dateline crossing to GeoPointInBBoxQuery and GeoPointDistanceQuery classes.

      1. LUCENE-6547.patch
        93 kB
        Nicholas Knize
      2. LUCENE-6547.patch
        97 kB
        Nicholas Knize
      3. LUCENE-6547.patch
        85 kB
        Michael McCandless
      4. LUCENE-6547.patch
        75 kB
        Nicholas Knize
      5. LUCENE-6547.patch
        69 kB
        Nicholas Knize
      6. LUCENE-6547.patch
        69 kB
        Nicholas Knize
      7. LUCENE-6547.patch
        55 kB
        Nicholas Knize
      8. LUCENE-6547.patch
        56 kB
        Nicholas Knize
      9. LUCENE-6547.patch
        51 kB
        Nicholas Knize

        Activity

        Hide
        Nicholas Knize added a comment -

        Initial patch (diff from trunk) that adds dateline crossing support to GeoPointInBBoxQuery. Randomized testing not yet updated to beast test this feature and still need to add dateline crossing support to GeoPointDistanceQuery.

        Show
        Nicholas Knize added a comment - Initial patch (diff from trunk) that adds dateline crossing support to GeoPointInBBoxQuery. Randomized testing not yet updated to beast test this feature and still need to add dateline crossing support to GeoPointDistanceQuery.
        Hide
        Nicholas Knize added a comment -

        This turned out to not be as difficult as originally thought. This update is ready for review and should probably preempt LUCENE-6532. Patch includes:

        • Dateline crossing support for GeoPointInBBoxQuery and GeoPointDistanceQuery (GeoPointDistanceQuery uses ECEF Cartesian System for secant checks)
        • Updated unit test to include randomized testing of GeoPointDistanceQuery
        • Split GeoUtils into separate GeoDistanceUtils, GeoProjectionUtils, and GeoUtils class files
        • Successfully Beast tested several times using 100 iterations
        • Updated javadocs to include known limitations and performance implications for dateline crossing queries.

        All future performance improvements can be done in separate enhancement issues.

        Show
        Nicholas Knize added a comment - This turned out to not be as difficult as originally thought. This update is ready for review and should probably preempt LUCENE-6532 . Patch includes: Dateline crossing support for GeoPointInBBoxQuery and GeoPointDistanceQuery (GeoPointDistanceQuery uses ECEF Cartesian System for secant checks) Updated unit test to include randomized testing of GeoPointDistanceQuery Split GeoUtils into separate GeoDistanceUtils, GeoProjectionUtils, and GeoUtils class files Successfully Beast tested several times using 100 iterations Updated javadocs to include known limitations and performance implications for dateline crossing queries. All future performance improvements can be done in separate enhancement issues.
        Hide
        Nicholas Knize added a comment -

        Removed unnecessary haversine computations from rectCrossesCircle. This is based on comments/discussion in LUCENE-6532.

        Show
        Nicholas Knize added a comment - Removed unnecessary haversine computations from rectCrossesCircle. This is based on comments/discussion in LUCENE-6532 .
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize!

        Looks like we no longer get exc if we pass invalid lat or lon to
        GeoPointInBBoxQuery? Can we get that back (e.g. if I pass
        lat=1000.0).

        Can we revert the changes to GeoPointInPolygonQuery now (since
        GeoPointDistanceQuery now extends GeoPointInBBoxQuery instead)?

        Seems like GeoPointTermsEnum.min/maxLat could still be final?

        Hmm in GeoPointTermsEnum, for the dateline split case, I'm worried how
        we overwrite minLon/maxLon, because in the accept method we then check
        minLon/maxLon but at that point it will be from the 2nd bbox? Isn't
        this possibly wrong / buggy (why is test not failing)?

        Another way to handle this "crosses dateline" would be to use the
        rewrite method? Ie, rewrite to BQ, SHOULD'ing the 2 queries? It
        could be simpler that way? Then the query itself, terms enum, etc.,
        don't have to deal with this case.

        Do we still need circleToPoly? It's unused now right? Hmm the
        javadocs for GeoPointDistanceQuery are stale?

        Can we fix the random test to more strongly separate out the cases its
        testing? E.g. make a separate "tolerateRadiusIgnorance" and
        "tolerateBBoxIgnorance" methods, make separate code for verifying the
        radius vs bbox/poly cases, etc.? We can e.g. factor out a
        "getMatchingDocs" method that returns the FixedBitSet.

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ! Looks like we no longer get exc if we pass invalid lat or lon to GeoPointInBBoxQuery? Can we get that back (e.g. if I pass lat=1000.0). Can we revert the changes to GeoPointInPolygonQuery now (since GeoPointDistanceQuery now extends GeoPointInBBoxQuery instead)? Seems like GeoPointTermsEnum.min/maxLat could still be final? Hmm in GeoPointTermsEnum, for the dateline split case, I'm worried how we overwrite minLon/maxLon, because in the accept method we then check minLon/maxLon but at that point it will be from the 2nd bbox? Isn't this possibly wrong / buggy (why is test not failing)? Another way to handle this "crosses dateline" would be to use the rewrite method? Ie, rewrite to BQ, SHOULD'ing the 2 queries? It could be simpler that way? Then the query itself, terms enum, etc., don't have to deal with this case. Do we still need circleToPoly? It's unused now right? Hmm the javadocs for GeoPointDistanceQuery are stale? Can we fix the random test to more strongly separate out the cases its testing? E.g. make a separate "tolerateRadiusIgnorance" and "tolerateBBoxIgnorance" methods, make separate code for verifying the radius vs bbox/poly cases, etc.? We can e.g. factor out a "getMatchingDocs" method that returns the FixedBitSet.
        Hide
        Michael McCandless added a comment -

        Hmm does this also fix polygon queries that cross the dateline?

        Show
        Michael McCandless added a comment - Hmm does this also fix polygon queries that cross the dateline?
        Hide
        Michael McCandless added a comment -

        I opened LUCENE-6560 for BKD "point in bbox" query, showing how we could use the rewrite method so the "low level" query doesn't have to see 2 rects.

        Show
        Michael McCandless added a comment - I opened LUCENE-6560 for BKD "point in bbox" query, showing how we could use the rewrite method so the "low level" query doesn't have to see 2 rects.
        Hide
        Nicholas Knize added a comment - - edited

        Updated patch to include review feedback from Michael McCandless:

        • Added back validation check for lat/lons passed to GeoPointInBBoxQuery.
        • Reverted changes made to GeoPointInPolygonQuery.
        • Changed GeoPointTermsEnum.min/maxLat back to final
        • Changed dateline split logic for GeoPointInBBox and GeoPointDistance query to rewrite using BQ SHOULD.
        • Keeping circleToPoly utility method, for now, for testing PointInPolygon query performance (this can maybe go away later?)
        • Updated GeoPointDistanceQuery javadocs

        Can we fix the random test to more strongly separate out the cases its testing?

        Will get this in either the next iteration, or maybe a separate 'improve GeoPointField Testing' issue?

        Show
        Nicholas Knize added a comment - - edited Updated patch to include review feedback from Michael McCandless : Added back validation check for lat/lons passed to GeoPointInBBoxQuery. Reverted changes made to GeoPointInPolygonQuery. Changed GeoPointTermsEnum.min/maxLat back to final Changed dateline split logic for GeoPointInBBox and GeoPointDistance query to rewrite using BQ SHOULD. Keeping circleToPoly utility method, for now, for testing PointInPolygon query performance (this can maybe go away later?) Updated GeoPointDistanceQuery javadocs Can we fix the random test to more strongly separate out the cases its testing? Will get this in either the next iteration, or maybe a separate 'improve GeoPointField Testing' issue?
        Hide
        Nicholas Knize added a comment -

        Updated patch to fix a bounding box range bug.

        Show
        Nicholas Knize added a comment - Updated patch to fix a bounding box range bug.
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize!

        Added back validation check for lat/lons passed to GeoPointInBBoxQuery.

        Thanks, but is this really right?

        I can see it being considered "OK" to pass "invalid" lat (goes over a
        pole) or lon (when it crosses the date line) with the expectation that
        the API normlalizes it? Maybe?

        E.g. it makes it easier for client code to e.g. do bbox search for +/-
        0.5 lat/lon from a given center w/o having to special case the
        dateline itself... or a distance query too.

        In any event, whatever we decide is the "right" behavior, can you add
        "testInvalidXXX" test cases that show we are in fact throwing excs for
        the invalid cases?

        Will get this in either the next iteration, or maybe a separate 'improve GeoPointField Testing' issue?

        OK we can do it separately, but I think it's pretty important to keep
        testRandom approachable: testRandoms have a tendency to become
        disastrous over time (have a look at TestGrouping.testRandom if you
        don't believe me!).

        It seems like there's some dup code, e.g. computeBBox is in both
        GeoPointDistanceQuery and GeoPointDistanceQueryImpl?

        Can you extend testRandom to test date-line wrapping? I did this in
        TestBKDTree so I think for now we can just copy those changes over and
        refactor later?

        I'll look more at the patch ...

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ! Added back validation check for lat/lons passed to GeoPointInBBoxQuery. Thanks, but is this really right? I can see it being considered "OK" to pass "invalid" lat (goes over a pole) or lon (when it crosses the date line) with the expectation that the API normlalizes it? Maybe? E.g. it makes it easier for client code to e.g. do bbox search for +/- 0.5 lat/lon from a given center w/o having to special case the dateline itself... or a distance query too. In any event, whatever we decide is the "right" behavior, can you add "testInvalidXXX" test cases that show we are in fact throwing excs for the invalid cases? Will get this in either the next iteration, or maybe a separate 'improve GeoPointField Testing' issue? OK we can do it separately, but I think it's pretty important to keep testRandom approachable: testRandoms have a tendency to become disastrous over time (have a look at TestGrouping.testRandom if you don't believe me!). It seems like there's some dup code, e.g. computeBBox is in both GeoPointDistanceQuery and GeoPointDistanceQueryImpl? Can you extend testRandom to test date-line wrapping? I did this in TestBKDTree so I think for now we can just copy those changes over and refactor later? I'll look more at the patch ...
        Hide
        Nicholas Knize added a comment -

        Awesome feedback Mike!

        I can see it being considered "OK" to pass "invalid" lat (goes over a pole) or lon (when it crosses the date line) with the expectation that the API normlalizes it?

        That's only for PointDistance searching. If the radius crosses (after adding delta lat/lon) the coordinates are normalized and the Spherical intersection is used. There's a TODO in there to investigate whether its worth adding the math to "split" the circle into semi-circles at dateline and use BQ. I don't think it is.

        ...can you add "testInvalidXXX" test cases that show we are in fact throwing excs for the invalid cases?

        ++ good idea

        It seems like there's some dup code, e.g. computeBBox is in both GeoPointDistanceQuery and GeoPointDistanceQueryImpl?

        There is. I noticed this as a side effect from the refactor. I've removed it and will have that in the next patch (which includes switching PointDistance radius over to meters instead of Km).

        I think it's pretty important to keep testRandom approachable: testRandoms have a tendency to become disastrous over time.

        +++ agree. I can work it in this issue. No need to split into separate issues since this is all sandboxed anyway?

        Can you extend testRandom to test date-line wrapping?

        Yes. This is important. I've got it pretty thoroughly tested in ES, but some beasting is definitely needed.

        Show
        Nicholas Knize added a comment - Awesome feedback Mike! I can see it being considered "OK" to pass "invalid" lat (goes over a pole) or lon (when it crosses the date line) with the expectation that the API normlalizes it? That's only for PointDistance searching. If the radius crosses (after adding delta lat/lon) the coordinates are normalized and the Spherical intersection is used. There's a TODO in there to investigate whether its worth adding the math to "split" the circle into semi-circles at dateline and use BQ. I don't think it is. ...can you add "testInvalidXXX" test cases that show we are in fact throwing excs for the invalid cases? ++ good idea It seems like there's some dup code, e.g. computeBBox is in both GeoPointDistanceQuery and GeoPointDistanceQueryImpl? There is. I noticed this as a side effect from the refactor. I've removed it and will have that in the next patch (which includes switching PointDistance radius over to meters instead of Km). I think it's pretty important to keep testRandom approachable: testRandoms have a tendency to become disastrous over time. +++ agree. I can work it in this issue. No need to split into separate issues since this is all sandboxed anyway? Can you extend testRandom to test date-line wrapping? Yes. This is important. I've got it pretty thoroughly tested in ES, but some beasting is definitely needed.
        Hide
        Michael McCandless added a comment -

        That's only for PointDistance searching.

        Ahh, OK; I wasn't sure when we should tolerate out-of-bounds lat/lon (and do the "wrapping" ourselves) and when we shouldn't.

        No need to split into separate issues since this is all sandboxed anyway?

        OK, thanks. I can help here too (make testRandom more approachable)...

        I took another look at the patch and I think besides these small test issues we should commit it for 5.3; it's already plenty large enough: it adds a new query (GeoPointDistance), it fixes queries to handle dateline wrapping, it refactors the GeoUtils APIs into three sources.

        The last patch doesn't compile now because of the BQ Builder API changes...

        Show
        Michael McCandless added a comment - That's only for PointDistance searching. Ahh, OK; I wasn't sure when we should tolerate out-of-bounds lat/lon (and do the "wrapping" ourselves) and when we shouldn't. No need to split into separate issues since this is all sandboxed anyway? OK, thanks. I can help here too (make testRandom more approachable)... I took another look at the patch and I think besides these small test issues we should commit it for 5.3; it's already plenty large enough: it adds a new query (GeoPointDistance), it fixes queries to handle dateline wrapping, it refactors the GeoUtils APIs into three sources. The last patch doesn't compile now because of the BQ Builder API changes...
        Hide
        Nicholas Knize added a comment -

        Updated patch to include the following:

        • Fixed rewrite methods to use new BQ Builder API changes
        • Removed computeBBox duplicate code
        • Updated randomized test to include larger ranges (this should include the occasional dateline wrapping but I'll add some randomization for explicit dateline wrap testing)
        • Split tolerateIgnorance into two explicit methods for PointDistance and BBox
        • Added geohash string utility methods to GeoUtils to support use-cases using geohash string representations (converts to/from geohash)
        • Added getter utility methods to all three query classes to support third-party refactoring to new GeoPointField type (e.g., elasticsearch)
        Show
        Nicholas Knize added a comment - Updated patch to include the following: Fixed rewrite methods to use new BQ Builder API changes Removed computeBBox duplicate code Updated randomized test to include larger ranges (this should include the occasional dateline wrapping but I'll add some randomization for explicit dateline wrap testing) Split tolerateIgnorance into two explicit methods for PointDistance and BBox Added geohash string utility methods to GeoUtils to support use-cases using geohash string representations (converts to/from geohash) Added getter utility methods to all three query classes to support third-party refactoring to new GeoPointField type (e.g., elasticsearch)
        Hide
        Michael McCandless added a comment -

        Added geohash string utility methods to GeoUtils to support use-cases using geohash string representations (converts to/from geohash)

        Cool! But can we add tests for these new methods?

        Updated randomized test to include larger ranges (this should include the occasional dateline wrapping but I'll add some randomization for explicit dateline wrap testing)

        Wait: these queries can't handle large ranges right? I.e. the query
        (and this test) can now take way too long to run? This is why we
        had to restrict the tested range before...

        Separately, I don't think this means testRandom will sometimes test
        the dateline ... it still has checks to swap lon0/lon1 when lon0 >
        lon1.

        I think we can keep the "tiny testing area", but just randomly locate
        it near the dateline some of the time, and then do the min/max swap
        "in reverse"?

        For now I put back the small testing area ... testing was sometimes
        taking forever ...

        I'll try to tweak the patch to make the randomized test more
        understandable.

        Show
        Michael McCandless added a comment - Added geohash string utility methods to GeoUtils to support use-cases using geohash string representations (converts to/from geohash) Cool! But can we add tests for these new methods? Updated randomized test to include larger ranges (this should include the occasional dateline wrapping but I'll add some randomization for explicit dateline wrap testing) Wait: these queries can't handle large ranges right? I.e. the query (and this test) can now take way too long to run? This is why we had to restrict the tested range before... Separately, I don't think this means testRandom will sometimes test the dateline ... it still has checks to swap lon0/lon1 when lon0 > lon1. I think we can keep the "tiny testing area", but just randomly locate it near the dateline some of the time, and then do the min/max swap "in reverse"? For now I put back the small testing area ... testing was sometimes taking forever ... I'll try to tweak the patch to make the randomized test more understandable.
        Hide
        Michael McCandless added a comment -

        Here's a new patch, starting from Nicholas Knize's last patch, and trying to
        make the verify method more approachable (I really don't want this to
        turn into another TestGrouping.testRandom!)

        In doing this I think I found a couple test bugs:

        • The polygon query test was supposed to only pass the "too big"
          bbox as the bbox args, and the "original" bbox as the polygon
        • tolerateRadiusErrors should not need lat1/lon1? It should only
          need a single point, not two points? How can these
          randomly generated values matter? (The query doesn't get them).
          I left a nocommit in this patch about that ...

        Can we remove the separate bbox from GeoPointInPolygonQuery's ctor?
        This seems very trappy (what if you pass the wrong bbox?), and any
        perf gains should be negligible in the limit (of bigger and bigger
        indices)?

        Show
        Michael McCandless added a comment - Here's a new patch, starting from Nicholas Knize 's last patch, and trying to make the verify method more approachable (I really don't want this to turn into another TestGrouping.testRandom!) In doing this I think I found a couple test bugs: The polygon query test was supposed to only pass the "too big" bbox as the bbox args, and the "original" bbox as the polygon tolerateRadiusErrors should not need lat1/lon1? It should only need a single point, not two points? How can these randomly generated values matter? (The query doesn't get them). I left a nocommit in this patch about that ... Can we remove the separate bbox from GeoPointInPolygonQuery's ctor? This seems very trappy (what if you pass the wrong bbox?), and any perf gains should be negligible in the limit (of bigger and bigger indices)?
        Hide
        Nicholas Knize added a comment -

        Updated patch continuing from Michael McCandless last patch. Updates:

        • Addresses LUCENE-6562 to share ranges across segments. To guard against OOM exceptions the ranges needed to be purged once all segments have been visited. See GeoPointTermQuery line 87 for this check.
        • Removed GeoHash string methods from GeoUtils. I'm going to move this into a separate issue for better tracking and reviewing.
        • Updated javadocs

        Wait: these queries can't handle large ranges right?

        Large ranges were previously discouraged because it takes on the order of ~2secs to correctly compute the ranges for large search areas. Since this occurred for every segment large search over large data was quite time consuming. Reusing ranges across segments has brought this down to the amount of time it takes to compute the ranges.

        Can we remove the separate bbox from GeoPointInPolygonQuery's ctor?

        Yes! I believe so. The intent was to improve performance for detailed polygons (those with many vertices) by providing precomputed or cached bounding boxes rather than having the query recompute. Might be worth benchmarking to be certain whether we want to take away this utility. Sure it can be used maliciously, maybe good documentation can raise awareness?

        Show
        Nicholas Knize added a comment - Updated patch continuing from Michael McCandless last patch. Updates: Addresses LUCENE-6562 to share ranges across segments. To guard against OOM exceptions the ranges needed to be purged once all segments have been visited. See GeoPointTermQuery line 87 for this check. Removed GeoHash string methods from GeoUtils. I'm going to move this into a separate issue for better tracking and reviewing. Updated javadocs Wait: these queries can't handle large ranges right? Large ranges were previously discouraged because it takes on the order of ~2secs to correctly compute the ranges for large search areas. Since this occurred for every segment large search over large data was quite time consuming. Reusing ranges across segments has brought this down to the amount of time it takes to compute the ranges. Can we remove the separate bbox from GeoPointInPolygonQuery's ctor? Yes! I believe so. The intent was to improve performance for detailed polygons (those with many vertices) by providing precomputed or cached bounding boxes rather than having the query recompute. Might be worth benchmarking to be certain whether we want to take away this utility. Sure it can be used maliciously, maybe good documentation can raise awareness?
        Hide
        Michael McCandless added a comment -

        Thanks Nick!

        Addresses LUCENE-6562 to share ranges across segments. To guard against OOM exceptions the ranges needed to be purged once all segments have been visited. See GeoPointTermQuery line 87 for this check.

        OK, it looks like this patch also absorbed the latest patch on
        LUCENE-6562? It's clever how you clear ranges after the last segment!

        Unfortunately I think it's unsafe ... e.g. if IndexSearcher is using
        an executor, multiple threads can call getTermsEnum I think ... and
        also, in the cached case, on opening a new NRT reader, I think the
        query would only see e.g. the 1 or 2 new segments.

        Net/net I'm worried about the complexity of getting this working
        correctly; I think for now it's best/simpler to just re-compute the ranges
        per segment?

        We should anyway try to keep the issue separate from this one...

        Can we remove the separate bbox from GeoPointInPolygonQuery's ctor?

        Yes! I believe so. The intent was to improve performance for detailed polygons (those with many vertices) by providing precomputed or cached bounding boxes rather than having the query recompute. Might be worth benchmarking to be certain whether we want to take away this utility. Sure it can be used maliciously, maybe good documentation can raise awareness?

        But that cost will be miniscule compared to the overall query execution time right?

        E.g. to filter just a single "borderline" hit, we also must walk the full polygon? Multiplied by the number of borderline hits ...

        It just strikes me as such a trivial optimization that it's not worth polluting the API over ... APIs are hard enough as it is And e.g. I've already screwed it up at least once when I passed an invalid bbox before (when working on the test).

        Large ranges were previously discouraged because it takes on the order of ~2secs to correctly compute the ranges for large search areas. Since this occurred for every segment large search over large data was quite time consuming. Reusing ranges across segments has brought this down to the amount of time it takes to compute the ranges.

        OK that's a good improvement but I think large ranges are still very dangerous for this query? E.g. the query will suddenly take ... 10s of MBs heap to hold its ranges as well? Maybe (separately!) we could improve how we store the ranges in RAM, e.g. use two PrefixCodedTerms or something.

        I think, like AutomatonQuery, we need to place a default limit on the number of ranges? User can increase it if they know what they are doing...

        I put this nocommit in my last patch:

        // nocommit why on earth are we needing to pass lat1/lon1 here the query doesn't get these...

        Any idea how to fix? maxLat/maxLon cannot possibly matter here: they are randomly generated and not used by the query, yet they are used by radiusQueryCanBeWrong. This must be a test bug?

        Show
        Michael McCandless added a comment - Thanks Nick! Addresses LUCENE-6562 to share ranges across segments. To guard against OOM exceptions the ranges needed to be purged once all segments have been visited. See GeoPointTermQuery line 87 for this check. OK, it looks like this patch also absorbed the latest patch on LUCENE-6562 ? It's clever how you clear ranges after the last segment! Unfortunately I think it's unsafe ... e.g. if IndexSearcher is using an executor, multiple threads can call getTermsEnum I think ... and also, in the cached case, on opening a new NRT reader, I think the query would only see e.g. the 1 or 2 new segments. Net/net I'm worried about the complexity of getting this working correctly; I think for now it's best/simpler to just re-compute the ranges per segment? We should anyway try to keep the issue separate from this one... Can we remove the separate bbox from GeoPointInPolygonQuery's ctor? Yes! I believe so. The intent was to improve performance for detailed polygons (those with many vertices) by providing precomputed or cached bounding boxes rather than having the query recompute. Might be worth benchmarking to be certain whether we want to take away this utility. Sure it can be used maliciously, maybe good documentation can raise awareness? But that cost will be miniscule compared to the overall query execution time right? E.g. to filter just a single "borderline" hit, we also must walk the full polygon? Multiplied by the number of borderline hits ... It just strikes me as such a trivial optimization that it's not worth polluting the API over ... APIs are hard enough as it is And e.g. I've already screwed it up at least once when I passed an invalid bbox before (when working on the test). Large ranges were previously discouraged because it takes on the order of ~2secs to correctly compute the ranges for large search areas. Since this occurred for every segment large search over large data was quite time consuming. Reusing ranges across segments has brought this down to the amount of time it takes to compute the ranges. OK that's a good improvement but I think large ranges are still very dangerous for this query? E.g. the query will suddenly take ... 10s of MBs heap to hold its ranges as well? Maybe (separately!) we could improve how we store the ranges in RAM, e.g. use two PrefixCodedTerms or something. I think, like AutomatonQuery, we need to place a default limit on the number of ranges? User can increase it if they know what they are doing... I put this nocommit in my last patch: // nocommit why on earth are we needing to pass lat1/lon1 here the query doesn't get these... Any idea how to fix? maxLat/maxLon cannot possibly matter here: they are randomly generated and not used by the query, yet they are used by radiusQueryCanBeWrong. This must be a test bug?
        Hide
        Uwe Schindler added a comment -

        Unfortunately I think it's unsafe ... e.g. if IndexSearcher is using an executor, multiple threads can call getTermsEnum I think ... and also, in the cached case, on opening a new NRT reader, I think the query would only see e.g. the 1 or 2 new segments.

        Yes it is! You would need a ThreadLocal. The better approach may be to use a new RewriteMethod temporarily per Query#rewrite().

        Show
        Uwe Schindler added a comment - Unfortunately I think it's unsafe ... e.g. if IndexSearcher is using an executor, multiple threads can call getTermsEnum I think ... and also, in the cached case, on opening a new NRT reader, I think the query would only see e.g. the 1 or 2 new segments. Yes it is! You would need a ThreadLocal. The better approach may be to use a new RewriteMethod temporarily per Query#rewrite().
        Hide
        Nicholas Knize added a comment -

        Adding updated patch with following changes:

        • reverting unsafe reuse of ranges across segments (save for LUCENE-6562)
        • removed lineCrossesCircle in favor of more accurate (and dateline agnostic) lineCrossesSphere method
        • addressed buggy radiusQueryCanBeWrong
        Show
        Nicholas Knize added a comment - Adding updated patch with following changes: reverting unsafe reuse of ranges across segments (save for LUCENE-6562 ) removed lineCrossesCircle in favor of more accurate (and dateline agnostic) lineCrossesSphere method addressed buggy radiusQueryCanBeWrong
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize! Patch looks good .. it didn't apply to current trunk (I think you need to svn up?), but I resolved it, and fixed a few code style issues ({ } around single-statement ifs), removed some unused AttributeSource atts args. I'll commit soon!

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ! Patch looks good .. it didn't apply to current trunk (I think you need to svn up?), but I resolved it, and fixed a few code style issues ({ } around single-statement ifs), removed some unused AttributeSource atts args. I'll commit soon!
        Hide
        ASF subversion and git services added a comment -

        Commit 1691659 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1691659 ]

        LUCENE-6547: add GeoPointDistanceQuery and fix GeoPointInBBoxQuery to handle dateline crossing

        Show
        ASF subversion and git services added a comment - Commit 1691659 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1691659 ] LUCENE-6547 : add GeoPointDistanceQuery and fix GeoPointInBBoxQuery to handle dateline crossing
        Hide
        ASF subversion and git services added a comment -

        Commit 1691660 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1691660 ]

        LUCENE-6547: add GeoPointDistanceQuery and fix GeoPointInBBoxQuery to handle dateline crossing

        Show
        ASF subversion and git services added a comment - Commit 1691660 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691660 ] LUCENE-6547 : add GeoPointDistanceQuery and fix GeoPointInBBoxQuery to handle dateline crossing
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize!

        The test was very slow with the * 1000 random radius ... I'm not sure why? I decreased that to * 100 and dropped the iters to make it reasonable again.

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ! The test was very slow with the * 1000 random radius ... I'm not sure why? I decreased that to * 100 and dropped the iters to make it reasonable again.
        Hide
        ASF subversion and git services added a comment -

        Commit 1691701 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1691701 ]

        LUCENE-6547: side-step OOME

        Show
        ASF subversion and git services added a comment - Commit 1691701 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1691701 ] LUCENE-6547 : side-step OOME
        Hide
        ASF subversion and git services added a comment -

        Commit 1691702 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1691702 ]

        LUCENE-6547: side-step OOME

        Show
        ASF subversion and git services added a comment - Commit 1691702 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691702 ] LUCENE-6547 : side-step OOME
        Hide
        Nicholas Knize added a comment -

        The test was very slow with the * 1000 random radius ... I'm not sure why?

        There were unnecessary ranges being computed for the GeoPointDistance query. This has been fixed in the latest patch for LUCENE-6685.

        Show
        Nicholas Knize added a comment - The test was very slow with the * 1000 random radius ... I'm not sure why? There were unnecessary ranges being computed for the GeoPointDistance query. This has been fixed in the latest patch for LUCENE-6685 .
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Nicholas Knize
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development