Details
-
New Feature
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
New
Description
Current GeoPointField LUCENE-6450 computes a set of ranges along the space-filling curve that represent a provided bounding box. This determines which terms to visit in the terms dictionary and which to skip. This is suboptimal for large bounding boxes as we may end up visiting all terms (which could be quite large).
This incremental improvement is to improve GeoPointField to only visit high precision terms in boundary ranges and use the postings list for ranges that are completely within the target bounding box.
A separate improvement is to switch over to auto-prefix and build an Automaton representing the bounding box. That can be tracked in a separate issue.
Attachments
Attachments
- LUCENE-6481.patch
- 46 kB
- Michael McCandless
- LUCENE-6481.patch
- 45 kB
- Nick Knize
- LUCENE-6481.patch
- 50 kB
- Nick Knize
- LUCENE-6481.patch
- 49 kB
- Nick Knize
- LUCENE-6481.patch
- 49 kB
- Nick Knize
- LUCENE-6481.patch
- 50 kB
- Nick Knize
- LUCENE-6481.patch
- 30 kB
- Nick Knize
- LUCENE-6481.patch
- 74 kB
- Nick Knize
- LUCENE-6481_WIP.patch
- 37 kB
- Nick Knize
Issue Links
- duplicates
-
LUCENE-6450 Add simple encoded GeoPointField type to core
- Closed
Activity
Commit 1684286 from mikemccand in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1684286 ]
LUCENE-6481: add simple API to index geo lat/lon points and search by bounding box or polygon
Commit 1684285 from mikemccand in branch 'dev/trunk'
[ https://svn.apache.org/r1684285 ]
LUCENE-6481: add simple API to index geo lat/lon points and search by bounding box or polygon
For some reason a diff with the latest branch introduced a lot of duplicate changes
Oh, sorry ... the branch was supposed to make things easier. But a trunk patch is fine too.
I like testPacManPolyQuery!
We can remove that NOTE in GeoPointField javadocs right?
Else, this looks great, I'll commit soon! Thanks nknize.
For some reason a diff with the latest branch introduced a lot of duplicate changes so this is the latest patch off trunk.
This patch resolves all no commits, including:
- random polygon testing
- thread safety testing
- added tolerance to expectation check in random test
- beast tested w/ 500 iterations
Commit 1683615 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1683615 ]
LUCENE-6481: merge trunk
Commit 1682453 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1682453 ]
LUCENE-6481: Nick's latest patch: create range terms once per query, not per segment; check cell intersection against polygon not its bbox for more restrictive recursing
Updates:
- cache ranges across segments
- only add ranges that are either within or cross the boundary of the bbox or polygon
In exotic cases this latter fix drastically reduces the number of ranges added since it avoids unnecessary exterior cells that only "touch" the boundary. The downside is since the random test doesn't currently use the TOLERANCE criteria it occasionally fails due computation error at 1e-7 precision. This can be tweaked in the next patch.
Commit 1682043 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1682043 ]
LUCENE-6481: add nocommit about query slowness for large bboxes; improve javadocs
I fixed the randomized test case to only test a "smallish" global bbox (up to +/- 1.5 in lat and lon directions)... when testing the full space the test runs very very slowly, even testRandomTiny, because it can require 100s of K terms ... I'm not quite sure why but the query classes do advertise that they should be used on smallish rectangles.
Also, the test fails because of boundary cases:
T0: iter=57 id=6819 docID=6723 lat=-81.12143028547105 lon=-168.98618510239544 (bbox: lat=-81.22948018485512 TO -80.9313958433031 lon=-168.98618505380117 TO -168.77958782241174) expected true but got: false deleted?=false query=GeoPointInBBoxQuery: field=geoField: Lower Left: [-168.98618505380117,-81.22948018485512] Upper Right: [-168.77958782241174,-80.9313958433031] máj 27, 2015 2:16:52 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException WARNING: Uncaught exception in thread: Thread[T0,5,TGRP-TestGeoPointQuery] java.lang.AssertionError: wrong result at __randomizedtesting.SeedInfo.seed([4B5245DED09AE592]:0) at org.junit.Assert.fail(Assert.java:93) at org.apache.lucene.search.TestGeoPointQuery$1._run(TestGeoPointQuery.java:380) at org.apache.lucene.search.TestGeoPointQuery$1.run(TestGeoPointQuery.java:279)
It's odd because in GeoUtils#bboxContains, we do try to take the quantization into account ...
Commit 1682036 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1682036 ]
LUCENE-6481: restrict random test cases to 'smallish' bbox; switch static factory for polygon query to ctor
Commit 1682029 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1682029 ]
LUCENE-6481: validate lat/lon passed to the queries
I committed a small improvement to the test, to sometimes test on a smallish random sized bbox, and it hits this new failure:
Woops, my bad: test bug. I committed a fix...
Commit 1681943 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681943 ]
LUCENE-6481: woops
Commit 1681942 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681942 ]
LUCENE-6481: add nocommit
I committed a small improvement to the test, to sometimes test on a smallish random sized bbox, and it hits this new failure:
[junit4] Started J0 PID(15031@localhost). [junit4] Suite: org.apache.lucene.search.TestGeoPointQuery [junit4] 1> smallBBox=true [junit4] 1> T0: iter=1 id=3 docID=3 lat=-90.2957562284163 lon=-179.83102465432302 (bbox: lat=-90.84003988710097 TO -89.86198542441215 lon=-180.5339936365967 TO -179.31722107698667) expected true but got: false deleted?=false query=GeoPointInBBoxQuery: field=geoField: Lower Left: [-180.5339936365967,-90.84003988710097] Upper Right: [-179.31722107698667,-89.86198542441215] [junit4] 2> ??? 27, 2015 12:37:04 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[T0,5,TGRP-TestGeoPointQuery] [junit4] 2> java.lang.AssertionError: wrong result [junit4] 2> at __randomizedtesting.SeedInfo.seed([5]:0) [junit4] 2> at org.junit.Assert.fail(Assert.java:93) [junit4] 2> at org.apache.lucene.search.TestGeoPointQuery$1._run(TestGeoPointQuery.java:372) [junit4] 2> at org.apache.lucene.search.TestGeoPointQuery$1.run(TestGeoPointQuery.java:271) [junit4] 2> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testRandomTiny -Dtests.seed=5 -Dtests.locale=sr_ME -Dtests.timezone=Europe/Vilnius -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.09s | TestGeoPointQuery.testRandomTiny <<<
I'm not sure why testRandomTiny is hitting all these problems!
Commit 1681941 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681941 ]
LUCENE-6481: sometimes use smallish bbox for random test
Commit 1681855 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681855 ]
LUCENE-6481: go back to 1 thread, no poly query, until we track down testRandomTiny failures
Commit 1681853 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681853 ]
LUCENE-6481: add back lost GeoPointInPolygonQuery.isWithin override
I made test a bit more evil (use threads, sometimes randomly do poly query), but I hit this:
[junit4] Started J0 PID(12760@localhost). [junit4] Suite: org.apache.lucene.search.TestGeoPointQuery [junit4] OK 0.04s | TestGeoPointQuery.testBBoxQuery [junit4] 2> de maig 26, 2015 3:30:16 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[T0,5,TGRP-TestGeoPointQuery] [junit4] 2> java.lang.OutOfMemoryError: Java heap space [junit4] 2> at __randomizedtesting.SeedInfo.seed([E57ABD1CCC7F4388]:0) [junit4] 2> at java.util.LinkedList.toArray(LinkedList.java:1050) [junit4] 2> at java.util.List.sort(List.java:477) [junit4] 2> at java.util.Collections.sort(Collections.java:141) [junit4] 2> at org.apache.lucene.search.GeoPointInBBoxQuery$GeoBBoxTermsEnum.<init>(GeoPointInBBoxQuery.java:157) [junit4] 2> at org.apache.lucene.search.GeoPointInBBoxQuery.getTermsEnum(GeoPointInBBoxQuery.java:86) [junit4] 2> at org.apache.lucene.search.MultiTermQuery.getTermsEnum(MultiTermQuery.java:345) [junit4] 2> at org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:146) [junit4] 2> at org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.scorer(MultiTermQueryConstantScoreWrapper.java:212) [junit4] 2> at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:585) [junit4] 2> at org.apache.lucene.search.Weight.bulkScorer(Weight.java:137) [junit4] 2> at org.apache.lucene.search.AssertingWeight.bulkScorer(AssertingWeight.java:72) [junit4] 2> at org.apache.lucene.search.AssertingWeight.bulkScorer(AssertingWeight.java:72) [junit4] 2> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:560) [junit4] 2> at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:93) [junit4] 2> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:367) [junit4] 2> at org.apache.lucene.search.TestGeoPointQuery$1._run(TestGeoPointQuery.java:337) [junit4] 2> at org.apache.lucene.search.TestGeoPointQuery$1.run(TestGeoPointQuery.java:265) [junit4] 2> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testRandomTiny -Dtests.seed=E57ABD1CCC7F4388 -Dtests.locale=ca_ES -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 49.3s | TestGeoPointQuery.testRandomTiny <<<
It's curious because the testRandomTiny only indexes a few 10s of points...
Commit 1681849 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681849 ]
LUCENE-6481: enable threads and random poly query in test
Commit 1681842 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681842 ]
LUCENE-6481: make explicit separate methods for the lat vs lon cases
Commit 1681836 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681836 ]
LUCENE-6481: fix some ant precommit issues; improve test (use threads; randomly test poly query too), but currently disabled; add validation of the incoming lat/lon to poly query; improve query toStrings a bit
Commit 1681828 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681828 ]
LUCENE-6481: fix some precommit issues
Commit 1681827 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681827 ]
LUCENE-6481: woops
I made a branch so we can iterate a bit: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-6481
I'll make some small fixes to satisfy "ant precommit", and fold some of the improvements to the random test from LUCENE-6477
Commit 1681821 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681821 ]
LUCENE-6481: commit latest patch
Commit 1681820 from mikemccand in branch 'dev/branches/LUCENE-6481'
[ https://svn.apache.org/r1681820 ]
LUCENE-6481: make branch
You can use the postings when the cell is wholly contained within the polygon, which wasn't in that last patch. New patch attached to include this logic.
Boundary cells are still computed as they relate to the bbox. This could be improved by computing boundary cells as they relate to the shape as long as computing the existence of an intersection of the bbox and shape is fast - this is usually the Achilles heel of spatial relations.
I was curious about the polygon query performance, so I tweaked the "bboxes around London, UK" benchmark to just use a polygon query instead (with 5 points, first and last are the same to the polygon is close), and surprisingly the performance is only a bit slower than the bbox case (19.1 msec vs 17.8 msec) ... I expected much slower because in the polygon case we cannot use the prefix terms, I think?
Thanks nknize, tests pass for me now, and I ran the same "bboxes around London, UK" perf test and this is much faster than before (= LUCENE-6450): I now see ~17.8 msec avg per query (~2X faster than GeoHashPrefixTree I think).
I'll have a closer look at the patch ... I really want to make a visualization here just like the BKD tree ones (https://plus.google.com/+MichaelMcCandless/posts/Daj9FgYPdtv) to see how the morton-shapes "work" in filling in a polygon...
I think there are fun things we can explore (future!) to speed things up further, e.g. if we also index lat/lon into doc values, then we can use that for filtering, freeing us to also use prefix terms on boundary shapes, and also maybe freeing us to use encodings like Hilbert curves which should give better locality / able to use prefix terms more frequently since you would no longer need a fast decode from term -> lat/lon. But, it would use more disk space... we can also integrate with geo3d so we get shape intersection for faster polygon querying (now it must filter every point?). Later!
Thanks Mike! Not sure how I missed my own test. Trivial fix though, new patch added and all tests are passing on my end. Next iteration ready for review. This should be ready for sandbox commit blessings.
Ooh, awesome progress! I'll look in more detail, but: I applied the patch and hit this test failure:
ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testBBoxQuery -Dtests.seed=FC153607634373BA -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/lucenedata/hudson.enwiki.random.lines.txt.fixed -Dtests.locale=es -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] Started J0 PID(14260@localhost). [junit4] Suite: org.apache.lucene.search.TestGeoPointQuery [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testBBoxQuery -Dtests.seed=FC153607634373BA -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/lucenedata/hudson.enwiki.random.lines.txt.fixed -Dtests.locale=es -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 0.09s | TestGeoPointQuery.testBBoxQuery <<< [junit4] > Throwable #1: java.lang.AssertionError: GeoBoundingBoxQuery failed expected:<1> but was:<2> [junit4] > at __randomizedtesting.SeedInfo.seed([FC153607634373BA:5CE7EC6FB0DC9159]:0) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.testBBoxQuery(TestGeoPointQuery.java:113) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: leaving temporary files on disk at: /l/nick/lucene/build/sandbox/test/J0/temp/lucene.search.TestGeoPointQuery FC153607634373BA-001 [junit4] 2> NOTE: test params are: codec=Asserting(Lucene50): {geoField=PostingsFormat(name=MockRandom)}, docValues:{}, sim=DefaultSimilarity, locale=es, timezone=Asia/Jerusalem [junit4] 2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=385990872,total=504889344 [junit4] 2> NOTE: All tests run in this JVM: [TestGeoPointQuery] [junit4] Completed [1/1] in 0.48s, 1 test, 1 failure <<< FAILURES!
This patch is ready to go. In fact, this performance improvement should supersede the existing patch in LUCENE-6450.
Updated patch to fix false negatives. This now improves performance of LUCENE-6450 to 0.02 sec / query by using the postings list instead of visiting every term.
Minor patch update that adds geodesic to geodetic projection / reprojection methods to GeoUtils.
The test had the lat and lon ordering incorrect for both GeoPointFieldType and the GeoPointInBBoxQuery.
Oh, woops Thanks for fixing1
The test had the lat and lon ordering incorrect for both GeoPointFieldType and the GeoPointInBBoxQuery. I've attached a new patch with the correction.
testRandomTiny passes but there is one failure in testRandom with the following:
ant test -Dtestcase=TestGeoPointQuery -Dtestmethod=testRandom -Dtests.seed=F1E43F53709BFF82 -Dtests.verbose=true
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testRandom -Dtests.seed=F1E43F53709BFF82 -Dtests.slow=true -Dtests.locale=en_US -Dtests.timezone=Africa/Lome -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 1.54s | TestGeoPointQuery.testRandom <<< [junit4] > Throwable #1: java.lang.AssertionError: id=632 docID=613 lat=46.19240875459866 lon=143.92476891121902 expected true but got: false deleted?=false [junit4] > at __randomizedtesting.SeedInfo.seed([F1E43F53709BFF82:83A81A5CC1FB49F1]:0) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.verify(TestGeoPointQuery.java:302) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.doTestRandom(TestGeoPointQuery.java:204) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.testRandom(TestGeoPointQuery.java:130) [junit4] > at java.lang.Thread.run(Thread.java:745)
This should be enough to debug the issue. I expect to have a new patch sometime tomorrow or before weeks end.
New patch, starting from nknize's and then folding in the evilish random test I added for LUCENE-6477 ... maybe this can help debug why there are false negatives?
E.g. with this patch when I run:
ant test -Dtestcase=TestGeoPointQuery -Dtestmethod=testRandomTiny -Dtests.seed=F1E43F53709BFF82 -Dtests.verbose=true
It fails with this:
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testRandomTiny -Dtests.seed=F1E43F53709BFF82 -Dtests.locale=en_US -Dtests.timezone=Africa/Lome -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 2.91s | TestGeoPointQuery.testRandomTiny <<< [junit4] > Throwable #1: java.lang.AssertionError: id=0 docID=0 lat=-27.180279388889545 lon=-167.14191331870592 expected true but got: false deleted?=false [junit4] > at __randomizedtesting.SeedInfo.seed([F1E43F53709BFF82:B8A3E1152EBAC72E]:0) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.verify(TestGeoPointQuery.java:301) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.doTestRandom(TestGeoPointQuery.java:203) [junit4] > at org.apache.lucene.search.TestGeoPointQuery.testRandomTiny(TestGeoPointQuery.java:125) [junit4] > at java.lang.Thread.run(Thread.java:745)
The test case should be easy-ish to debug: it only indexes at most a few 10s of points...
First cut WIP patch. LuceneUtil benchmark shows false negatives, though, so this is definitely not ready. So far I've been unable to reproduce the false negatives...I put it here for iterating improvements.
GeoPointField
Index Time: 640.24 sec
Index Size: 4.4G
Mean Query Time: 0.02 sec
This issue was moved to GitHub issue: #7540.