Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 5.5, trunk
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      GeoPointField currently relies on NumericTokenStream to create prefix terms for a GeoPoint using the precision step defined in GeoPointField. At search time GeoPointTermsEnum recurses to a max precision that is computed by the Query parameters. This max precision is never the full precision, so creating and indexing the full precision terms is useless and wasteful (it was always a side effect of just using indexing logic from the Numeric type).

      Furthermore, since the numerical logic always stored high precision terms first, the recursion in GeoPointTermsEnum required transient memory for storing ranges. By moving the trie logic to its own GeoPointTokenStream and reversing the term order (such that lower resolution terms are first), the GeoPointTermsEnum can naturally traverse, enabling on-demand creation of PrefixTerms. This will be done in a separate issue.

      1. LUCENE-6930.patch
        117 kB
        Nicholas Knize
      2. LUCENE-6930.patch
        115 kB
        Nicholas Knize
      3. LUCENE-6930.patch
        121 kB
        Nicholas Knize
      4. LUCENE-6930.patch
        120 kB
        Nicholas Knize
      5. LUCENE-6930.patch
        100 kB
        Nicholas Knize
      6. LUCENE-6930.patch
        84 kB
        Nicholas Knize

        Activity

        Hide
        Michael McCandless added a comment -

        +1, LegacyNumericType is now deprecated in trunk (to be removed in 7.0), so we should migrate away from it ...

        But we should maybe take this further, once we get all dimensional values based geo queries working well in trunk (e.g. at least DimensionalDistanceQuery and DimensionalDistanceRangeQuery are still missing?) and deprecate the postings based geo queries as well?

        Show
        Michael McCandless added a comment - +1, LegacyNumericType is now deprecated in trunk (to be removed in 7.0), so we should migrate away from it ... But we should maybe take this further, once we get all dimensional values based geo queries working well in trunk (e.g. at least DimensionalDistanceQuery and DimensionalDistanceRangeQuery are still missing?) and deprecate the postings based geo queries as well?
        Hide
        Nicholas Knize added a comment -

        Attached patch adds the following changes:

        • Adds a new GeoPointTokenStream that encodes GeoPointField type into a maximum of 5 "prefix only" terms. Full precision post filtering uses DocValues.
        • Adds a GeoPointTermQuery.TermEncoding enum to choose between existing NUMERIC encoding or new PREFIX encoding
        • Refactors GeoPointTermsEnum into a base class with two derived classes:
          • GeoPointNumericTermsEnum - Uses existing NumericTokenStream logic
          • GeoPointPrefixTermsEnum - Uses new GeoPointTokenStream logic
        • Refactors term encoding logic out of GeoUtils into a dedicated GeoEncodingUtils class
        • Adds randomTermEncoding to TestGeoPointQuery to randomly test both encoding approaches

        Quick luceneutil benchmarks: 76% reduction in index size, 81% boost in indexing performance, 19% average boost in query performance.

        Show
        Nicholas Knize added a comment - Attached patch adds the following changes: Adds a new GeoPointTokenStream that encodes GeoPointField type into a maximum of 5 "prefix only" terms. Full precision post filtering uses DocValues. Adds a GeoPointTermQuery.TermEncoding enum to choose between existing NUMERIC encoding or new PREFIX encoding Refactors GeoPointTermsEnum into a base class with two derived classes: GeoPointNumericTermsEnum - Uses existing NumericTokenStream logic GeoPointPrefixTermsEnum - Uses new GeoPointTokenStream logic Refactors term encoding logic out of GeoUtils into a dedicated GeoEncodingUtils class Adds randomTermEncoding to TestGeoPointQuery to randomly test both encoding approaches Quick luceneutil benchmarks: 76% reduction in index size, 81% boost in indexing performance, 19% average boost in query performance.
        Hide
        Michael McCandless added a comment -

        Hmm I get compilation errors because things still reference GeoUtils but it was moved? Maybe run "ant clean" first and then you should see the errors?

        Show
        Michael McCandless added a comment - Hmm I get compilation errors because things still reference GeoUtils but it was moved? Maybe run "ant clean" first and then you should see the errors?
        Hide
        Nicholas Knize added a comment - - edited

        Try again with the new patch. I guess diff didn't like the way I refactored GeoUtils and added another new GeoUtils class.

        Show
        Nicholas Knize added a comment - - edited Try again with the new patch. I guess diff didn't like the way I refactored GeoUtils and added another new GeoUtils class.
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize

        If you pass --show-copies-as-adds to svn it will make a more easily applied patch ...

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize If you pass --show-copies-as-adds to svn it will make a more easily applied patch ...
        Hide
        Nicholas Knize added a comment -

        Nice!!! Thanks for that protip! That certainly would have come in handy in this situation.

        Show
        Nicholas Knize added a comment - Nice!!! Thanks for that protip! That certainly would have come in handy in this situation.
        Hide
        Michael McCandless added a comment -

        Hmm, the tests pass for me with this patch, but when I went to benchmark it, I'm seeing a different number of hits on trunk:

        ITER: 5 5.861972507 sec; totHits=221120357; 225 queries
        

        vs with the patch:

        ITER: 5 5.754698705 sec; totHits=221120418; 225 queries
        

        The results should not have changed, because we use doc values for precise matching, right? Why are tests not catching this

        Show
        Michael McCandless added a comment - Hmm, the tests pass for me with this patch, but when I went to benchmark it, I'm seeing a different number of hits on trunk: ITER: 5 5.861972507 sec; totHits=221120357; 225 queries vs with the patch: ITER: 5 5.754698705 sec; totHits=221120418; 225 queries The results should not have changed, because we use doc values for precise matching, right? Why are tests not catching this
        Hide
        Michael McCandless added a comment -

        It's odd that we pass GeoPointTermQuery down to GeoPointTermsEnum
        ctor which then goes and secretly sets a field:

          query.cellComparator.termEnum = this;
        

        Can we make GeoPointTermQuery package private again, and do this
        "up above"? And put TermEncoding somewhere else to be public?

        Can we rename GeoPointTermQuery to GeoPointMultiTermQuery?

        In the javadocs for TermEncoding's two options can you explain
        that one is newer and smaller/faster than the other, which is now
        "legacy" (NUMERIC)? Can we deprecate the NUMERIC one?

        In GeoPointTermsEnum.newInstance can we change:

            if (query.termEncoding == GeoPointTermQuery.TermEncoding.PREFIX) {
              return new GeoPointPrefixTermsEnum(terms, query);
            }
            return new GeoPointNumericTermsEnum(terms, query);
        

        to e.g.:

            if (query.termEncoding == GeoPointTermQuery.TermEncoding.PREFIX) {
              return new GeoPointPrefixTermsEnum(terms, query);
            } else if (query.termEncoding == GeoPointTermQuery.TermEncoding.NUMERIC) {
              return new GeoPointNumericTermsEnum(terms, query);
            } else {
              throw new IllegalArgumentException(...);
            }
        
        Show
        Michael McCandless added a comment - It's odd that we pass GeoPointTermQuery down to GeoPointTermsEnum ctor which then goes and secretly sets a field: query.cellComparator.termEnum = this; Can we make GeoPointTermQuery package private again, and do this "up above"? And put TermEncoding somewhere else to be public? Can we rename GeoPointTermQuery to GeoPointMultiTermQuery ? In the javadocs for TermEncoding 's two options can you explain that one is newer and smaller/faster than the other, which is now "legacy" ( NUMERIC )? Can we deprecate the NUMERIC one? In GeoPointTermsEnum.newInstance can we change: if (query.termEncoding == GeoPointTermQuery.TermEncoding.PREFIX) { return new GeoPointPrefixTermsEnum(terms, query); } return new GeoPointNumericTermsEnum(terms, query); to e.g.: if (query.termEncoding == GeoPointTermQuery.TermEncoding.PREFIX) { return new GeoPointPrefixTermsEnum(terms, query); } else if (query.termEncoding == GeoPointTermQuery.TermEncoding.NUMERIC) { return new GeoPointNumericTermsEnum(terms, query); } else { throw new IllegalArgumentException(...); }
        Hide
        Nicholas Knize added a comment -

        Thanks for the great feedback Michael McCandless Patch updated with recommendations.

        Show
        Nicholas Knize added a comment - Thanks for the great feedback Michael McCandless Patch updated with recommendations.
        Hide
        Nicholas Knize added a comment -

        It's related to TOLERANCE and that computeMaxShift in this patch is reused for all query types. So the depth of traversal for distanceQueries may vary slightly. The effect is that trunk has a handful of false negatives that this patch has picked up. The tests don't report them as failures because the deltas are within the TOLERANCE.

        Show
        Nicholas Knize added a comment - It's related to TOLERANCE and that computeMaxShift in this patch is reused for all query types. So the depth of traversal for distanceQueries may vary slightly. The effect is that trunk has a handful of false negatives that this patch has picked up. The tests don't report them as failures because the deltas are within the TOLERANCE .
        Hide
        Michael McCandless added a comment -

        Hmm I'm seeing this test failure with this patch:

        [junit4:pickseed] Seed property 'tests.seed' already defined: 3B2C0D9EBF6EC99D
           [junit4] <JUnit4> says hello! Master seed: 3B2C0D9EBF6EC99D
           [junit4] Executing 1 suite with 1 JVM.
           [junit4] 
           [junit4] Started J0 PID(30297@localhost).
           [junit4] Suite: org.apache.lucene.search.TestGeoPointQuery
           [junit4]   1> T4: id=7224 should not match but did
           [junit4]   1>   small=true query=GeoPointInBBoxQuery: field=point: Lower Left: [85.35664315745854,-41.59146759172397] Upper Right: [86.66116425340478,-40.74649261518726] docID=7060
           [junit4]   1>   lat=-40.74649160581509 lon=86.53170426878272
           [junit4]   1>   deleted?=false
           [junit4]   2> jan 22, 2016 2:35:24 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[T4,5,TGRP-TestGeoPointQuery]
           [junit4]   2> java.lang.AssertionError: some hits were wrong
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([3B2C0D9EBF6EC99D]:0)
           [junit4]   2> 	at org.junit.Assert.fail(Assert.java:93)
           [junit4]   2> 	at org.apache.lucene.util.BaseGeoPointTestCase$VerifyHits.test(BaseGeoPointTestCase.java:552)
           [junit4]   2> 	at org.apache.lucene.util.BaseGeoPointTestCase$2._run(BaseGeoPointTestCase.java:756)
           [junit4]   2> 	at org.apache.lucene.util.BaseGeoPointTestCase$2.run(BaseGeoPointTestCase.java:623)
           [junit4]   2> 
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeoPointQuery -Dtests.method=testAllLonEqual -Dtests.seed=3B2C0D9EBF6EC99D -Dtests.multiplier=2 -Dtests.slow=true -Dtests.linedocsfile=/lucenedata/hudson.enwiki.random.lines.txt.fixed -Dtests.locale=no -Dtests.timezone=Europe/Jersey -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   2.46s | TestGeoPointQuery.testAllLonEqual <<<
           [junit4]    > Throwable #1: java.lang.AssertionError
           [junit4]    > 	at org.apache.lucene.util.BaseGeoPointTestCase.verify(BaseGeoPointTestCase.java:770)
           [junit4]    > 	at org.apache.lucene.util.BaseGeoPointTestCase.testAllLonEqual(BaseGeoPointTestCase.java:203)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=19, name=T4, state=RUNNABLE, group=TGRP-TestGeoPointQuery]
           [junit4]    > Caused by: java.lang.AssertionError: some hits were wrong
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([3B2C0D9EBF6EC99D]:0)
           [junit4]    > 	at org.apache.lucene.util.BaseGeoPointTestCase$VerifyHits.test(BaseGeoPointTestCase.java:552)
           [junit4]    > 	at org.apache.lucene.util.BaseGeoPointTestCase$2._run(BaseGeoPointTestCase.java:756)
           [junit4]    > 	at org.apache.lucene.util.BaseGeoPointTestCase$2.run(BaseGeoPointTestCase.java:623)
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene60): {id=FSTOrd50, point=PostingsFormat(name=MockRandom)}, docValues:{id=DocValuesFormat(name=Direct), point=DocValuesFormat(name=Memory)}, sim=ClassicSimilarity, locale=no, timezone=Europe/Jersey
           [junit4]   2> NOTE: Linux 3.13.0-71-generic amd64/Oracle Corporation 1.8.0_60 (64-bit)/cpus=8,threads=1,free=417685072,total=493355008
           [junit4]   2> NOTE: All tests run in this JVM: [TestGeoPointQuery]
           [junit4] Completed [1/1 (1!)] in 2.87s, 1 test, 1 error <<< FAILURES!
           [junit4] 
           [junit4] 
           [junit4] Tests with failures [seed: 3B2C0D9EBF6EC99D]:
           [junit4]   - org.apache.lucene.search.TestGeoPointQuery.testAllLonEqual
        
        Show
        Michael McCandless added a comment - Hmm I'm seeing this test failure with this patch: [junit4:pickseed] Seed property 'tests.seed' already defined: 3B2C0D9EBF6EC99D [junit4] <JUnit4> says hello! Master seed: 3B2C0D9EBF6EC99D [junit4] Executing 1 suite with 1 JVM. [junit4] [junit4] Started J0 PID(30297@localhost). [junit4] Suite: org.apache.lucene.search.TestGeoPointQuery [junit4] 1> T4: id=7224 should not match but did [junit4] 1> small=true query=GeoPointInBBoxQuery: field=point: Lower Left: [85.35664315745854,-41.59146759172397] Upper Right: [86.66116425340478,-40.74649261518726] docID=7060 [junit4] 1> lat=-40.74649160581509 lon=86.53170426878272 [junit4] 1> deleted?=false [junit4] 2> jan 22, 2016 2:35:24 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[T4,5,TGRP-TestGeoPointQuery] [junit4] 2> java.lang.AssertionError: some hits were wrong [junit4] 2> at __randomizedtesting.SeedInfo.seed([3B2C0D9EBF6EC99D]:0) [junit4] 2> at org.junit.Assert.fail(Assert.java:93) [junit4] 2> at org.apache.lucene.util.BaseGeoPointTestCase$VerifyHits.test(BaseGeoPointTestCase.java:552) [junit4] 2> at org.apache.lucene.util.BaseGeoPointTestCase$2._run(BaseGeoPointTestCase.java:756) [junit4] 2> at org.apache.lucene.util.BaseGeoPointTestCase$2.run(BaseGeoPointTestCase.java:623) [junit4] 2> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestGeoPointQuery -Dtests.method=testAllLonEqual -Dtests.seed=3B2C0D9EBF6EC99D -Dtests.multiplier=2 -Dtests.slow=true -Dtests.linedocsfile=/lucenedata/hudson.enwiki.random.lines.txt.fixed -Dtests.locale=no -Dtests.timezone=Europe/Jersey -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 2.46s | TestGeoPointQuery.testAllLonEqual <<< [junit4] > Throwable #1: java.lang.AssertionError [junit4] > at org.apache.lucene.util.BaseGeoPointTestCase.verify(BaseGeoPointTestCase.java:770) [junit4] > at org.apache.lucene.util.BaseGeoPointTestCase.testAllLonEqual(BaseGeoPointTestCase.java:203) [junit4] > at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=19, name=T4, state=RUNNABLE, group=TGRP-TestGeoPointQuery] [junit4] > Caused by: java.lang.AssertionError: some hits were wrong [junit4] > at __randomizedtesting.SeedInfo.seed([3B2C0D9EBF6EC99D]:0) [junit4] > at org.apache.lucene.util.BaseGeoPointTestCase$VerifyHits.test(BaseGeoPointTestCase.java:552) [junit4] > at org.apache.lucene.util.BaseGeoPointTestCase$2._run(BaseGeoPointTestCase.java:756) [junit4] > at org.apache.lucene.util.BaseGeoPointTestCase$2.run(BaseGeoPointTestCase.java:623) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene60): {id=FSTOrd50, point=PostingsFormat(name=MockRandom)}, docValues:{id=DocValuesFormat(name=Direct), point=DocValuesFormat(name=Memory)}, sim=ClassicSimilarity, locale=no, timezone=Europe/Jersey [junit4] 2> NOTE: Linux 3.13.0-71-generic amd64/Oracle Corporation 1.8.0_60 (64-bit)/cpus=8,threads=1,free=417685072,total=493355008 [junit4] 2> NOTE: All tests run in this JVM: [TestGeoPointQuery] [junit4] Completed [1/1 (1!)] in 2.87s, 1 test, 1 error <<< FAILURES! [junit4] [junit4] [junit4] Tests with failures [seed: 3B2C0D9EBF6EC99D]: [junit4] - org.apache.lucene.search.TestGeoPointQuery.testAllLonEqual
        Hide
        Michael McCandless added a comment -

        I think you can attach javadoc to each enum constant value
        (PREFIX, NUMERIC)? And then add @deprecated for
        NUMERIC's javadoc.

        Can you add braces even if the body is a single line? E.g.:

        +    if (isInit == false)
        +      throw new IllegalStateException("call setGeoCode() before usage");
        
        Show
        Michael McCandless added a comment - I think you can attach javadoc to each enum constant value ( PREFIX , NUMERIC )? And then add @deprecated for NUMERIC 's javadoc. Can you add braces even if the body is a single line? E.g.: + if (isInit == false) + throw new IllegalStateException("call setGeoCode() before usage");
        Hide
        Nicholas Knize added a comment -

        That's unrelated to this patch. Its from nuking the pre-quantization in the tests and is being fixed by LUCENE-6956.

        Show
        Nicholas Knize added a comment - That's unrelated to this patch. Its from nuking the pre-quantization in the tests and is being fixed by LUCENE-6956 .
        Hide
        Michael McCandless added a comment -

        OK, phew!

        Show
        Michael McCandless added a comment - OK, phew!
        Hide
        Nicholas Knize added a comment - - edited

        Updated patch to include the following:

        • incorporate review feedback
        • override GeoPointPrefixTermsEnum.accept to "seek" to the "floor" range of the candidate term. This boosts query performance by eliminating superfluous range visits.
        • fixed bug in GeoEncodingUtils.geoCodedToPrefixCodedBytes and .getPrefixCodedLongShift that was ignoring the BytesRef.offset variable

        I'm going to open up another query performance improvement issue that switches from comparing BytesRefs to directly comparing the long encoded range values. This will instead convert candidate terms to their encoded range values and eliminate the need for constantly converting ranges to BytesRefs for comparisons.

        NOTE: Beast testing this may result in some accuracy failures that are being fixed separately by LUCENE-6956

        Show
        Nicholas Knize added a comment - - edited Updated patch to include the following: incorporate review feedback override GeoPointPrefixTermsEnum.accept to "seek" to the "floor" range of the candidate term. This boosts query performance by eliminating superfluous range visits. fixed bug in GeoEncodingUtils.geoCodedToPrefixCodedBytes and .getPrefixCodedLongShift that was ignoring the BytesRef.offset variable I'm going to open up another query performance improvement issue that switches from comparing BytesRefs to directly comparing the long encoded range values. This will instead convert candidate terms to their encoded range values and eliminate the need for constantly converting ranges to BytesRefs for comparisons. NOTE: Beast testing this may result in some accuracy failures that are being fixed separately by LUCENE-6956
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize ... the last patch is a bit odd, e.g. removing the entire GeoUtils.java and adding it back again ... can you fix it? Maybe there is some magical git option to git diff?

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize ... the last patch is a bit odd, e.g. removing the entire GeoUtils.java and adding it back again ... can you fix it? Maybe there is some magical git option to git diff?
        Hide
        Michael McCandless added a comment -

        Hmm, also a few compilation errors, e.g.:

            [javac] /l/nick/lucene/sandbox/src/test/org/apache/lucene/search/TestGeoPointQuery.java:373: error: cannot find symbol
            [javac]       long enc = GeoUtils.mortonHash(lon, lat);
            [javac]                          ^
            [javac]   symbol:   method mortonHash(double,double)
            [javac]   location: class GeoUtils
            [javac] /l/nick/lucene/sandbox/src/test/org/apache/lucene/search/TestGeoPointQuery.java:374: error: cannot find symbol
            [javac]       double latEnc = GeoUtils.mortonUnhashLat(enc);
            [javac]                               ^
            [javac]   symbol:   method mortonUnhashLat(long)
            [javac]   location: class GeoUtils
        
        Show
        Michael McCandless added a comment - Hmm, also a few compilation errors, e.g.: [javac] /l/nick/lucene/sandbox/src/test/org/apache/lucene/search/TestGeoPointQuery.java:373: error: cannot find symbol [javac] long enc = GeoUtils.mortonHash(lon, lat); [javac] ^ [javac] symbol: method mortonHash(double,double) [javac] location: class GeoUtils [javac] /l/nick/lucene/sandbox/src/test/org/apache/lucene/search/TestGeoPointQuery.java:374: error: cannot find symbol [javac] double latEnc = GeoUtils.mortonUnhashLat(enc); [javac] ^ [javac] symbol: method mortonUnhashLat(long) [javac] location: class GeoUtils
        Hide
        Nicholas Knize added a comment -

        argghh.. guessed the fancy --show-copies-as-adds didn't work this time. I'll prep a new patch shortly. Thx Michael McCandless!

        Show
        Nicholas Knize added a comment - argghh.. guessed the fancy --show-copies-as-adds didn't work this time. I'll prep a new patch shortly. Thx Michael McCandless !
        Hide
        Nicholas Knize added a comment -

        Corrected patch

        Show
        Nicholas Knize added a comment - Corrected patch
        Hide
        Michael McCandless added a comment -

        Thanks Nicholas Knize, this is a nice (and large!) change.

        The sole "R" in this javadoc left me hanging a bit

          /**
           * GeoTerms are coded as the following:
           *
           * R
           */
        
        

        Can you update GeoPointDistanceQuery javadocs explaining the max
        radius limit? I.e. that the circle projected on the earth's surface
        cannot wrap around and touch itself again (if I understand that
        right!)?

        Can we move GeoPointTokenStream under o.a.l.document and make
        it package private? (And make TermEncoding public elsewhere.)

        Can all other ctors of GeoPointField just forward to the primary
        ("takes everything") ctor call, i.e. call this(...) instead of
        super(...)? Also, can we break this compound ternary operator
        into a static helper method?:

            super(name, stored == Store.YES ?
                termEncoding == GeoPointTokenStream.TermEncoding.PREFIX ? PREFIX_TYPE_STORED : NUMERIC_TYPE_STORED :
                termEncoding == GeoPointTokenStream.TermEncoding.PREFIX ? PREFIX_TYPE_NOT_STORED : NUMERIC_TYPE_NOT_STORED);
        

        E.g. maybe getFieldType.

        Should it be an error if you pass a custom FieldType to
        GeoPointField that disabled indexing? I.e. catch that up front,
        where we check DV type and numeric type, and then remove this:

            if (fieldType().indexOptions() == IndexOptions.NONE) {
              // Not indexed
              return null;
            }
        

        from the tokenStream method?

        Can we deprecate the GeoPointField ctors that take
        TermEncoding? (You should use/migrate to the default ctor that
        uses the better PREFIX encoding).

        GeoUtils.longToByteArray and .fromByteArray and
        GeoEncodingUtils.geoTermToString look dead?

        This comment confuses me:

          // start shift at 61
          private short shift;
        

        Does it really start at 61? Seems like (computeMaxShift) it's
        either 45 (for a large bbox) or 36 (for a not-large bbox)? Can we
        move the comment down to where we actually do assign to shift?

        Show
        Michael McCandless added a comment - Thanks Nicholas Knize , this is a nice (and large!) change. The sole "R" in this javadoc left me hanging a bit /** * GeoTerms are coded as the following: * * R */ Can you update GeoPointDistanceQuery javadocs explaining the max radius limit? I.e. that the circle projected on the earth's surface cannot wrap around and touch itself again (if I understand that right!)? Can we move GeoPointTokenStream under o.a.l.document and make it package private? (And make TermEncoding public elsewhere.) Can all other ctors of GeoPointField just forward to the primary ("takes everything") ctor call, i.e. call this(...) instead of super(...) ? Also, can we break this compound ternary operator into a static helper method?: super(name, stored == Store.YES ? termEncoding == GeoPointTokenStream.TermEncoding.PREFIX ? PREFIX_TYPE_STORED : NUMERIC_TYPE_STORED : termEncoding == GeoPointTokenStream.TermEncoding.PREFIX ? PREFIX_TYPE_NOT_STORED : NUMERIC_TYPE_NOT_STORED); E.g. maybe getFieldType . Should it be an error if you pass a custom FieldType to GeoPointField that disabled indexing? I.e. catch that up front, where we check DV type and numeric type, and then remove this: if (fieldType().indexOptions() == IndexOptions.NONE) { // Not indexed return null; } from the tokenStream method? Can we deprecate the GeoPointField ctors that take TermEncoding ? (You should use/migrate to the default ctor that uses the better PREFIX encoding). GeoUtils.longToByteArray and .fromByteArray and GeoEncodingUtils.geoTermToString look dead? This comment confuses me: // start shift at 61 private short shift; Does it really start at 61? Seems like ( computeMaxShift ) it's either 45 (for a large bbox) or 36 (for a not-large bbox)? Can we move the comment down to where we actually do assign to shift?
        Hide
        Nicholas Knize added a comment -

        Thanks for the feedback Michael McCandless! I'll post an updated patch shortly. In the meantime I think this will be blocked by LUCENE-6997 so I may put this in a feature branch.

        Show
        Nicholas Knize added a comment - Thanks for the feedback Michael McCandless ! I'll post an updated patch shortly. In the meantime I think this will be blocked by LUCENE-6997 so I may put this in a feature branch.
        Hide
        Nicholas Knize added a comment -

        Updated patch to include review feedback.

        Show
        Nicholas Knize added a comment - Updated patch to include review feedback.
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        If a user accidentally indexes with the legacy (NUMERIC) encoding
        but searches with PREFIX it won't be detected right? Like they
        will just get 0 results? I don't think we must fix this... seems like
        it's not easy since there is no "schema" for this.

        Hmm I didn't see this added?

        Can you update GeoPointDistanceQuery javadocs explaining the max
        radius limit? I.e. that the circle projected on the earth's surface
        cannot wrap around and touch itself again (if I understand that
        right!)?

        +1 to commit! Thanks Nicholas Knize ... I'm curious to see how this improves
        the metrics (indexing time, index size, heap used by IndexReader,
        search time).

        Show
        Michael McCandless added a comment - Patch looks good! If a user accidentally indexes with the legacy ( NUMERIC ) encoding but searches with PREFIX it won't be detected right? Like they will just get 0 results? I don't think we must fix this... seems like it's not easy since there is no "schema" for this. Hmm I didn't see this added? Can you update GeoPointDistanceQuery javadocs explaining the max radius limit? I.e. that the circle projected on the earth's surface cannot wrap around and touch itself again (if I understand that right!)? +1 to commit! Thanks Nicholas Knize ... I'm curious to see how this improves the metrics (indexing time, index size, heap used by IndexReader , search time).
        Hide
        ASF subversion and git services added a comment -

        Commit ae3b388e974960091594aee7e1b39d3d3a090520 in lucene-solr's branch refs/heads/master from nknize
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ae3b388 ]

        LUCENE-6930: Decouples GeoPointField from NumericType by using a custom GeoPointTokenStream and TermEnum designed for GeoPoint prefix terms

        Show
        ASF subversion and git services added a comment - Commit ae3b388e974960091594aee7e1b39d3d3a090520 in lucene-solr's branch refs/heads/master from nknize [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ae3b388 ] LUCENE-6930 : Decouples GeoPointField from NumericType by using a custom GeoPointTokenStream and TermEnum designed for GeoPoint prefix terms
        Hide
        ASF subversion and git services added a comment -

        Commit 74a08c08006941b74eda585b86b57fbe0ff341b2 in lucene-solr's branch refs/heads/branch_5x from nknize
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74a08c0 ]

        LUCENE-6930: Decouples GeoPointField from NumericType by using a custom GeoPointTokenStream and TermEnum designed for GeoPoint prefix terms

        Show
        ASF subversion and git services added a comment - Commit 74a08c08006941b74eda585b86b57fbe0ff341b2 in lucene-solr's branch refs/heads/branch_5x from nknize [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74a08c0 ] LUCENE-6930 : Decouples GeoPointField from NumericType by using a custom GeoPointTokenStream and TermEnum designed for GeoPoint prefix terms

          People

          • Assignee:
            Unassigned
            Reporter:
            Nicholas Knize
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development