Lucene - Core
  LUCENE-5879

Add auto-prefix terms to block tree terms dict

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This cool idea to generalize numeric/trie fields came from Adrien:

      Today, when we index a numeric field (LongField, etc.) we pre-compute
      (via NumericTokenStream) outside of indexer/codec which prefix terms
      should be indexed.

      But this can be inefficient: you set a static precisionStep, and
      always add those prefix terms regardless of how the terms in the field
      are actually distributed. Yet typically in real world applications
      the terms have a non-random distribution.
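
The fixed-precisionStep expansion described above can be sketched as follows (a simplified illustration, not Lucene's actual NumericTokenStream code; the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of static trie-field prefix terms: each indexed long emits one
// term per precision level, with the lowest `shift` bits cleared. The
// number of extra terms depends only on precisionStep, never on how the
// values are actually distributed.
public class TriePrefixSketch {

  /** Returns {shift, maskedValue} pairs, one per precision level. */
  public static List<long[]> prefixTerms(long value, int precisionStep) {
    List<long[]> terms = new ArrayList<>();
    for (int shift = 0; shift < 64; shift += precisionStep) {
      long masked = shift == 0 ? value : value & ~((1L << shift) - 1);
      terms.add(new long[] {shift, masked});
    }
    return terms;
  }

  public static void main(String[] args) {
    // precisionStep=16 always yields 4 terms per value (shift 0,16,32,48).
    System.out.println(prefixTerms(0x123456789ABCDEF0L, 16).size()); // 4
  }
}
```

Lowering precisionStep to 4 would emit 16 terms per value, which is exactly the index-size blow-up an adaptive scheme could avoid.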

      So, it should be better if instead the terms dict decides where it
      makes sense to insert prefix terms, based on how dense the terms are
      in each region of term space.

      This way we can speed up query time for both term (e.g. infix
      suggester) and numeric ranges, and it should let us use less index
      space and get faster range queries.

      This would also mean that min/maxTerm for a numeric field would now be
      correct, vs today where the externally computed prefix terms are
      placed after the full precision terms, causing hairy code like
      NumericUtils.getMaxInt/Long. So optimizations like LUCENE-5860 become
      feasible.

      The terms dict can also do tricks not possible if you must live on top
      of its APIs, e.g. to handle the adversary/over-constrained case when a
      given prefix has too many terms following it but finer prefixes
      have too few (what block tree calls "floor term blocks").

      1. LUCENE-5879.patch
        327 kB
        Michael McCandless
      2. LUCENE-5879.patch
        226 kB
        Michael McCandless
      3. LUCENE-5879.patch
        225 kB
        Michael McCandless
      4. LUCENE-5879.patch
        220 kB
        Michael McCandless
      5. LUCENE-5879.patch
        218 kB
        Michael McCandless
      6. LUCENE-5879.patch
        292 kB
        Michael McCandless
      7. LUCENE-5879.patch
        281 kB
        Michael McCandless
      8. LUCENE-5879.patch
        279 kB
        Michael McCandless
      9. LUCENE-5879.patch
        268 kB
        Michael McCandless
      10. LUCENE-5879.patch
        244 kB
        Michael McCandless
      11. LUCENE-5879.patch
        217 kB
        Michael McCandless
      12. LUCENE-5879.patch
        213 kB
        Michael McCandless
      13. LUCENE-5879.patch
        149 kB
        Michael McCandless
      14. LUCENE-5879.patch
        133 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          Initial work-in-progress patch: tests do NOT consistently pass, there
          are still bugs/corner cases (e.g. when an auto-prefix term is the last
          term in a block...).

          This change very much requires LUCENE-5268 (pull API for postings
          format) which I'd like to backport to 4.x along with this: it makes an
          initial pass through all terms to identify "good" prefixes, using the
          same algorithm block tree uses to assign terms to blocks, just with
          different block sizes. I haven't picked defaults yet, but e.g. you
          could state that an auto prefix term should expand to 100-200 terms
          and then the first pass picks prefix terms to handle that.

          The problem is inherently over-constrained: a given set of prefixes
          like fooa*, foob*, fooc*, etc. may have too few terms each, but then
          their common prefix foo* would have way too many. For this case it
          creates "floored" prefix terms, e.g. foo[a-e]*, foo[f-p]*, foo[q-z]*.
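
The flooring idea can be sketched greedily (a rough illustration only, assuming in-memory sorted terms; block tree's real logic also honors a minimum block size and operates on bytes, not chars):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of "floored" prefix blocks: when foo* has too many terms but each
// fooX* has too few, consecutive next-character groups are packed into label
// ranges like foo[a-e]* holding at most maxItems terms each.
public class FloorBlockSketch {

  /** Packs the next-character groups under `prefix` into ranges of <= maxItems terms. */
  public static List<String> floorBlocks(List<String> terms, String prefix, int maxItems) {
    // Count how many terms fall under each next character after the prefix.
    TreeMap<Character, Integer> counts = new TreeMap<>();
    for (String t : terms) {
      if (t.startsWith(prefix) && t.length() > prefix.length()) {
        counts.merge(t.charAt(prefix.length()), 1, Integer::sum);
      }
    }
    List<String> blocks = new ArrayList<>();
    Character first = null;
    char last = 0;
    int inBlock = 0;
    for (Map.Entry<Character, Integer> e : counts.entrySet()) {
      if (first != null && inBlock + e.getValue() > maxItems) {
        blocks.add(prefix + "[" + first + "-" + last + "]*"); // close the floor block
        first = null;
        inBlock = 0;
      }
      if (first == null) first = e.getKey();
      inBlock += e.getValue();
      last = e.getKey();
    }
    if (first != null) blocks.add(prefix + "[" + first + "-" + last + "]*");
    return blocks;
  }
}
```

For terms fooa1, fooa2, foob1, foob2, fooc1 with maxItems=3, this yields the two floor blocks foo[a-a]* and foo[b-c]*.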

          On the 2nd pass, when it writes the actual terms, it inserts these
          auto-prefix terms at the right places.

          Currently it only works for DOCS_ONLY fields, and it uses a
          FixedBitSet(maxDoc) when writing each prefix term.
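
Conceptually, writing one auto-prefix term means OR'ing together the postings of every real term under that prefix (the sketch below uses java.util.BitSet in place of Lucene's FixedBitSet, and a hypothetical in-memory term-to-docs map):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Sketch: a prefix term's postings are the union of the docIDs of all real
// terms sharing that prefix, collected into one maxDoc-sized bitset.
public class PrefixPostingsSketch {

  public static BitSet prefixPostings(TreeMap<String, int[]> termToDocs,
                                      String prefix, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    // The sorted map lets us walk only the contiguous range of terms
    // starting at the prefix, stopping at the first non-matching term.
    for (Map.Entry<String, int[]> e : termToDocs.tailMap(prefix).entrySet()) {
      if (!e.getKey().startsWith(prefix)) break;
      for (int doc : e.getValue()) bits.set(doc);
    }
    return bits;
  }
}
```

This also hints at the memory cost the comment mentions: one maxDoc-sized bitset is held while each prefix term is written.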

          These auto-prefix terms are fully hidden from all the normal
          Terms/Enum APIs, statistics, etc. They are only used in
          Terms.intersect, if you pass a new flag allowing them to be used.

          I haven't done anything about the document / searching side of things:
          this is just a low level change at this point, for the terms dict.
          Maybe we need a new FieldType boolean "computeAutoPrefixTerms" or some
          such; currently it's just exposed as additional params to the block
          tree terms dict writer.

          I think this would mean NumericRangeQuery/Filter can just rewrite to
          ordinary TermRangeQuery/Filter, and the numeric fields just become
          sugar for encoding their numeric values as sortable binary terms.
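
The "sortable binary terms" idea rests on the sign-flip trick: XOR the long with Long.MIN_VALUE and emit big-endian bytes, so unsigned lexicographic byte order matches signed numeric order (a sketch of the concept, not NumericUtils itself):

```java
// Sketch of sortable binary encoding for longs. Flipping the sign bit maps
// Long.MIN_VALUE..Long.MAX_VALUE onto 0..2^64-1, so comparing the big-endian
// bytes as unsigned values agrees with signed numeric comparison.
public class SortableLong {

  public static byte[] encode(long value) {
    long sortable = value ^ Long.MIN_VALUE; // flip the sign bit
    byte[] bytes = new byte[8];
    for (int i = 0; i < 8; i++) {
      bytes[i] = (byte) (sortable >>> (56 - 8 * i));
    }
    return bytes;
  }

  /** Unsigned lexicographic comparison, as the terms dict orders terms. */
  public static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < 8; i++) {
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) return d;
    }
    return 0;
  }
}
```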

          Michael McCandless added a comment -

          Another iteration ... still tons of nocommits. I added Automata.makeBinaryInterval, so you can easily make an automaton that matches a min / max term range. I plan to switch the read-time API to use a new CompiledAutomaton.AUTOMATON_TYPE.RANGE, and fix e.g. PrefixQuery (and eventually maybe the infix suggester, and numeric range query) to use this API.

          Adrien Grand added a comment -

          Thanks Mike for working on this, this is a very exciting issue! I'm very curious what the space/speed trade-off will look like compared to static prefix encoding like NumericUtils does.

          Maybe we need a new FieldType boolean "computeAutoPrefixTerms"

          Why would it be needed? If the search APIs use intersect, this should be transparent?

          Michael McCandless added a comment -

          Maybe we need a new FieldType boolean "computeAutoPrefixTerms"

          Why would it be needed? If the search APIs use intersect, this should be transparent?

          I think being totally transparent would be the best solution! This would mean BT will always index auto-prefix terms for DOCS_ONLY fields ... I'll just have to test what the indexing time / disk usage cost is. If we need to make it optional at indexing time, I'm not sure what the API should look like to make it easy...

          Robert Muir added a comment -

          If the existing API is confusing, making it worse by adding a confusing boolean won't really fix the problem.

          Uwe Schindler added a comment -

          In fact, as far as I see: NumericRangeQuery with that would then just be a standard TermRangeQuery with a special binary upper/lower term?

          Otherwise very cool!

          Michael McCandless added a comment -

          In fact, as far as I see: NumericRangeQuery with that would then just be a standard TermRangeQuery with a special binary upper/lower term?

          I think so! We should be able to use NumericUtils.XToSortableInt/Long I think?

          But that's phase 2 here... it's hard enough just getting these terms working low-level...

          Michael McCandless added a comment -

          Latest patch: tests are passing, but still many nocommits ... I think net/net the approach is done and now I'm gonna wrap up / clean nocommits.

          Michael McCandless added a comment -

          New patch, resolving a few nocommits (but more remain!). Tests pass.

          I created a simple benchmark, generating random longs over the full
          range, and random ranges as queries. For each test, I build the index
          once (10M longs), and then run all queries (10K random ranges) 10
          times and record best time. I also print out the number of terms
          visited for the first query as a coarse measure of the prefix terms
          density.

          NF is numeric field and AP is auto-prefix. For AP I just index the
          long values as 8 byte binary term, using the same logic in
          NumericUtils.longToPrefixCodedBytes to make the binary sort the same
          as the numeric sort.

          Source code for benchmark is here:

          https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/python/autoPrefixPerf.py

          And here:

          https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/main/perf/AutoPrefixPerf.java

          The indexer uses a single thread and SerialMergeScheduler to try to
          measure the added indexing cost of writing the terms.

          Currently the auto-prefix terms are configured like term blocks in
          block tree, with a min/max number of items (terms, or "child"
          auto-prefix terms) per auto-prefix term.

          Here are the raw results ... net/net it looks like AP can match NF's
          performance with a ~50% smaller index and slightly faster indexing
          time. The two approaches are quite different: NF uses TermsEnum.seek,
          but AP uses Terms.intersect which for block tree does no seeking.

          However, there's a non-trivial indexing time & space cost vs the
          baseline... not sure we can/should just always enable this by default
          for DOCS_ONLY fields ...

          Baseline (8-byte binary terms like AP)
              index sec 20.312
              index MB 123.4
          
          NF precStep=4
              index sec 225.06
              index MB 1275.70
              term count 560
              search msec 6056
          NF precStep=8
              index sec 115.44
              index MB 675.55
              term count 2983
              search msec 5547
          NF precStep=12
              index sec 80.77
              index MB 470.96
              term count 37405
              search msec 6080
          NF precStep=16 (default)
              index sec 61.13
              index MB 363.19
              term count 125906
              search msec 10466
          AP min=5 max=8
              index sec 194.21
              index MB 272.06
              term count 315
              search msec 5715
          AP min=5 max=12
              index sec 179.86
              index MB 256.88
              term count 295
              search msec 5771
          AP min=5 max=16
              index sec 168.91
              index MB 254.32
              term count 310
              search msec 5727
          AP min=5 max=20
              index sec 157.48
              index MB 252.04
              term count 321
              search msec 5742
          AP min=5 max=2147483647
              index sec 64.03
              index MB 164.55
              term count 3955
              search msec 6168
          AP min=10 max=18
              index sec 106.52
              index MB 215.26
              term count 552
              search msec 5792
          AP min=10 max=27
              index sec 99.00
              index MB 212.45
              term count 533
              search msec 5814
          AP min=10 max=36
              index sec 88.45
              index MB 207.43
              term count 505
              search msec 5850
          AP min=10 max=45
              index sec 79.15
              index MB 194.73
              term count 650
              search msec 5681
          AP min=10 max=2147483647
              index sec 42.68
              index MB 162.64
              term count 6077
              search msec 6199
          AP min=15 max=28
              index sec 84.83
              index MB 204.29
              term count 641
              search msec 5763
          AP min=15 max=42
              index sec 74.20
              index MB 193.24
              term count 753
              search msec 5828
          AP min=15 max=56
              index sec 63.69
              index MB 190.06
              term count 662
              search msec 5839
          AP min=15 max=70
              index sec 62.53
              index MB 185.96
              term count 866
              search msec 5846
          AP min=15 max=2147483647
              index sec 40.94
              index MB 162.52
              term count 6258
              search msec 6156
          AP min=20 max=38
              index sec 69.26
              index MB 192.11
              term count 839
              search msec 5837
          AP min=20 max=57
              index sec 60.75
              index MB 186.18
              term count 1034
              search msec 5877
          AP min=20 max=76
              index sec 60.69
              index MB 185.21
              term count 980
              search msec 5866
          AP min=20 max=95
              index sec 59.87
              index MB 184.20
              term count 985
              search msec 5940
          AP min=20 max=2147483647
              index sec 41.64
              index MB 162.52
              term count 6258
              search msec 6196
          AP min=25 max=48
              index sec 61.81
              index MB 187.08
              term count 806
              search msec 5790
          AP min=25 max=72
              index sec 58.69
              index MB 183.02
              term count 929
              search msec 5894
          AP min=25 max=96
              index sec 56.80
              index MB 178.29
              term count 841
              search msec 5938
          AP min=25 max=120
              index sec 55.81
              index MB 177.75
              term count 1044
              search msec 5883
          AP min=25 max=2147483647
              index sec 40.99
              index MB 162.52
              term count 6258
              search msec 6189
          AP min=30 max=58
              index sec 56.99
              index MB 182.91
              term count 1012
              search msec 5891
          AP min=30 max=87
              index sec 55.22
              index MB 178.16
              term count 1065
              search msec 5921
          AP min=30 max=116
              index sec 54.78
              index MB 177.39
              term count 1142
              search msec 5811
          AP min=30 max=145
              index sec 54.22
              index MB 176.64
              term count 1340
              search msec 5798
          AP min=30 max=2147483647
              index sec 40.76
              index MB 162.44
              term count 6445
              search msec 6191
          AP min=35 max=68
              index sec 56.75
              index MB 182.45
              term count 1265
              search msec 5403
          AP min=35 max=102
              index sec 53.87
              index MB 177.99
              term count 1302
              search msec 5924
          AP min=35 max=136
              index sec 51.10
              index MB 177.25
              term count 1547
              search msec 6082
          AP min=35 max=170
              index sec 50.77
              index MB 176.93
              term count 1402
              search msec 5994
          AP min=35 max=2147483647
              index sec 40.73
              index MB 161.75
              term count 8045
              search msec 6271
          AP min=40 max=78
              index sec 55.75
              index MB 185.57
              term count 1499
              search msec 5917
          AP min=40 max=117
              index sec 53.12
              index MB 180.57
              term count 1501
              search msec 5961
          AP min=40 max=156
              index sec 50.93
              index MB 179.38
              term count 1806
              search msec 5487
          AP min=40 max=195
              index sec 50.31
              index MB 178.81
              term count 1763
              search msec 5484
          AP min=40 max=2147483647
              index sec 38.59
              index MB 158.98
              term count 12573
              search msec 6281
          
          Adrien Grand added a comment -

          The numbers look great, it's quite interesting to see the impact of the auto-prefix parameters on the number of visited terms!

          Michael McCandless added a comment -

          Another patch, with all real nocommits removed (the only ones that
          remain are for DEBUG prints... I'll remove those before committing).
          I think this is close.

          I picked a "gentle" (does not add too many auto-prefix terms) default
          of minItemsInPrefix=100 and maxItemsInPrefix=Integer.MAX_VALUE.

          With these defaults, on the simple "random longs" test, the increase
          in index size for DOCS_ONLY fields is small (123.4 MB vs 137.69 MB:
          ~12%), much smaller than current default precStep=16 for long
          NumericField (363.19 MB vs 137.69 MB: ~62% smaller) and search time is
          faster (10.5 sec vs 6.0 sec: ~43% faster). Indexing time is also ~2X
          faster (29.96 sec vs 61.13 sec) than using default NumericField, but
          ~48% slower (20.31 sec vs 29.96 sec) than indexing simple binary
          terms.

          I think it's OK to just turn this on by default for all DOCS_ONLY
          fields... apps can always disable by passing minItemsInPrefix=0 to
          Lucene41PF or BlockTreeTermsWriter if necessary.

          Note that this optimization currently only kicks in for TermRangeQuery
          and PrefixQuery. It's also possible to enable it for certain wildcard
          (e.g. foo?x*) and regexp (e.g. foo[m-p]*) queries but this can wait.

          There is plenty more to explore here in how auto-prefix terms are
          assigned, e.g. like the multi-level postings skip lists, we could have
          different thresholds for the "level 1" (the prefix matches real terms)
          skip lists vs "level 2+" (the prefix also matches other prefix terms),
          but we can iterate later / get feedback from users.

          With this change an application can simply index numbers as binary
          terms (carefully flipping the sign bit so negative numbers sort
          before) or as fixed-width terms (zero- or space-left-filled, e.g.,
          000042) and then run "ordinary" TermRangeQuery on them, and should see
          good improvements in index size, indexing time, search performance vs
          NumericRangeFilter.
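
The fixed-width alternative mentioned above is just left-zero-padding, so lexicographic term order coincides with numeric order (illustrative sketch; negative values would still need the sign-flip treatment):

```java
// Sketch: left-zero-pad non-negative ints to a fixed width so that String
// comparison of the terms agrees with numeric comparison, making an
// "ordinary" term range query behave like a numeric range query.
public class FixedWidthTerm {

  public static String encode(int value) {
    if (value < 0) throw new IllegalArgumentException("non-negative only");
    return String.format("%010d", value); // width 10 covers Integer.MAX_VALUE
  }
}
```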

          Michael McCandless added a comment -

          Hmm, one problem here is ... with this patch we are inserting
          auto-prefix terms for numeric fields which is sort of silly (prefix
          terms of prefix terms). Not sure if/how to prevent that... I guess we
          could do something hacky and try to detect when writing the postings
          that it's a numeric field ... hmm.

          I removed the DEBUG nocommits and ran a trunk vs patch perf test on
          all English wikipedia "medium" docs (33.3M) and it didn't look like
          there was any real change, just noise.

          Michael McCandless added a comment -

          New patch: I added an option (default off) to FieldType
          for the application to "hint" that the field should be
          indexed for range querying/filtering if possible.
          It's experimental, only allowed if the field is DOCS_ONLY, and the
          javadocs explain that this is just a hint (postings format can
          ignore it entirely).

          This value is propagated to the "auto write-once schema" via FieldInfo.getIndexRanges(),
          which has the same semantics as omitNorms ("once off, always off"). This
          means if you want to try out this new feature on a field you must
          fully re-index (or use FilterAtomicReader to reprocess your entire
          index).

          Since this is now an "opt-in" feature, I also made the defaults more
          aggressive (25/48: same defaults for pinching off a block of terms,
          i.e. one prefix term per leaf term block).

          At these defaults the index is 52% larger for random longs (187.08 MB vs
          baseline 123.4 MB) but 48% smaller than the default (precStep=16) numeric
          field, and search time is 45% faster (5790 msec vs 10466 msec with NF
          precStep=16). Indexing speed is about the same as NF...

          Separately I also tested the "sequential IDs" case, indexing 100M ids in
          left-zero-prefixed sequential form (0001, 0002, ...); normally one
          wouldn't enable indexRanges for such a field (the "normal" use case is
          just PK lookup) but I still wanted to see how auto-prefix terms behave
          on such a "dense" term space:

          • Index was 14.1% larger (874 MB to 997 MB)
          • Indexing time was 2.5X slower (227 sec to 562 sec)

          Net/net I think sequential IDs / densely packed terms is an "easy"
          case since block tree can easily hit the max (default now 48) terms in
          each auto-prefix term. Also the postings compress very well since the
          docIDs are all adjacent (I indexed with a single thread).

          Tests pass, "ant precommit" passes, I think it's ready.

          Robert Muir added a comment -

          Can experimental please be added to indexablefieldtype and everywhere else?

          I am afraid if even a single spot is missed, it could prevent a future refactoring or cleanup that removes this (which really in the future belongs as a codec parameter, but this is good for now).

          Michael McCandless added a comment -

          Can experimental please be added to indexablefieldtype and everywhere else?

          Woops I missed that one; I'll add it, and double check all the other places this boolean crawls to...

          Michael McCandless added a comment -

          I plan to commit this soon to 5.0... we can iterate on the "phase 2" items (making it easier during indexing to note that this field will be used for range searching).

          Uwe Schindler added a comment -

          And how about NRQ? The same? Currently it's unsupported, because we have to figure out how to auto-disable trie terms, and how to make NRQ auto-rewrite to a simple TermRange when no trie terms are available, only auto-prefix terms...

          Robert Muir added a comment -

          We should think about a migration plan for numerics?

          This should be a followup issue.

          Here are some thoughts.
          1. Keep the current trie "encoding" for terms; it just uses precisionStep=Inf and lets the term dictionary do it automatically.
          2. Create a FilterAtomicReader that, for a previously trie-encoded field, removes the "fake" terms on merge.

          Users could continue to use NumericRangeQuery, just with the infinite precision step, and it will always work; it will just execute slower for old segments as it doesn't take advantage of the trie terms that are not yet merged away.

          One issue to making it really nice is that Lucene doesn't know for sure that a field is numeric, so it cannot be "full-auto". Apps would have to use their schema or whatever to wrap with this reader in their merge policy.

          Maybe we could provide some sugar for this, such as a wrapping merge policy that takes a list of field names that are numeric, or sugar to pass this to IWC in IndexUpgrader to force it, and so on.

          I think it's complicated enough for a followup issue though.
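
          Step 2 of this plan could be approximated by a predicate like the following toy sketch (not real Lucene code; the SHIFT_START_LONG constant and term layout are recalled from NumericUtils and should be verified against the actual source before relying on them):

```java
// Toy sketch of "remove fake trie terms on merge". In Lucene's legacy
// long encoding (NumericUtils), the first byte of each term is
// SHIFT_START_LONG (0x20) plus the trie shift: shift == 0 marks a
// full-precision term, shift > 0 marks a precomputed prefix ("fake")
// term that a merge-time filter reader could drop.
public class TrieTermFilter {
    static final byte SHIFT_START_LONG = 0x20; // assumed value; verify in NumericUtils

    static boolean isFakeLongTrieTerm(byte[] term) {
        return term.length > 0 && term[0] > SHIFT_START_LONG;
    }
}
```

          A FilterAtomicReader wrapping the field's TermsEnum would simply skip terms for which this predicate returns true, leaving only full-precision terms for the new auto-prefix machinery.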

          David Smiley added a comment -

          Wow, awesome work Mike! And fantastic idea Adrien!

          I just read through the comments but don't have time to dig into the source yet.

          • Is the auto-prefixing done on a per-segment basis or is it something different that has to do with Codec internals? It appears to be the latter.
          • Is this applicable to variable-length String fields that you might want to do range queries on for whatever reason? Such as... A*, B*, C* or A-G, H-P, ... etc. ? It appears this is applicable.
          • Would any CompiledAutomaton (e.g. a wildcard query) that has a leading prefix benefit from this or is it strictly Prefix & Range queries? Mike's comments suggest it will sometime but not yet. Can you create an issue for it, Mike? This would be especially useful in Lucene-spatial; I'm excited at the prospects!
          • When you iterate a TermsEnum, will the prefix terms be exposed or is it internal to the Codec?
          David Smiley added a comment -

          Is the auto-prefixing done on a per-segment basis or is it something different that has to do with Codec internals? It appears to be the latter.

          I asked that in a confusing way. I mean, are the intervals that are computed from the data determined and fixed within a given segment, or is it variable throughout the segment?

          Michael McCandless added a comment -

          I like that plan Rob, except the term encoding that numeric field uses is somewhat wasteful: 1 byte used to encode the "shift", and then only 7 of 8 bits used for subsequent bytes ... but maybe we just live with that since it simplifies migration (both old and new can co-exist in one index).

          I'll open an issue for this.

          Michael McCandless added a comment -

          OK I created LUCENE-5966 for the migration plan...

          Michael McCandless added a comment -

          Wow, awesome work Mike! And fantastic idea Adrien!

          Thanks David Smiley

          I mean, are the intervals that are computed from the data determined and fixed within a given segment, or is it variable throughout the segment?

          It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms.

          Is this applicable to variable-length String fields that you might want to do range queries on for whatever reason? Such as... A*, B*, C* or A-G, H-P, ... etc. ? It appears this is applicable.

          I don't quite understand the question ... the indexed terms can be any variable length.

          Would any CompiledAutomaton (e.g. a wildcard query) that has a leading prefix benefit from this or is it strictly Prefix & Range queries? Mike's comments suggest it will sometime but not yet. Can you create an issue for it, Mike? This would be especially useful in Lucene-spatial; I'm excited at the prospects!

          Currently auto-prefix terms are only used for PrefixQuery and TermRangeQuery, or for any automaton query that "becomes" a PrefixQuery on rewrite (e.g. WildcardQuery("foo*")).

          Enabling them for WildcardQuery and RegexpQuery should be fairly easy, however they will only kick in in somewhat exotic situations, where there is a portion of the term space accepted by the automaton which "suddenly" accepts any suffix. E.g. foo*bar will never use auto-prefix terms, but foo?b* will.

          I'll open an issue!

          When you iterate a TermsEnum, will the prefix terms be exposed or is it internal to the Codec?

          No, these auto-prefix terms are invisible in all APIs, except when you call Terms.intersect.

          Michael McCandless added a comment -

          OK I opened LUCENE-5967 to allow Wildcard/RegexpQuery to use auto-prefix terms...

          David Smiley added a comment -

          Is this applicable to variable-length String fields that you might want to do range queries on for whatever reason? Such as... A*, B*, C* or A-G, H-P, ... etc. ? It appears this is applicable.

          I don't quite understand the question ... the indexed terms can be any variable length.

          I simply mean is this useful outside of numeric fields?

          Another question I have is, does the automatic prefix length calculation do it at byte boundaries or is it intra-byte?

          Michael McCandless added a comment -

          I simply mean is this useful outside of numeric fields?

          Oh, it's for any terms ... numeric or not.

          Another question I have is, does the automatic prefix length calculation do it at byte boundaries or is it intra-byte?

          It's currently byte-boundary only, though this is an impl detail and we could do e.g. nibbles in the future ...

          Michael McCandless added a comment -

          New patch, folding in feedback from Rob.

          First off, nothing (!!) was testing that you could use one of the
          BOOLEAN_QUERY_REWRITE methods in MTQ when auto-prefix terms were
          enumerated.

          When I added that, it trips this assert in ScoringRewrite:

                  assert reader.docFreq(term) == termStates[pos].docFreq();
          

          This trips when the term returned by IntersectTermsEnum is an
          auto-prefix term, in which case reader.docFreq cannot find the term
          since auto-prefix terms are invisible to all APIs except intersect.

          I fixed this by adding a hackish boolean isRealTerm to BlockTermState,
          and changed the assert to only check if it knows it's looking at real
          terms.

          However, I also had to fix AssertingAtomicReader.AssertingTermsEnum,
          to override seekExact(BytesRef, TermState) and termState() to delegate
          to the delegate (in) instead of to super ... super will fail because
          it falls back to seeking by term, which cannot work here because you
          can't seek by an auto-prefix term.

          This is a little scary ... because you are allowed to
          seekExact(BytesRef, TermState) for an auto-prefix term, meaning
          auto-prefix terms are in fact NOT invisible in this API. This is a
          possibly nasty trap, for any FilterAtomicReaders out there, that rely
          on super.... not sure what to do.

          Also, Rob noticed that CheckIndex no longer checks all bits ... this
          is pretty bad ... I put a nocommit to think about what to do ...

          David Smiley added a comment (edited) -

          Some more questions:

          It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms.

          So for the whole segment, does it decide to insert auto-prefix'es at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example if, hypothetically there were lots of terms that had this common prefix: "BGA" then it might decide "BGA" makes a good auto-prefix but not necessarily all terms at length 3 since many others might not make good prefixes. Make sense?

          At a low level, do I take advantage of this in the same way that I might do so at a high level using PrefixQuery and then getting the weight then getting the scorer to iterate docIds? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

          Do you envision other posting formats being able to re-use the logic here? That would be nice.

          In your future tuning, I suggest you give the ability to vary the conservative vs aggressive prefixing based on the very beginning and very end (assuming known common lengths). In the FlexPrefixTree Varun (GSOC) worked on, the leaves per level is configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries span massive portions of the possible term space (which in spatial is rare; likely other apps too). And at the bottom (long prefixes) just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm, the bottom detail is scan'ed over (e.g. TermsEnum.next()) instead of seek'ed, because the data is less dense and it's adjacent. This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.

          Show
          David Smiley added a comment - - edited Some more questions: It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms. So for the whole segment, does it decide to insert auto-prefix'es at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example if, hypothetically there were lots of terms that had this common prefix: "BGA" then it might decide "BGA" makes a good auto-prefix but not necessarily all terms at length 3 since many others might not make good prefixes. Make sense? At a low level, do I take advantage of this in the same way that I might do so at a high level using PrefixQuery and then getting the weight then getting the scorer to iterate docIds? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way. Do you envision other posting formats being able to re-use the logic here? That would be nice. In your future tuning, I suggest you give the ability to vary the conservative vs aggressive prefixing based on the very beginning and very end (assuming known common lengths). In the FlexPrefixTree Varun (GSOC) worked on, the leaves per level is configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries span massive portions of the possible term space (which in spatial is rare; likely other apps too). And at the bottom (long prefixes) just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm, the bottom detail is scan'ed over (e.g. TermsEnum.next()) instead of seek'ed, because the data is less dense and it's adjacent. 
This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.
          Hide
          Robert Muir added a comment -

          Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

          It's not that; it's that there is absolutely no reason for making TermsEnum more complex in this way. It would accomplish nothing and would not make anything more efficient.

          These APIs really should be minimal.

          Michael McCandless added a comment -

          So for the whole segment, does it decide to insert auto-prefix'es at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example if, hypothetically there were lots of terms that had this common prefix: "BGA" then it might decide "BGA" makes a good auto-prefix but not necessarily all terms at length 3 since many others might not make good prefixes. Make sense?

          It's dynamic, based on how terms occupy the space.

          Today (and we can change this: it's an impl. detail) it assigns
          prefixes just like it assigns terms to blocks. Ie, when it sees a
          given prefix matches 25 - 48 terms, it inserts a new auto-prefix
          term. That auto-prefix term replaces those 25 - 48 other terms with 1
          term, and then the process "recurses", i.e. you can then have a
          shorter auto-prefix term matching 25 - 48 other normal and/or
          auto-prefix terms.
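
          The counting rule described above can be modeled on strings. This is a toy sketch only, not the BlockTreeTermsWriter implementation (which works on byte terms and term blocks); the names are invented, and the 25/48 thresholds come from the defaults mentioned earlier in this issue:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the auto-prefix counting rule: scan sorted terms, and
// whenever a run sharing the same prefix of the given length falls
// inside [minItems, maxItems], replace the whole run with one
// auto-prefix term ("prefix*"). Re-running with shorter prefixes shows
// the "recursion" where a prefix term can cover other prefix terms.
public class AutoPrefix {
    static List<String> collapse(List<String> sortedTerms, int prefixLen, int minItems, int maxItems) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sortedTerms.size()) {
            String t = sortedTerms.get(i);
            if (t.length() < prefixLen) { // too short to share this prefix
                out.add(t);
                i++;
                continue;
            }
            String prefix = t.substring(0, prefixLen);
            int j = i;
            while (j < sortedTerms.size() && sortedTerms.get(j).startsWith(prefix)) {
                j++; // extend the run of terms sharing this prefix
            }
            if (j - i >= minItems && j - i <= maxItems) {
                out.add(prefix + "*"); // one auto-prefix term replaces the run
                i = j;
            } else {
                out.add(t); // run too small (or too big): keep the term as-is
                i++;
            }
        }
        return out;
    }
}
```

          For example, collapsing ["aaa", "aab", "aac", "abz", "xyz"] at prefix length 2 (with min=2) yields ["aa*", "abz", "xyz"], and a second pass at length 1 yields ["a*", "xyz"], with "a*" covering the earlier auto-prefix term.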

          At a low level, do I take advantage of this in the same way that I might do so at a high level using PrefixQuery and then getting the weight then getting the scorer to iterate docIds? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

          What this patch does is generalize/relax Terms.intersect: that method
          no longer guarantees that you see "true" terms. Rather, it now
          guarantees only that the docIDs you visit will be the same.

          So to take advantage of it, you need to pass an Automaton to
          Terms.intersect and then not care about which terms you see, only the
          docIDs after visiting the DocsEnum across all terms it returned to
          you.
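
          A toy model of this relaxed contract (invented names, not the Lucene API): if an auto-prefix term's postings are the union of the postings of the terms it replaces, then a consumer that only unions docIDs over whatever terms intersect returns sees the same result either way:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the relaxed Terms.intersect contract: the caller unions
// docIDs over the returned terms and never needs to know whether a
// given term is "real" or an auto-prefix stand-in.
public class IntersectContract {
    static Set<Integer> unionDocs(Map<String, Set<Integer>> postings, Iterable<String> returnedTerms) {
        Set<Integer> docs = new HashSet<>();
        for (String t : returnedTerms) {
            docs.addAll(postings.get(t)); // visit each returned term's postings
        }
        return docs;
    }
}
```

          If "fo*" carries the union of the postings of "foo" and "for", enumerating {"fo*"} and enumerating {"foo", "for"} produce identical docID sets, which is exactly the only guarantee the relaxed intersect keeps.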

          Do you envision other posting formats being able to re-use the logic here? That would be nice.

          I agree it would be nice ... and the index-time logic that identifies
          the auto-prefix terms is quite standalone, so e.g. we could pull it
          out and have it "wrap" the incoming Fields to insert the auto-prefix
          terms. This way it's nearly transparent to any postings format ...

          But the problem is, at search time, there's tricky logic in
          intersect() to use these prefix terms ... factoring that out so other
          formats can use it is trickier I think... though maybe we could fold
          it into the default Terms.intersect() impl...

          In your future tuning, I suggest you give the ability to vary the conservative vs aggressive prefixing based on the very beginning and very end (assuming known common lengths). In the FlexPrefixTree Varun (GSOC) worked on, the leaves per level is configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries span massive portions of the possible term space (which in spatial is rare; likely other apps too). And at the bottom (long prefixes) just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm, the bottom detail is scan'ed over (e.g. TermsEnum.next()) instead of seek'ed, because the data is less dense and it's adjacent. This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.

          I agree this (how auto-prefix terms are assigned) needs more control /
          experimenting. Really the best prefixes are a function not only of
          how the terms were distributed, but also of how queries will "likely"
          ask for ranges.

          I think it's similar to postings skip lists, where we have different
          frequency of a skip pointer on the "leaf" level vs the "upper" skip
          levels.

          Michael McCandless added a comment -

          New patch, just resolving conflicts from recent trunk changes ...

          I have an idea on improving CheckIndex to ferret out these auto prefix terms .... I'll explore it.

          Michael McCandless added a comment -

          New patch, enhancing CheckIndex to test term ranges of varying sizes
          (number of terms). This uncovered a back-compat bug in the patch,
          now fixed. This isn't guaranteed to touch all auto-prefix terms but
          it should touch most.

          Net/net I think we should commit what we have here now, and get
          jenkins testing it. The new FieldType option is marked experimental
          (and has warnings in the javadocs) ... there are many things still to
          explore, and we need to sort out the migration from NumericField, but
          I think it's important to get this starting point to where users can
          try it out and then we iterate from there.

          Tests pass, precommit passes.

          I plan to commit soon...

          Michael McCandless added a comment -

          But the problem is, at search time, there's tricky logic in intersect() to use these prefix terms ...

          One approach we might be able to take here is to more strongly
          separate out the auto-prefix terms from the ordinary terms (e.g., so
          they feel like two different fields, in the impl), such that intersect
          becomes something like a merge between the postings format's "normal"
          intersect, with these auto-prefix terms pulling from a shadow field.
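The "merge" idea can be sketched as an ordinary two-way merge of sorted term streams (a plain-Java illustration with hypothetical data; the class and method names are not Lucene APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeTermsDemo {
  // Sketch of the shadow-field idea: the "normal" terms and the auto-prefix
  // terms live in two sorted streams, and intersect() walks them together in
  // term order, like one step of a two-way merge sort.
  static List<String> merge(List<String> normal, List<String> autoPrefix) {
    List<String> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < normal.size() && j < autoPrefix.size()) {
      if (normal.get(i).compareTo(autoPrefix.get(j)) <= 0) {
        out.add(normal.get(i++));
      } else {
        out.add(autoPrefix.get(j++));
      }
    }
    while (i < normal.size()) out.add(normal.get(i++));   // drain remainder
    while (j < autoPrefix.size()) out.add(autoPrefix.get(j++));
    return out;
  }

  public static void main(String[] args) {
    List<String> normal = List.of("apple", "apricot", "banana");
    List<String> prefixes = List.of("ap"); // a hypothetical auto-prefix term
    System.out.println(merge(normal, prefixes)); // [ap, apple, apricot, banana]
  }
}
```

The real intersect() would presumably merge TermsEnums over byte[] terms rather than strings, but the ordering logic would be similar.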

          If we did this then other postings formats could also more easily add
          auto-prefix handling, and maybe even the default impl for
          Terms.intersect could use auto-prefix terms ...

          But we can try this later; this is a big enough change already.

          Robert Muir added a comment -

          One approach we might be able to take here is to more strongly
          separate out the auto-prefix terms from the ordinary terms (e.g., so
          they feel like two different fields, in the impl), such that intersect
          becomes something like a merge between the postings format's "normal"
          intersect, with these auto-prefix terms pulling from a shadow field.

          If we did this then other postings formats could also more easily add
          auto-prefix handling, and maybe even the default impl for
          Terms.intersect could use auto-prefix terms ...

          But we can try this later; this is a big enough change already.

          I agree it can be deferred, but I am worried that it will get dropped completely.

          Here, in order to make this easy to use, the feature is "temporarily" placed in fieldinfos with a warning.
          However, this is kind of screwed up when it's currently a blocktree impl detail, and, well,
          it's kinda complicated stuff to implement (no offense!). Realistically, if in 3 months, blocktree is still
          the only impl, how will people feel if I want to nuke this fieldinfo and fieldtype?

          Will they be -1 if I break spatial/solr/whatever that starts using it (possibly even their source-code compile)?
          Because I think the following situations are ok:

          Result #1

          • fieldinfo option (so it's a "lucene feature", not a codec-specific feature)
          • "generic" impl so that all codecs actually support said feature, like everything else in fieldinfo

          Result #2

          • codec-specific option (so it's just a blocktree option)

          But the following is not an ok "permanent situation":
          Result #3

          • fieldinfo option (so it advertises as a "lucene feature", not a codec-specific feature)
          • however, only blocktree actually supports the option
          • impossible to move on to a next-gen terms dict because it doesn't support it
          • impossible to remove from fieldinfo because e.g. .document api/spatial/solr/old indexes/whatever rely upon it

          How can we prevent this from happening?

          Robert Muir added a comment -

          Patch looks good except for FixedBitSetTermsEnum. What is this doing? Can we remove it? I think it's bogus how it does 'super(null)'; its superclass should not even allow such a thing.

          Michael McCandless added a comment -

          Thanks for reviewing Rob.

          How can we prevent this from happening?

          I think we shouldn't add the FI option at this time? We should only add it once we make it more generic so that all codec impls can easily support it?

          Michael McCandless added a comment -

          Patch looks good except for FixedBitSetTermsEnum. What is this doing? Can we remove it? I think its bogus how it does 'super(null)', its superclass should not even allow such a thing.

          I agree it's abusing FilterTermsEnum ... I'll fix these FilterLeafReader.FilterXXX classes to barf if they get null.

          We need this stub class because of the API the terms dict uses when asking the postings format to write one term's postings: we pass TermsEnum to PostingsWriterBase.writeTerm. This is e.g. for PF's that may want to iterate docs/positions multiple times when writing one term ...

          I'll fix it to directly subclass TermsEnum and override all methods...

          Robert Muir added a comment -

          We need this stub class because of the API the terms dict uses when asking the postings format to write one term's postings: we pass TermsEnum to PostingsWriterBase.writeTerm. This is e.g. for PF's that may want to iterate docs/positions multiple times when writing one term ...

          I think that's the root of the problem causing my confusion. I guess TermsEnum is OK here, but it's much more than a "DocsEnum that you can rewind" (and not obvious about that!). I think that's why I freaked out when I saw FixedBitSetTermsEnum passing null and only implementing one method; it just didn't make a lot of sense.

          Michael McCandless added a comment -

          Maybe we can somehow change the PostingsWriterBase API, so it only gets a "thingy that lets you pull docs/docsAndPositions as many times as you want" ... but I don't think that should block committing here?

          Robert Muir added a comment -

          I think it's enough to just add a comment to explain what is happening.

          It's similar to seeing a "BitsetHashMap" or something: people are going to be very confused unless they have the 'rewind' explanation above.

          Michael McCandless added a comment -

          OK I'll add a comment explaining it...

          Michael McCandless added a comment -

          I think we shouldn't add the FI option at this time?

          New patch with FieldType.setIndexRanges removed, but I don't think we
          should commit this approach: the feature is basically so ridiculously
          expert to use that only like 3 people in the world will figure out
          how.

          Sure, the servers built on top of Lucene can expose a simple API,
          since they "know" the schema and can open up an "index for range
          searching" boolean on a field and validate you are using a PF that
          supports that... but I don't think it's right/fair to make new, strong
          features of Lucene ridiculously hard to use by direct Lucene users.

          It's wonderful Lucene has such pluggable codecs now, letting users
          explore all sorts of custom formats, etc., but the nasty downside of
          all this freedom is that new, complex features like this one, which
          offer powerful improvements to the default codec that 99% of Lucene
          users would have used, must either be implemented across the board for
          all codecs (a very tall order) in order to have an intuitive API, or
          must be exposed only via ridiculously expert codec-specific APIs.

          I don't think either choice is acceptable.

          So ... I tried exploring an uber helper/utility class, that lets you
          add "optimized for range/prefix" fields to docs, and "spies" on you to
          determine which fields should then use a customized PF, and then gives
          you sugar APIs to build range/prefix/equals queries... but even as
          best/simple as I can make this class it still feels way too
          weird/heavy/external/uncomittable.

          Maybe we should just go back to the "always index auto-prefix terms on
          DOCS_ONLY fields" even though 1) I had to then choose "weaker"
          defaults (less added index size; less performance gains), and 2) it's
          a total waste to add such terms to NumericFields and probably spatial
          fields which already build their own prefixes outside of Lucene. This
          is not a great solution either...

          Robert Muir added a comment -

          but the nasty downside of
          all this freedom is that new, complex features like this one, which
          offer powerful improvements to the default codec that 99% of Lucene
          users would have used, must either be implemented across the board for
          all codecs (a very tall order) in order to have an intuitive API, or
          must be exposed only via ridiculously expert codec-specific APIs.

          I don't think it's a downside of the freedom; it's just other problems.

          However, there are way, way too many experimental codecs. These are even more costly to maintain than backwards ones in some ways: they are rotated in all tests! For many recent changes I have spent just as much time fixing those as I have on backwards codecs. If we ever want to provide backwards compatibility for experimental codecs (it seems to constantly confuse users that we can't do this), then we have to tone them down anyway.

          The existing trie-encoding is difficult to use, too. I don't think it should serve as your example for this feature. Remember that simple numeric range queries don't work with QP without the user doing subclassing, and numerics don't really work well with the parser at all because the analyzer is completely unaware of them (because, for some crazy reason, it is implemented as a tokenstream/special fields rather than as a more "ordinary" analysis-chain integration).

          The .document API is overengineered. I don't understand why it needs to be so complicated. Because it has already taken on more than it can chew, it's impossible to even think about how it could work with the codec API, and I think this is causing a lot of your frustration.

          The whole way that "lucene is schemaless" is fucking bogus, and only means that it's on you, the user, to record and manage and track all this stuff yourself. It's no freedom to anyone, just pain. For example, we don't even know which fields have trie-encoded terms here, to do any kind of nice migration strategy from "old numerics" to this at all. That's really sad and will cause users just more pain and confusion.

          FieldInfo is a hardcore place to add an experimental option when we aren't even sure how it should behave yet (e.g. should it really be limited to DOCS_ONLY? Who knows?).

          I can keep complaining too, and we can rant about this stuff on this issue, but maybe you should commit what you have (yes, with the crappy hard-to-use codec option) so we can try to do something on another issue instead.

          Michael McCandless added a comment -

          I think we can salvage parts of this patch by crumbling it up and breaking out separately committable pre-cursors ... I opened LUCENE-5879 to make it easier to index a single byte[] token.

          After that maybe we could break out the CheckIndex improvements for testing the Codec's impl of .intersect, adding Automata.makeBinaryRange.

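For reference, the predicate that a binary-range automaton accepts can be stated directly with unsigned lexicographic comparison (an illustrative sketch only; Automata.makeBinaryRange builds an automaton for this rather than comparing byte arrays):

```java
import java.util.Arrays;

public class BinaryRangeDemo {
  // A binary range [min, max] accepts exactly the byte strings between its
  // endpoints under unsigned lexicographic order; here that predicate is
  // evaluated directly instead of via an automaton.
  static boolean inRange(byte[] term, byte[] min, boolean minInclusive,
                         byte[] max, boolean maxInclusive) {
    int cmpMin = Arrays.compareUnsigned(term, min);
    int cmpMax = Arrays.compareUnsigned(term, max);
    boolean aboveMin = minInclusive ? cmpMin >= 0 : cmpMin > 0;
    boolean belowMax = maxInclusive ? cmpMax <= 0 : cmpMax < 0;
    return aboveMin && belowMax;
  }

  public static void main(String[] args) {
    byte[] min = {0x10};
    byte[] max = {(byte) 0x90};
    System.out.println(inRange(new byte[] {0x50}, min, true, max, true));        // true
    // 0xC0 falls above 0x90 only because the comparison is unsigned
    System.out.println(inRange(new byte[] {(byte) 0xC0}, min, true, max, true)); // false
  }
}
```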
          But there's still a huge gaping API hole, which is how the app can "easily"/"cleanly" notify Lucene that it intends to do range filtering on a given field: this is still a hard open question.

          Michael McCandless added a comment -

          I think for now we should simply commit auto-prefix as a dark
          feature (my last patch, which I hate!) ... progress not perfection.

          I.e. it's available only as an optional, disabled by default, postings
          format, and if you want to use it in your app you must use PerFieldPF
          to enable it for certain fields. You'll have to figure out how to
          index byte[] tokens, etc.

          I modernized the last patch (tons of stuff had changed) and carried
          over a fix for an issue I hit in LUCENE-6005. Patch applies to trunk,
          and bumps block tree's index format.

          With auto-prefix, numeric fields can simply be indexed as their
          obvious byte[] encoding and you just use TermRangeQuery at search
          time; TestAutoPrefix shows this.
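A minimal sketch of such an "obvious byte[] encoding" for longs (not Lucene's actual NumericUtils code): flip the sign bit and write big-endian, so that unsigned byte order matches numeric order and a TermRangeQuery over the bytes behaves like a numeric range:

```java
import java.util.Arrays;

public class SortableLongDemo {
  // Flip the sign bit so unsigned byte order matches signed numeric order,
  // then write big-endian: the resulting byte[]s sort like the longs do.
  static byte[] encodeLong(long v) {
    long sortable = v ^ Long.MIN_VALUE; // flip sign bit
    byte[] out = new byte[8];
    for (int i = 0; i < 8; i++) {
      out[i] = (byte) (sortable >>> (56 - 8 * i));
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE};
    for (int i = 1; i < values.length; i++) {
      // unsigned byte-wise comparison must agree with numeric comparison
      if (Arrays.compareUnsigned(encodeLong(values[i - 1]), encodeLong(values[i])) >= 0) {
        throw new AssertionError("order not preserved at " + values[i]);
      }
    }
    System.out.println("numeric order preserved");
  }
}
```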

          There may be some problems with the patch: is it horrible to store
          CompiledAutomaton on PrefixQuery/TermRangeQuery because of query
          caching...? Should I instead recompute it for every segment in
          .getTermsEnum? Or store it on the weight? Hmm or in a shared
          attribute, like FuzzyQuery (what a hack)?

          It allocates one FixedBitSet(maxDoc) at write time, per segment, to
          hold all docs matching each auto-prefix term ... maybe that's too
          costly? I could switch to more sparse impls (roaring, sparse,
          BitDocIdSet.Builder?) but I suspect typically we will require fairly
          dense bitsets anyway for the short prefixes. We end up OR'ing many
          terms together at write time...
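The write-time accumulation can be sketched like this (java.util.BitSet standing in for FixedBitSet; the data and method here are hypothetical):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBitsDemo {
  // OR the doc IDs of every term sharing a prefix into one bitset: an
  // auto-prefix term's postings are the union of the postings of the
  // terms it stands in for.
  static BitSet docsForPrefix(Map<String, int[]> termDocs, String prefix, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    for (Map.Entry<String, int[]> e : termDocs.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        for (int doc : e.getValue()) {
          bits.set(doc); // OR this term's docs in
        }
      }
    }
    return bits;
  }

  public static void main(String[] args) {
    Map<String, int[]> termDocs = new TreeMap<>(); // terms in sorted order
    termDocs.put("foobar", new int[] {1, 4});
    termDocs.put("foobaz", new int[] {2, 4});
    termDocs.put("quux", new int[] {3});
    System.out.println(docsForPrefix(termDocs, "foo", 8)); // {1, 2, 4}
  }
}
```

A sparse builder (as discussed below for BitDocIdSet.Builder) would change only the accumulation structure, not this union logic.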

          I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package
          private under oal.index, so I can send the bit set to PostingsConsumer
          at write time. Maybe there's a cleaner way?

          Maybe the changes should be moved to lucene/misc or lucene/codecs, not
          core? But this would mean yet another fork of block tree...

          It only works for IndexOptions.DOCS fields; I think that's fine?

          The added auto-prefix terms are not seen by normal postings APIs, they
          do not affect index stats, etc. They only kick in, as an
          implementation detail, when you call Terms.intersect(Automaton). The
          returned TermsEnum.term() can return an auto-prefix term, but
          LUCENE-5938 improves this since we now use
          MultiTermQueryConstantScoreWrapper by default.

          Robert Muir added a comment -

          There are maybe some problems with the patch: is it horrible to store
          CompiledAutomaton on PrefixQuery/TermRangeQuery because of query
          caching...? Should I instead recompute it for every segment in
          .getTermsEnum? Or store it on the weight? Hmm or in a shared
          attribute, like FuzzyQuery (what a hack)?

          Why would this be an issue? It's still index-independent; e.g., AutomatonQuery does this too.

          It allocates one FixedBitSet(maxDoc) at write time, per segment, to
          hold all docs matching each auto-prefix term ... maybe that's too
          costly? I could switch to more sparse impls (roaring, sparse,
          BitDocIdSet.Builder?) but I suspect typically we will require fairly
          dense bitsets anyway for the short prefixes. We end up OR'ing many
          terms together at write time...

          If you use BitDocIdSet.Builder, I think it works well either way. SparseFixedBitSet also has optimized or(DISI).

          It only works for IndexOptions.DOCS fields; I think that's fine?

          Yes, I think so, since the user gets an exception if they screw this up.

          I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package
          private under oal.index, so I can send the bit set to PostingsConsumer
          at write time. Maybe there's a cleaner way?

          I don't understand why these need to be tied to FixedBitSet. Seems like they could at least use the more generic BitSet API (nothing fixed-specific about them).

          Maybe the changes should be moved to lucene/misc or lucene/codecs, not
          core? But this would mean yet another fork of block tree...

          Alternatively, the code could stay where it is. Lucene50PF wires zeros and doesn't have any options for now. In codecs/ we could have an AutoPrefixPF that exposes it and mark it experimental or something? This way, when we feel comfortable, we can "expose" it in the default index format by adding ctors there and removing the experimental one.

          Robert Muir added a comment -

          The added auto-prefix terms are not seen by normal postings APIs, they
          do not affect index stats, etc. They only kick in, as an
          implementation detail, when you call Terms.intersect(Automaton). The
          returned TermsEnum.term() can return an auto-prefix term, but
          LUCENE-5938 improves this since we now use
          MultiTermQueryConstantScoreWrapper by default.

          I can't parse this. If you use a scoring rewrite, it still "works", right? It's just that the generated TermQueries will contain pseudo-terms, but their statistics etc. are all correct?
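The behavior being discussed can be sketched with a toy model (hypothetical data structures, not Lucene internals): when a prefix query goes through Terms.intersect, the terms dict can hand back a single auto-prefix term covering a whole region instead of every real term under that prefix, while normal postings APIs never see the pseudo-term.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a flat list stands in for the terms dictionary,
// with a flag marking indexed auto-prefix (pseudo) terms.
public class IntersectSketch {
  static final class IndexedTerm {
    final String text;
    final boolean autoPrefix; // true for an indexed auto-prefix pseudo-term
    IndexedTerm(String text, boolean autoPrefix) {
      this.text = text;
      this.autoPrefix = autoPrefix;
    }
  }

  // Terms visited for a prefix query: one auto-prefix term if it covers the
  // whole region, otherwise each matching real term.
  static List<String> visit(List<IndexedTerm> dict, String prefix) {
    for (IndexedTerm t : dict) {
      if (t.autoPrefix && t.text.equals(prefix)) {
        return List.of(t.text + "*"); // one pseudo-term replaces many real terms
      }
    }
    List<String> out = new ArrayList<>();
    for (IndexedTerm t : dict) {
      if (!t.autoPrefix && t.text.startsWith(prefix)) {
        out.add(t.text);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<IndexedTerm> dict = List.of(
        new IndexedTerm("foo", true),   // auto-prefix term for foo*
        new IndexedTerm("foo1", false),
        new IndexedTerm("foo2", false),
        new IndexedTerm("goo", false));
    System.out.println(visit(dict, "foo")); // one pseudo-term visited
    System.out.println(visit(dict, "go"));  // no auto-prefix term: scan real terms
  }
}
```

A scoring rewrite built on top of such a visit would produce TermQueries over the pseudo-terms, which is exactly the question raised above.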

          Robert Muir added a comment -

          Can you fill in more detail about CompiledAutomaton.PREFIX?

          I definitely understand the RANGE case; it's difficult to make the equivalent automaton. But for the PREFIX case, it's trivial. Why not just make PrefixQuery subclass AutomatonQuery?

          Michael McCandless added a comment -

          Thanks Rob.

          If you use BitDocIdSet.Builder, I think it works well either way. SparseFixedBitSet also has optimized or(DISI).

          I coded this up, but then this Builder is angry because FreqProxDocsEnum throws an UnsupportedOperationException from its cost method, because cost is actually not easy to implement (we don't track docFreq for each term inside IW's RAM buffer). I think for now we should just keep reusing a single FixedBitSet?

          Looking at BitDocIdSet.Builder, it looks like it could do better here? E.g. use this cost as just an approximation, but then, since it had to .or() the actual set bits, record that as the "true" cost?
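The write-time OR'ing under discussion can be modeled in a few lines (java.util.BitSet stands in for FixedBitSet; this is a hedged sketch, not the actual IndexWriter code): the postings of every term sharing a prefix are OR'd into one bitset, which becomes the auto-prefix term's postings, and its docFreq is the cardinality, i.e. the number of docs containing any term with that prefix.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Sketch: union per-term doc-ID sets for a prefix region of a sorted term dict.
public class PrefixPostingsUnion {
  // subMap gives the contiguous prefix region of the sorted term dictionary;
  // appending Character.MAX_VALUE bounds the region from above.
  static BitSet unionForPrefix(TreeMap<String, BitSet> termToDocs, String prefix) {
    BitSet result = new BitSet(); // one dense bitset, like reusing a single FixedBitSet
    for (BitSet docs : termToDocs.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
      result.or(docs);
    }
    return result;
  }

  static BitSet docs(int... ids) {
    BitSet b = new BitSet();
    for (int id : ids) {
      b.set(id);
    }
    return b;
  }

  public static void main(String[] args) {
    TreeMap<String, BitSet> index = new TreeMap<>();
    index.put("foo1", docs(0, 3));
    index.put("foo2", docs(3, 7)); // doc 3 holds both terms: counted once
    index.put("goo", docs(1));
    // docFreq of the "foo" auto-prefix term = docs containing ANY foo* term
    System.out.println(unionForPrefix(index, "foo").cardinality()); // 3
  }
}
```

For short prefixes the union tends to be dense, which is why a single reused dense bitset is a reasonable default over sparse builders.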

          Michael McCandless added a comment -

          I don't understand why these need to be tied to FixedBitSet

          I'll cutover to BitSet.

          Alternatively, code could stay where it is

          I'll leave it as is (dark addition to BlockTree).

          In codecs/ we could have an AutoPrefixPF that exposes it and make it experimental or something?

          Good idea! I'll do this... this way Lucene50PF is unchanged.

          I can't parse this. If you use a scoring rewrite, it still "works", right?

          It does "work" (you get the right hits) ... TestPrefixQuery/TestTermRangeQuery
          randomly use SCORING_BOOLEAN_REWRITE and
          CONSTANT_SCORE_BOOLEAN_REWRITE.

          It's just that the generated TermQueries will contain pseudo-terms, but their statistics etc. are all correct?

          Right: they will use the auto-prefix terms, which have "correct" stats
          (i.e. docFreq is number of docs containing any term with this
          prefix). Is this too weird? It means you get different scores than
          you get today...

          We could maybe turn off auto-prefix if you use these rewrite methods?
          But this would need an API change to Terms, e.g. a new boolean
          allowAutoPrefix to Terms.intersect.

          I definitely understand the RANGE case; it's difficult to make the equivalent automaton.

          It's not so bad; I added Operations.makeBinaryInterval in the patch
          for this. It's like the decimal ranges that Automata.makeInterval already does.
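A conceptual model of what such a binary interval automaton accepts: exactly the byte strings t with minTerm <= t <= maxTerm under unsigned lexicographic order (the order of the terms dictionary). This sketch checks the predicate directly rather than building automaton states, so it describes the accepted language, not the construction.

```java
// Hedged sketch: models the accept predicate of a binary interval, not the
// actual automaton construction in Operations/Automata.
public class BinaryIntervalSketch {
  // Unsigned lexicographic comparison, matching BytesRef ordering.
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length; // on a shared prefix, the shorter string sorts first
  }

  static boolean inInterval(byte[] min, byte[] max, byte[] term) {
    return compare(min, term) <= 0 && compare(term, max) <= 0;
  }

  public static void main(String[] args) {
    byte[] min = {0x10};
    byte[] max = {0x7F, 0x05};
    System.out.println(inInterval(min, max, new byte[] {0x42}));        // true
    System.out.println(inInterval(min, max, new byte[] {0x7F, 0x06}));  // false: past max
    System.out.println(inInterval(min, max, new byte[] {(byte) 0x80})); // false: 0x80 > 0x7F unsigned
  }
}
```

The unsigned comparison is the detail that distinguishes this from naive byte[] comparison in Java, whose bytes are signed.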

          Why not just make PrefixQuery subclass AutomatonQuery?

          I explored this, but it turns out to be tricky for those PFs that
          don't have auto-prefix terms (i.e. that don't use block tree)...

          I.e., with the patch as it is now, PFs like SimpleText will use a
          PrefixTermsEnum for PrefixQuery, but if I fix PrefixQuery to subclass
          AutomatonQuery (and remove AUTOMATON_TYPE.PREFIX) then SimpleText
          would use AutomatonTermsEnum (on a prefix automaton) which I think
          will be somewhat less efficient? Maybe it's not so bad in practice? ATE
          would realize it's in a "linear" part of the automaton...

          Maybe we can somehow simplify things here ... I agree both PrefixQuery
          and TermRangeQuery should ideally just subclass AutomatonQuery.

          Robert Muir added a comment -

          I.e., with the patch as it is now, PFs like SimpleText will use a
          PrefixTermsEnum for PrefixQuery, but if I fix PrefixQuery to subclass
          AutomatonQuery (and remove AUTOMATON_TYPE.PREFIX) then SimpleText
          would use AutomatonTermsEnum (on a prefix automaton) which I think
          will be somewhat less efficient? Maybe it's not so bad in practice? ATE
          would realize it's in a "linear" part of the automaton...

          We cannot continue writing code in this way.

          Please let intersect take care of how to intersect and get this shit out of the Query. The default Terms.intersect() method can specialize the PREFIX case with a PrefixTermsEnum if it is faster.
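The specialization idea can be sketched without any Lucene types (a TreeSet stands in for the terms dictionary; tailSet plays the role of seekCeil, both assumptions of this toy model): for the PREFIX case, intersect can seek straight to the prefix in the sorted dict and scan linearly, since all matching terms are contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: the PREFIX specialization is just seek + linear scan over the
// contiguous region of matching terms in a sorted dictionary.
public class PrefixScanSketch {
  static List<String> prefixTerms(TreeSet<String> termDict, String prefix) {
    List<String> out = new ArrayList<>();
    for (String t : termDict.tailSet(prefix)) { // seek to first term >= prefix
      if (!t.startsWith(prefix)) {
        break; // left the contiguous prefix region: stop scanning
      }
      out.add(t);
    }
    return out;
  }

  public static void main(String[] args) {
    TreeSet<String> dict = new TreeSet<>(List.of("apple", "apply", "banana", "band"));
    System.out.println(prefixTerms(dict, "app")); // [apple, apply]
  }
}
```

A general automaton walk over the same dictionary would reach the same terms; the point of the specialization is skipping the automaton machinery when the region is known to be one contiguous run.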

          Michael McCandless added a comment -

          We cannot continue writing code in this way.

          Please let intersect take care of how to intersect and get this shit out of the Query. The default Terms.intersect() method can specialize the PREFIX case with a PrefixTermsEnum if it is faster.

          Can you maybe be more specific? I'm having trouble following exactly what you're objecting to.

          Terms.intersect default impl is already specializing to PrefixTermsEnum in the patch.

          You don't want the added ctor that takes a prefix term in CompiledAutomaton, but you are OK with PREFIX/RANGE in CA.AUTOMATON_TYPE?

          If I 1) remove the added ctor that takes the prefix term in CA, and 2) fix PrefixQuery to subclass AutomatonQuery (meaning CA must "autodetect" when it receives a prefix automaton), would that address your concerns? Or something else...?

          I still wonder if just using AutomatonTermsEnum for prefix/range will be fine. Then we don't need PREFIX nor RANGE in CA.AUTOMATON_TYPE.

          I'll open a separate issue for this...

          Robert Muir added a comment -

          Yes, everything you propose makes sense (especially the last point you make, that would be fantastic!)

          At a high level, I just feel that this is the second use case where the codec can do "special" stuff with intersect, and that we should be removing these specializations from our code and just passing the structure to the codec. I do realize this is already messy in trunk, but I think we need to remove a lot of this complexity.

          At the very least, I think PrefixQuery shouldn't be a "backdoor automaton query".

          Michael McCandless added a comment -

          New patch, folding in changes after LUCENE-6367, and also cutting over
          TermRangeQuery to AutomatonQuery.

          Now the changes to CompiledAutomaton are minimal, just the addition of
          the sinkState.

          I have a few minor nocommits left ... otherwise I think this is
          ready.

          Michael McCandless added a comment -

          New patch, fixing the nocommits. I think it's ready ... I'll beast tests for a while on it.

          I don't think we should rush this into 5.1.

          Adrien Grand added a comment -

          +1 to the patch

          I don't think we should rush this into 5.1.

          +1

          Robert Muir added a comment -

          We should be able to add a trivial test to lucene/codecs that extends BasePostingsFormatTestCase for this new PF? What about "putting it into rotation" in RandomCodec? These things would give us a lot of testing.

          Robert Muir added a comment -

          +1 to the patch. I can add the tests later if you want. but they should be trivial.

          Michael McCandless added a comment -

          Another iteration, adding a test case as Rob suggested. It was somewhat tricky because BasePFTestCase tests all IndexOptions but this new PF only supports DOCS.

          So, I factored out a RandomPostingsTester (in test-framework) from BasePFTestCase, which lets you specify which IndexOptions to test, and then added TestAutoPrefixPF to use that.

          This process managed to find a couple latent bugs in BasePostingsFormatTestCase's SeedPostings!

          Tests seem to pass ... I ran distributed beasting for 17 iters.

          I think it's ready.

          Robert Muir added a comment -

          +1

          I thought those tests were gonna be easy... but the refactoring of the test is great. Thanks!

          ASF subversion and git services added a comment -

          Commit 1670918 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1670918 ]

          LUCENE-5879: add auto-prefix terms to block tree, and experimental AutoPrefixTermsPostingsFormat

          Michael McCandless added a comment -

          I committed to trunk ... I'll let it bake a bit before backporting to 5.x (5.2).

          ASF subversion and git services added a comment -

          Commit 1670923 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1670923 ]

          LUCENE-5879: fix test compilation (this enum no longer exists)

          Michael McCandless added a comment -

          Woops ... the "term range checking" I added to CheckIndex here is way, way too costly: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

          That's on an index that has zero auto-prefix terms ...

          I'll turn this off for now and mull how to fix it. We could at least skip this when that field has no auto-prefix terms, but it will still be costly when the field does have auto-prefix terms because it does a rather exhaustive comparison of the "full" term space iteration (does not use auto-prefix terms) against the Terms.intersect one (does use them)....

          ASF subversion and git services added a comment -

          Commit 1671380 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1671380 ]

          LUCENE-5879: turn off too-slow term range checking for now

          Michael McCandless added a comment -

          I'll backport this to 5.2 soon ... trunk jenkins hasn't uncovered any serious issues so far ...

          ASF subversion and git services added a comment -

          Commit 1671765 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1671765 ]

          LUCENE-5879: add auto-prefix terms to block tree, and experimental AutoPrefixTermsPostingsFormat

          ASF subversion and git services added a comment -

          Commit 1671766 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1671766 ]

          LUCENE-5879: move CHANGES entry to 5.2.0

          ASF subversion and git services added a comment -

          Commit 1672037 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1672037 ]

          LUCENE-5879: fix corner case in auto-prefix intersect

          ASF subversion and git services added a comment -

          Commit 1672042 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1672042 ]

          LUCENE-5879: fix corner case in auto-prefix intersect

          ASF subversion and git services added a comment -

          Commit 1672262 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1672262 ]

          LUCENE-5879: fix empty string corner case

          ASF subversion and git services added a comment -

          Commit 1672263 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1672263 ]

          LUCENE-5879: fix empty string corner case

          ASF subversion and git services added a comment -

          Commit 1672749 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1672749 ]

          LUCENE-5879: turn off too-slow term range checking for now

          ASF subversion and git services added a comment -

          Commit 1673009 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1673009 ]

          LUCENE-5879: fix finite case of Automata.makeBinaryIterval, improve tests

          ASF subversion and git services added a comment -

          Commit 1673075 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1673075 ]

          LUCENE-5879: fix ob1 that caused OOME in test when min and max auto-prefix terms was 2; attempt to simplify empty string case

          ASF subversion and git services added a comment -

          Commit 1674027 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1674027 ]

          LUCENE-5879: fix test bug: we cannot enforce max term count for empty-string prefix query since we [intentionally] do not create an empty-string auto-prefix term at index time

          ASF subversion and git services added a comment -

          Commit 1674029 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1674029 ]

          LUCENE-5879: fix test bug: we cannot enforce max term count for empty-string prefix query since we [intentionally] do not create an empty-string auto-prefix term at index time

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless