Lucene - Core
  1. Lucene - Core
  2. LUCENE-5127

FixedGapTermsIndex should use monotonic compression

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      for the addresses in the big in-memory byte[] and disk blocks, we could save a good deal of RAM here.

      I think this codec just never got upgraded when we added these new packed improvements, but it might be interesting to try to use for the terms data of sorted/sortedset DV implementations.

      patch works, but has nocommits and currently ignores the divisor. The annoying problem there being that we have the shared interface with "get(int)" for PackedInts.Mutable/Reader, but no equivalent base class for monotonics get(long)...

      Still its enough that we could benchmark/compare for now.

      1. LUCENE-5127.patch
        143 kB
        Robert Muir
      2. LUCENE-5127.patch
        143 kB
        Robert Muir
      3. LUCENE-5127.patch
        94 kB
        Robert Muir
      4. LUCENE-5127.patch
        6 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          +1

          I think we should just nuke the divisor?

          Show
          Michael McCandless added a comment - +1 I think we should just nuke the divisor?
          Hide
          Robert Muir added a comment -

          Maybe, though we could also add a minimal get(long) interface to blockpacked/monotonicblockpacked/appending/monotonicappending.

          A few notes:

          • Current patch changes both the disk offsets (termsDictOffsets) and the offsets into the in-ram terms data (termOffsets)
          • With the current patch as-is, we could remove the interval*2B #terms limitation, as long addressing is used everywhere.
          • Current patch saves RAM, savings increase as termsindex/termsdict gets larger. With 10M:
            Checkout TIB TII
            Trunk 519329144 19300603
            Patch 519329144 14149524
          • Current patch slows down seek-heavy queries a bit:
                                Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                            PKLookup       86.02      (2.9%)       76.17      (2.4%)  -11.4% ( -16% -   -6%)
                             Respell       39.76      (3.0%)       36.58      (2.5%)   -8.0% ( -13% -   -2%)
                              Fuzzy2       35.49      (4.1%)       32.88      (2.6%)   -7.3% ( -13% -    0%)
                              Fuzzy1       31.49      (4.1%)       29.18      (2.6%)   -7.3% ( -13% -    0%)
            
          • termOffsets are read twice per seek / binary search iteration:
                  final long offset = fieldIndex.termOffsets.get(idx);
                  final int length = (int) (fieldIndex.termOffsets.get(1+idx) - offset);
            
          • termsDictOffsets are only read once... and this is really just an unfortunate consequence of TermsIndexReaderBase's API... ideally they would lazy-decode this until you really needed it, like BlockTree.

          So I see a few things we could do:

          1. go forward with current patch (maybe add the divisor stuff via a simple get() interface). clean up int->long everywhere. I'm not sure if these perf diffs matter for the use cases where someone needs an ord-enabled terms index?
          2. hybrid patch, where termOffsets stay "absolute" but termDictOffsets use monotonicpacked. This would still save some space, but restore the seek-heavy perf. But then we wouldnt be able to cleanup int->long and so on.
          3. do nothing, maybe "fork" the logic of this thing so it can be used in DV. For how DV is used, it'd be the right tradeoff so its no issue there.
          Show
          Robert Muir added a comment - Maybe, though we could also add a minimal get(long) interface to blockpacked/monotonicblockpacked/appending/monotonicappending. A few notes: Current patch changes both the disk offsets (termsDictOffsets) and the offsets into the in-ram terms data (termOffsets) With the current patch as-is, we could remove the interval*2B #terms limitation, as long addressing is used everywhere. Current patch saves RAM, savings increase as termsindex/termsdict gets larger. With 10M: Checkout TIB TII Trunk 519329144 19300603 Patch 519329144 14149524 Current patch slows down seek-heavy queries a bit: Task QPS trunk StdDev QPS patch StdDev Pct diff PKLookup 86.02 (2.9%) 76.17 (2.4%) -11.4% ( -16% - -6%) Respell 39.76 (3.0%) 36.58 (2.5%) -8.0% ( -13% - -2%) Fuzzy2 35.49 (4.1%) 32.88 (2.6%) -7.3% ( -13% - 0%) Fuzzy1 31.49 (4.1%) 29.18 (2.6%) -7.3% ( -13% - 0%) termOffsets are read twice per seek / binary search iteration: final long offset = fieldIndex.termOffsets.get(idx); final int length = ( int ) (fieldIndex.termOffsets.get(1+idx) - offset); termsDictOffsets are only read once... and this is really just an unfortunate consequence of TermsIndexReaderBase's API... ideally they would lazy-decode this until you really needed it, like BlockTree. So I see a few things we could do: go forward with current patch (maybe add the divisor stuff via a simple get() interface). clean up int->long everywhere. I'm not sure if these perf diffs matter for the use cases where someone needs an ord-enabled terms index? hybrid patch, where termOffsets stay "absolute" but termDictOffsets use monotonicpacked. This would still save some space, but restore the seek-heavy perf. But then we wouldnt be able to cleanup int->long and so on. do nothing, maybe "fork" the logic of this thing so it can be used in DV. For how DV is used, it'd be the right tradeoff so its no issue there.
          Hide
          Robert Muir added a comment -

          I made some progress...

          Finally clean up divisor and interval, which are only confusing to users since they have done nothing in the default codec for so long: and in 5.x we dont have to read any preflex indexes.

          this makes interval a codec parameter for fixedgap and so on (like blocktree's min/max). this is cleaner and more flexible anyway, because it means e.g. if you use one of these codecs you can specify it per-field in the usual ways rather than globally for the whole index.

          the fieldcache-like divisor is gone. As far as the special -1 value, i didnt yet clean this up, but i see two directions. The best IMO is to nuke the mergeReader shit from ReadersAndLiveDocs completely. Otherwise we keep it and codecs can do special shit based on IOContext, but in all cases we dont need a special param.

          tests are passing (at least once). More cleanups are needed to some of the codec impls, and some of the special case tests for corner-case bugs in the past (e.g. TII0+empty field name) should really be moved to fix-gap specific unit tests.

          Show
          Robert Muir added a comment - I made some progress... Finally clean up divisor and interval, which are only confusing to users since they have done nothing in the default codec for so long: and in 5.x we dont have to read any preflex indexes. this makes interval a codec parameter for fixedgap and so on (like blocktree's min/max). this is cleaner and more flexible anyway, because it means e.g. if you use one of these codecs you can specify it per-field in the usual ways rather than globally for the whole index. the fieldcache-like divisor is gone. As far as the special -1 value, i didnt yet clean this up, but i see two directions. The best IMO is to nuke the mergeReader shit from ReadersAndLiveDocs completely. Otherwise we keep it and codecs can do special shit based on IOContext, but in all cases we dont need a special param. tests are passing (at least once). More cleanups are needed to some of the codec impls, and some of the special case tests for corner-case bugs in the past (e.g. TII0+empty field name) should really be moved to fix-gap specific unit tests.
          Hide
          ASF subversion and git services added a comment -

          Commit 1507035 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507035 ]

          LUCENE-5127: create branch

          Show
          ASF subversion and git services added a comment - Commit 1507035 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507035 ] LUCENE-5127 : create branch
          Hide
          ASF subversion and git services added a comment -

          Commit 1507036 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507036 ]

          LUCENE-5127: dump current state

          Show
          ASF subversion and git services added a comment - Commit 1507036 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507036 ] LUCENE-5127 : dump current state
          Hide
          ASF subversion and git services added a comment -

          Commit 1507041 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507041 ]

          LUCENE-5127: randomize codec parameter

          Show
          ASF subversion and git services added a comment - Commit 1507041 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507041 ] LUCENE-5127 : randomize codec parameter
          Hide
          ASF subversion and git services added a comment -

          Commit 1507054 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507054 ]

          LUCENE-5127: fix solr tests

          Show
          ASF subversion and git services added a comment - Commit 1507054 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507054 ] LUCENE-5127 : fix solr tests
          Hide
          Michael McCandless added a comment -

          This cleanup is awesome, thanks Rob!

          I think we should just nuke the special -1 "don't load terms index" value?

          Show
          Michael McCandless added a comment - This cleanup is awesome, thanks Rob! I think we should just nuke the special -1 "don't load terms index" value?
          Hide
          ASF subversion and git services added a comment -

          Commit 1507067 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507067 ]

          LUCENE-5127: nuke mergeReader

          Show
          ASF subversion and git services added a comment - Commit 1507067 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507067 ] LUCENE-5127 : nuke mergeReader
          Hide
          ASF subversion and git services added a comment -

          Commit 1507070 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507070 ]

          LUCENE-5127: simplify seek-within-block

          Show
          ASF subversion and git services added a comment - Commit 1507070 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507070 ] LUCENE-5127 : simplify seek-within-block
          Hide
          ASF subversion and git services added a comment -

          Commit 1507075 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507075 ]

          LUCENE-5127: explicit var gap testing part 1

          Show
          ASF subversion and git services added a comment - Commit 1507075 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507075 ] LUCENE-5127 : explicit var gap testing part 1
          Hide
          ASF subversion and git services added a comment -

          Commit 1507078 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507078 ]

          LUCENE-5127: explicit var gap testing part 2

          Show
          ASF subversion and git services added a comment - Commit 1507078 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507078 ] LUCENE-5127 : explicit var gap testing part 2
          Hide
          ASF subversion and git services added a comment -

          Commit 1507083 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507083 ]

          LUCENE-5127: simplify vargap

          Show
          ASF subversion and git services added a comment - Commit 1507083 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507083 ] LUCENE-5127 : simplify vargap
          Hide
          ASF subversion and git services added a comment -

          Commit 1507086 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507086 ]

          LUCENE-5127: simplify fixedgap

          Show
          ASF subversion and git services added a comment - Commit 1507086 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507086 ] LUCENE-5127 : simplify fixedgap
          Hide
          ASF subversion and git services added a comment -

          Commit 1507087 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507087 ]

          LUCENE-5127: fix indent

          Show
          ASF subversion and git services added a comment - Commit 1507087 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507087 ] LUCENE-5127 : fix indent
          Hide
          ASF subversion and git services added a comment -

          Commit 1507097 from Michael McCandless in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507097 ]

          LUCENE-5127: add tests

          Show
          ASF subversion and git services added a comment - Commit 1507097 from Michael McCandless in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507097 ] LUCENE-5127 : add tests
          Hide
          ASF subversion and git services added a comment -

          Commit 1507111 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507111 ]

          LUCENE-5127: clear nocommits

          Show
          ASF subversion and git services added a comment - Commit 1507111 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507111 ] LUCENE-5127 : clear nocommits
          Hide
          ASF subversion and git services added a comment -

          Commit 1507116 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507116 ]

          LUCENE-5127: fix TestLucene40PF and clean up some more outdated stuff

          Show
          ASF subversion and git services added a comment - Commit 1507116 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507116 ] LUCENE-5127 : fix TestLucene40PF and clean up some more outdated stuff
          Hide
          ASF subversion and git services added a comment -

          Commit 1507118 from Michael McCandless in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507118 ]

          LUCENE-5127: fix false fail when terms dict is a ghostbuster

          Show
          ASF subversion and git services added a comment - Commit 1507118 from Michael McCandless in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507118 ] LUCENE-5127 : fix false fail when terms dict is a ghostbuster
          Hide
          ASF subversion and git services added a comment -

          Commit 1507120 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507120 ]

          LUCENE-5127: clean up error msgs

          Show
          ASF subversion and git services added a comment - Commit 1507120 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507120 ] LUCENE-5127 : clean up error msgs
          Hide
          ASF subversion and git services added a comment -

          Commit 1507179 from Robert Muir in branch 'dev/branches/lucene5127'
          [ https://svn.apache.org/r1507179 ]

          LUCENE-5127: use less ram when writing the terms index

          Show
          ASF subversion and git services added a comment - Commit 1507179 from Robert Muir in branch 'dev/branches/lucene5127' [ https://svn.apache.org/r1507179 ] LUCENE-5127 : use less ram when writing the terms index
          Hide
          Robert Muir added a comment -

          Patch for trunk, i think its ready.

          Show
          Robert Muir added a comment - Patch for trunk, i think its ready.
          Hide
          Michael McCandless added a comment -

          +1, patch looks great. Thanks Rob!

          Show
          Michael McCandless added a comment - +1, patch looks great. Thanks Rob!
          Hide
          Adrien Grand added a comment -

          This is a very nice cleanup! In FixedGapTermsIndexWriter, I think we could improve the buffering of offsets and addresses by directly buffering into a MonotonicBlockPackedWriter over a RamOutputStream, and then copy the raw content of the RamOutputStream to the IndexOutput? This would avoid an extra encoding/decoding step.

          Show
          Adrien Grand added a comment - This is a very nice cleanup! In FixedGapTermsIndexWriter, I think we could improve the buffering of offsets and addresses by directly buffering into a MonotonicBlockPackedWriter over a RamOutputStream, and then copy the raw content of the RamOutputStream to the IndexOutput? This would avoid an extra encoding/decoding step.
          Hide
          Robert Muir added a comment -

          Good idea! I initially thought of the growableoutput, but i didnt want all the resizing. I think a RamOutputStream can work well.

          Show
          Robert Muir added a comment - Good idea! I initially thought of the growableoutput, but i didnt want all the resizing. I think a RamOutputStream can work well.
          Hide
          Robert Muir added a comment -

          patch with RAMOutputStream approach (so we don't compress/uncompress/recompress)

          Show
          Robert Muir added a comment - patch with RAMOutputStream approach (so we don't compress/uncompress/recompress)
          Hide
          Adrien Grand added a comment -

          +1

          Show
          Adrien Grand added a comment - +1
          Hide
          ASF subversion and git services added a comment -

          Commit 1508147 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1508147 ]

          LUCENE-5127: FixedGapTermsIndex should use monotonic compression

          Show
          ASF subversion and git services added a comment - Commit 1508147 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1508147 ] LUCENE-5127 : FixedGapTermsIndex should use monotonic compression
          Hide
          Robert Muir added a comment -

          resolving for trunk only. I think the situation is already confusing in 4.x and backporting seems risky...

          Show
          Robert Muir added a comment - resolving for trunk only. I think the situation is already confusing in 4.x and backporting seems risky...
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development