Lucene - Core
LUCENE-2588

terms index should not store useless suffixes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This idea came up when discussing w/ Robert how to improve our terms index...

      The terms dict index today simply grabs whatever term was at a 0 mod 128 index (by default).

      But this is wasteful because you often don't need the suffix of the term at that point.

      EG if the 127th term is aa and the 128th (indexed) term is abcd123456789, instead of storing that full term you only need to store ab. The suffix is useless, and uses up RAM since we load the terms index into RAM.

      The patch is very simple. The optimization is particularly easy because terms are now byte[] and we sort in binary order.

      I tested on the first 10M 1 KB Wikipedia docs, and this reduces the terms index (tii) file from 3.9 MB to 3.3 MB, ~15% smaller (using StandardAnalyzer, indexing the body field tokenized but the title / date fields untokenized). I expect on noisier terms dicts, especially ones w/ bad terms accidentally indexed, that the savings will be even more.

      In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so, we could instead pick terms more carefully, say "approximately" every N, but favor terms that have a smaller net prefix. We can also index more sparsely in regions where the net docFreq is lowish, since we can afford somewhat higher seek+scan time to these terms since enuming their docs will be much faster.
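The core trick can be sketched as a standalone method. This is an illustrative sketch, not the actual patch, and the class and method names are made up; it assumes terms are byte[] in binary sort order with the indexed term sorting after the prior term, as the description states:

```java
// Illustrative sketch of suffix stripping (not the actual patch; names here
// are hypothetical). Given the term just before an index point ("prior") and
// the term selected for the index ("indexed"), return the shortest prefix
// length of "indexed" that still sorts strictly after "prior".
public class PrefixTrim {
    static int minDistinguishingPrefix(byte[] prior, byte[] indexed) {
        final int limit = Math.min(prior.length, indexed.length);
        for (int i = 0; i < limit; i++) {
            if (prior[i] != indexed[i]) {
                return i + 1; // keep everything up to the first differing byte
            }
        }
        // "indexed" extends "prior", so one byte past the common prefix
        // (capped at indexed's own length) is enough
        return Math.min(1 + prior.length, indexed.length);
    }

    public static void main(String[] args) {
        // The example from the description: 127th term "aa", 128th (indexed)
        // term "abcd123456789" -- only the prefix "ab" needs to be stored.
        int len = minDistinguishingPrefix("aa".getBytes(), "abcd123456789".getBytes());
        System.out.println(new String("abcd123456789".getBytes(), 0, len)); // prints "ab"
    }
}
```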

      1. LUCENE-2588.patch
        4 kB
        Simon Willnauer
      2. LUCENE-2588.patch
        4 kB
        Michael McCandless
      3. LUCENE-2588.patch
        4 kB
        Michael McCandless
      4. LUCENE-2588.patch
        3 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless added a comment -

          Attached patch.

          I also fixed int numTerms -> long numTerms which was a latent bug.

          Yonik Seeley added a comment -

          Ah sweet, nice idea!

          Robert Muir added a comment -

          I think this patch as-is is a good improvement (at least as a defensive measure against "noise" terms and other things). It also seems to buy more savings on the non-Latin data I tested (60 kB -> 40 kB). +1 to commit.

          In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so, we could instead pick terms more carefully, say "approximately" every N, but favor terms that have a smaller net prefix

          I think we should explore this in the future. Blindly selecting every N terms isn't optimal; allowing a "fudge" of the interval, maybe +/- 5 or 10%, could intentionally select terms that differ very quickly from their previous term, without wasting a bunch of CPU or unbalancing the terms index...

          if additional smarts like this could save enough size on average maybe we could rethink lowering the default interval of 128?

          Michael McCandless added a comment -

          if additional smarts like this could save enough size on average maybe we could rethink lowering the default interval of 128?

          +1

          With flex we've already very substantially reduced the RAM cost, GC load, and init time of the terms index. We no longer create TermInfo, Term, and String objects per indexed term.

          So, I think even without these further mods we could drop the default index interval to maybe 32.

          Michael McCandless added a comment -

          New patch, changes default TII to 32.

          Michael McCandless added a comment -

          Allowing the term index interval to be variable (eg to find "short min prefix" terms to further improve savings from this opto, or to balance index terms by net docFreq of terms in each block) makes seek-by-ord more complex.

          Ie, today, because indexed terms are every N, seek by ord is a straight divide/mod. But with variable spaced index terms, I think we'd have to hold the ord of each indexed term in RAM, probably as packed int array so added RAM cost shouldn't be too bad. Then a lookup by ord would require a binary search then scan.
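The two lookup schemes described above can be sketched as follows; these names are illustrative, not Lucene APIs:

```java
import java.util.Arrays;

// Illustrative sketch only; not Lucene code. Contrasts seek-by-ord with a
// fixed index interval vs. variable-spaced index terms with stored ords.
public class SeekByOrd {
    // Fixed interval: indexed terms occur every N terms, so the index slot
    // and the number of terms to scan past it are a straight divide/mod.
    static int[] fixedIntervalSeek(long ord, int interval) {
        int slot = (int) (ord / interval);
        int toScan = (int) (ord % interval);
        return new int[]{slot, toScan};
    }

    // Variable spacing: the ord of each indexed term is held in RAM (in
    // practice as a packed int array); lookup is a binary search, then a scan
    // from the index term at or before the requested ord.
    static int[] variableSeek(long ord, long[] indexedOrds) {
        int slot = Arrays.binarySearch(indexedOrds, ord);
        if (slot < 0) {
            slot = -slot - 2; // insertion point minus one: term at or before ord
        }
        int toScan = (int) (ord - indexedOrds[slot]);
        return new int[]{slot, toScan};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fixedIntervalSeek(70, 32)));                  // [2, 6]
        System.out.println(Arrays.toString(variableSeek(70, new long[]{0, 30, 65, 100}))); // [2, 5]
    }
}
```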

          Robert Muir added a comment -

          But with variable spaced index terms, I think we'd have to hold the ord of each indexed term in RAM, probably as packed int array so added RAM cost shouldn't be too bad. Then a lookup by ord would require a binary search then scan.

          Well, I was suggesting this mainly to try to justify lowering the default TII, but if we can lower it separately as you propose, keeping TII constant and not having to deal with this tradeoff, I think that's great!

          Michael McCandless added a comment -

          Unfortunately, it's not quite as simple to do this on 3.x.

          On trunk, an indexed term points to the file pointer for that term's entry in the tis file, and we then read the absolute (term and frq/prx pointers) at that entry.

          But in 3.x, all entries are deltas (even the seek points), and, the indexed term points to the entry after itself in the tis file, which requires that the term in memory be the full term (unless we change 3.x's index format to also store the absolutes at the seek points).

          Simon Willnauer added a comment -

          Reopening this because this optimization is not safe for all BytesRef comparators. Reverse unicode order already breaks it. See LUCENE-2622 for details. The non-distinguishable suffix must be determined by the actually-used Comparator; otherwise the assumption can be wrong for non-standard sort orders.

          Robert Muir added a comment -

          Should we really change StandardCodec to support this [non-binary order]?

          Really, if you have anything but regular unicode order, other things in Lucene will break too, such as queries.
          The test just doesn't exercise them. Try changing the order of PreFlexCodec's comparator...

          Can't we just fix the test not to use StandardCodec? I mean we aren't taking any feature away here.

          Simon Willnauer added a comment -

          Should we really change StandardCodec to support this [non-binary order]?

          I'm not sure if we should, but we should at least document the limitation. People who work at that level do read doc strings; if they don't, let them be doomed. But if you run into the bug we had in LUCENE-2622 you will have a super hard time figuring out what is going on without knowing Lucene very, very well.

          Can't we just fix the test not to use StandardCodec? I mean we aren't taking any feature away here.

          +1 I think we should fix this test ASAP, either by using byte sort order or by adding some MockCodec (what Robert has suggested).

          Michael McCandless added a comment -

          After I commit the simple renaming of standard codec's terms dicts (LUCENE-2647), I plan to make this suffix-stripping opto private to StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method that can alter the indexed term before it's written).

          Since StandardCodec hardwires the term sort to unicode order, the opto is safe there.

          In general, if a codec uses a different term sort (such as this test's codec) it's conceivable a different opto could apply. EG I think this test could prune suffix based on the term after the index term. But, it makes no sense to spend time exploring this until a "real" use case arrives... this is just a simple test to assert that a codec is in fact free to customize the sort order.

          Also, there are other fun optos we could explore w/ the terms index. EG we could "wiggle" the index term selection a bit, so it wouldn't be fixed to every N, to try to find terms that are small after removing the useless suffix. Separately, we could choose index terms according to docFreq – eg one simple policy would be to plant an index term on term X if either 1) term X's docFreq is over a threshold, or, 2) it's been > N terms since the last indexed term. This could be a powerful way to even further reduce RAM usage of the terms index, because it'd ensure that high-cost terms (ie, many docs/freqs/positions to visit) are in fact fast to look up. The low-freq terms can afford a higher seek time since it'll be so fast to enum the docs.
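The docFreq-based selection policy sketched in the last paragraph could look roughly like this; the class and names are hypothetical, not Lucene code:

```java
// Hypothetical sketch of the selection policy described above: plant an
// index term if the term's docFreq is over a threshold, or if it has been
// more than N terms since the last indexed term.
public class IndexTermPolicy {
    private final int docFreqThreshold;
    private final int maxInterval;
    private int sinceLast = 0;

    IndexTermPolicy(int docFreqThreshold, int maxInterval) {
        this.docFreqThreshold = docFreqThreshold;
        this.maxInterval = maxInterval;
    }

    // Called once per term, in term order; returns true if this term
    // should be added to the in-RAM terms index.
    boolean isIndexTerm(int docFreq) {
        if (docFreq >= docFreqThreshold || sinceLast >= maxInterval) {
            sinceLast = 0;
            return true;
        }
        sinceLast++;
        return false;
    }

    public static void main(String[] args) {
        IndexTermPolicy p = new IndexTermPolicy(1000, 32);
        System.out.println(p.isIndexTerm(5000)); // high docFreq -> indexed
        System.out.println(p.isIndexTerm(3));    // low docFreq, interval not reached
    }
}
```

High-cost terms are always cheap to seek, while runs of low-freq terms fall back to the every-N behavior.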

          Simon Willnauer added a comment -

          After I commit the simple renaming of standard codec's terms dicts (LUCENE-2647), I plan to make this suffix-stripping opto private to StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method that can alter the indexed term before it's written).

          Mike what about factoring out a method like

          protected short indexTermPrefixLen(BytesRef lastTerm, BytesRef currentTerm) {
            ...
          }

          Then we can simply override that method if there is a comparator which cannot utilize this opto, or which breaks it?

          Michael McCandless added a comment -

          I think that's good! I'll take that approach.

          Robert Muir added a comment -

          Also, there are other fun optos we could explore w/ terms index. EG we could "wiggle" the index term selection a bit, so it wouldn't be fixed to every N, to try to find terms that are small after removing the useless suffix. Separately, we could choose index terms according to docFreq - eg one simple policy would be to plant an index term on term X if either 1) term X's docFreq is over a threshold, or, 2) it's been > N terms since the last indexed terms. This could be a powerful way to even further reduce RAM usage of the terms index, because it'd ensure that high cost terms (ie, many docs/freqs/positions to visit) are in fact fast to lookup. The low freq terms can afford a higher seek time since it'll be so fast to enum the docs.

          It would be great to come up with a heuristic that balances all three of these. Selecting every 32nd term is silly if it would give you "abracadabra" when the previous term is "a" and a fudge would give you a smaller index term (of course it depends, too, on what the next index term would be, and on the docFreq optimization).

          It sounds tricky, but right now we are just selecting index terms with no basis at all (essentially randomly), and then trying to deal with bad selections by trimming wasted suffixes, etc.

          Michael McCandless added a comment -

          Attached patch – factors out the opto into a separate method (indexedTermPrefixLength), and leaves the opto enabled by default. Then I fixed TestExternalCodecs to override & disable the opto, and the test then passes w/ the seed that failed before.

          Simon Willnauer added a comment -

          Mike, this looks good; just a little nit-picking:

          protected int indexedTermPrefixLength(final BytesRef priorTerm, final BytesRef indexedTerm) {
            // As long as the codec sorts terms in unicode codepoint
            // order, we can safely strip off the non-distinguishing
            // suffix to save RAM in the loaded terms index.
            final int idxOffset = indexedTerm.offset;
            final int priorOffset = priorTerm.offset;
            final int limit = Math.min(priorTerm.length, indexedTerm.length);
            for (int byteIdx = 0; byteIdx < limit; byteIdx++) {
              if (priorTerm.bytes[priorOffset + byteIdx] != indexedTerm.bytes[idxOffset + byteIdx]) {
                return byteIdx + 1;
              }
            }
            return Math.min(1 + priorTerm.length, indexedTerm.length);
          }
          

          I know this is not super hot code, but maybe we want to save the Math.min() method call when it isn't necessary, and the public member deref is also an unnecessary indirection. With a low termIndexInterval and lots of long terms, like shingles, we might save some cycles.

          I mean, it wouldn't make a big difference; just mentioning it though.

          Michael McCandless added a comment -

          Sure – we can fix those. You wanna take a stab?

          Simon Willnauer added a comment -

          Attached nit-picky iteration. Both failures from LUCENE-2622 pass now. I will commit soon if nobody objects.

          Simon Willnauer added a comment -

          Committed revision 998675.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes: 0
              Watchers: 0
