Lucene - Core
LUCENE-3729

Allow using FST to hold terms data in DocValues.BYTES_*_SORTED

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 5.0
    • Component/s: None
    • Lucene Fields:
      New
    1. LUCENE-3729.patch
      27 kB
      Simon Willnauer
    2. LUCENE-3729.patch
      14 kB
      Simon Willnauer
    3. LUCENE-3729.patch
      7 kB
      Michael McCandless
    4. LUCENE-3729.patch
      6 kB
      Michael McCandless
    5. LUCENE-3729.patch
      4 kB
      Michael McCandless

      Issue Links

        Activity

        Michael McCandless added a comment -

        Prototype patch just for testing...

        As a quick test for viability here... I hacked
        FieldCacheImpl.DocTermsIndexImpl to build an FST to map term <-> ord,
        and changed the lookup method to use the new Util.getByOutput method.

        Then I tested perf on 10M docs from Wikipedia:

                Task    QPS base  StdDev base   QPS fstfc  StdDev fstfc    Pct diff
                 TermGroup1M       47.75        1.59       25.75        0.36  -48% -  -43%
                TermBGroup1M       17.10        0.58       14.20        0.37  -21% -  -11%
                    PKLookup      158.73        6.07      155.84        3.00   -7% -    4%
               TermTitleSort       43.49        2.54       42.73        1.84  -11% -    8%
                     Respell       81.13        3.24       80.67        3.83   -8% -    8%
                        Term      106.13        3.59      106.03        1.28   -4% -    4%
              TermBGroup1M1P       25.31        0.44       25.37        0.54   -3% -    4%
                      Fuzzy2       55.32        1.21       55.76        2.55   -5% -    7%
                      Fuzzy1       74.06        1.21       74.88        2.80   -4% -    6%
                SloppyPhrase        9.82        0.61        9.95        0.42   -8% -   12%
                    SpanNear        3.39        0.16        3.47        0.15   -6% -   12%
                      Phrase        9.29        0.69        9.66        0.69  -10% -   20%
                    Wildcard       20.15        0.66       21.23        0.46    0% -   11%
                 AndHighHigh       13.43        0.55       14.24        0.70   -3% -   15%
                     Prefix3       10.05        0.53       10.70        0.19    0% -   14%
                  AndHighMed       56.62        3.36       60.54        4.28   -6% -   21%
                   OrHighMed       25.78        0.98       27.75        1.51   -1% -   17%
                  OrHighHigh       10.97        0.41       11.82        0.63   -1% -   17%
                      IntNRQ        9.74        0.81       10.83        0.26    0% -   24%
        

        Two-pass grouping took a big hit... and single-pass grouping a moderate
        hit... but TermTitleSort was a minor slowdown, which is good news.

        The net RAM required across all segs for the title field FST was 30.2
        MB, vs 46.5 MB for the current FieldCache terms storage (PagedBytes +
        PackedInts), which is ~35% less.

        The FST for the group-by fields had quite a bit larger (~60%) RAM
        usage than PagedBytes + PackedInts, because these fields are actually
        randomly generated unicode strings...

        I didn't make the change to use the FST for term -> ord lookup (have
        to fix the binarySearchLookup method), but we really should do this
        "for real" because it's doing an unnecessary binary search (repeated
        ord -> term lookup) now. Ie, perf should be better than
        above... grouping is a heavier user of binarySearchLookup than sorting
        so it should help recover some of that slowdown.

        Also, Util.getByOutput currently doesn't optimize for array'd
        arcs... so if we fix that we should get some small perf gain.

        To do this "for real" I think we should do it only with DocValues,
        because the FST build time is relatively costly.
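The binary-search-vs-direct-lookup point above can be sketched with a toy term dictionary. This is a minimal stdlib sketch: plain Java collections stand in for Lucene's FST (the real code uses Util.getByOutput for ord -> term and an FST traversal for term -> ord); the class and method names here are illustrative, not Lucene API.

```java
import java.util.*;

// Toy model of a per-segment term dictionary. binarySearchLookup-style
// term -> ord does O(log n) repeated ord -> term lookups; a direct
// lookup (standing in for a forward FST traversal) resolves the ord in
// one pass, which is why doing it "for real" should recover perf.
class ToyTermDict {
    private final String[] sortedTerms;           // ord -> term
    private final Map<String, Integer> termToOrd; // direct term -> ord

    ToyTermDict(String[] terms) {
        sortedTerms = terms.clone();
        Arrays.sort(sortedTerms);
        termToOrd = new HashMap<>();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            termToOrd.put(sortedTerms[ord], ord);
        }
    }

    // ord -> term: what Util.getByOutput does over the FST.
    String lookupTerm(int ord) {
        return sortedTerms[ord];
    }

    // term -> ord via binary search: each probe is an ord -> term lookup.
    int ordByBinarySearch(String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    // term -> ord via a single direct traversal.
    int ordByDirectLookup(String term) {
        Integer ord = termToOrd.get(term);
        return ord == null ? -1 : ord;
    }
}
```

Both lookups return the same ord; the difference is only in how many ord -> term materializations each one costs per call.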

        Michael McCandless added a comment -

        Updated patch to use FST for term -> ord lookup during 2 pass grouping... much better results:

                Task    QPS base  StdDev base   QPS fstfc  StdDev fstfc    Pct diff
                 TermGroup1M       45.69        1.25       43.43        0.90   -9% -    0%
                    PKLookup      164.02        9.00      157.12        6.23  -12% -    5%
                     Respell       64.28        2.75       61.96        2.48  -11% -    4%
                SloppyPhrase        5.99        0.26        5.79        0.35  -13% -    7%
                      Fuzzy2       54.07        2.47       52.56        1.83  -10% -    5%
                      Phrase       14.97        0.16       14.61        0.44   -6% -    1%
               TermTitleSort       39.53        0.56       38.71        1.93   -8% -    4%
                   OrHighMed       32.91        1.33       32.27        1.48  -10% -    6%
                  OrHighHigh       15.10        0.62       14.83        0.68   -9% -    7%
                  AndHighMed       53.49        0.56       52.53        2.02   -6% -    3%
                    SpanNear       14.28        0.21       14.04        0.28   -5% -    1%
                      Fuzzy1       60.37        1.25       59.39        1.65   -6% -    3%
                 AndHighHigh        9.15        0.12        9.01        0.25   -5% -    2%
                TermBGroup1M       33.53        0.61       33.07        0.77   -5% -    2%
                      IntNRQ       10.16        0.47       10.04        0.90  -14% -   12%
                     Prefix3       40.44        0.59       40.44        1.99   -6% -    6%
                    Wildcard       35.38        0.63       35.55        1.49   -5% -    6%
              TermBGroup1M1P       16.54        0.34       16.78        0.54   -3% -    6%
                        Term      100.39        2.64      103.13        3.69   -3% -    9%
        
        Robert Muir added a comment -

        Interesting that it hurts sort and grouping so little... some questions:

        • don't you think this should be 4.0's default sorted bytes implementation?
        • alternatively, since this test hurts sort so little, maybe we could try
          every n'th (e.g. 128) term going into the FST? So this would just be a
          'sort index' like the terms index, augmenting the existing sorted bytes
          (you could still load em all up in ram for other purposes, like merging).
          But we could avoid this (mostly) for sort and hopefully resolve most
          comparisons with the fst "index", maybe only rarely going to the DirectSource...
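The sampled "sort index" idea above can be sketched with plain arrays, under stated assumptions: the full term array stands in for the DirectSource, the sampled array for the FST "index", and the class name and interval handling are illustrative, not a real implementation.

```java
import java.util.*;

// Only every Nth term goes into the in-memory "index"; a term -> ord
// lookup binary-searches the samples and, when the term is not a sample,
// scans just one block of the full dictionary (the rare "DirectSource" hit).
class SampledSortIndex {
    static final int INTERVAL = 128;    // every n'th term, as suggested above
    private final String[] allTerms;    // stand-in for the DirectSource
    private final String[] samples;     // every INTERVAL'th term

    SampledSortIndex(String[] sortedTerms) {
        allTerms = sortedTerms;
        int n = (sortedTerms.length + INTERVAL - 1) / INTERVAL;
        samples = new String[n];
        for (int i = 0; i < n; i++) samples[i] = sortedTerms[i * INTERVAL];
    }

    int ord(String term) {
        int s = Arrays.binarySearch(samples, term);
        if (s >= 0) return s * INTERVAL;   // resolved from the index alone
        int block = -s - 2;                // sample block preceding the term
        if (block < 0) return -1;          // term sorts before the first term
        int start = block * INTERVAL;
        int end = Math.min(start + INTERVAL, allTerms.length);
        for (int ord = start; ord < end; ord++) { // rare DirectSource scan
            if (allTerms[ord].equals(term)) return ord;
        }
        return -1;
    }
}
```

Most sort comparisons would resolve against a sample (or its block bounds) without touching the full dictionary, which is the hoped-for win.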
        Michael McCandless added a comment -

        New patch, still just prototyping on FC, but now all tests pass.

        I enabled packing and the wikipedia title data is now ~43.8% smaller than what FC does today (PagedBytes + PackedInts).

        Results are about the same as before:

                Task    QPS base  StdDev base   QPS fstfc  StdDev fstfc    Pct diff
            PKLookup      127.87        2.68      115.22        5.37  -15% -   -3%
               TermTitleSort       69.77        4.32       64.91        2.62  -15% -    3%
                TermBGroup1M       35.85        1.03       34.49        0.92   -8% -    1%
                 TermGroup1M       26.58        0.72       26.13        0.38   -5% -    2%
                     Respell       75.04        2.63       74.28        1.33   -6% -    4%
                      Fuzzy1       86.35        1.93       86.27        1.34   -3% -    3%
                      Phrase       18.92        0.57       18.94        0.57   -5% -    6%
                    SpanNear        1.46        0.02        1.46        0.05   -4% -    5%
                SloppyPhrase       15.85        0.69       15.93        0.69   -7% -    9%
                      Fuzzy2       31.37        0.61       31.65        0.53   -2% -    4%
              TermBGroup1M1P       44.99        1.32       45.47        0.74   -3% -    5%
                  AndHighMed       40.22        1.00       41.43        0.32    0% -    6%
                    Wildcard       26.11        1.15       27.15        0.21   -1% -    9%
                  OrHighHigh        6.14        0.42        6.40        0.34   -7% -   17%
                   OrHighMed       10.65        0.72       11.10        0.60   -7% -   17%
                 AndHighHigh        9.16        0.33        9.56        0.04    0% -    8%
                     Prefix3       43.07        2.32       45.34        0.55   -1% -   12%
                        Term       34.11        1.60       36.09        1.19   -2% -   14%
                      IntNRQ        7.66        0.64        8.22        0.57   -7% -   25%
        
        Simon Willnauer added a comment -

        here is a first cut at using an FST to hold terms in sorted DocValues. This patch holds all data in an FST and currently doesn't support a direct source, i.e. all FSTs are loaded into memory even during merging. All tests (except BWCompat – which is good!) pass. I think we can have this as a first step but not make it the default?

        Simon Willnauer added a comment -

        next iteration...

        • using FSTEnum to do getOrdByValue
        • opened up FSTEnum to subclass in different package
        • FSTEnum#seekCeil/Floor now returns a SeekStatus to prevent comparisons
        • fixed some tests to not rely on a "default" value; that is a general test problem in DocValues, not related to this issue

        I still have some nocommits in there. The current impl of getOrdByValue creates a new FSTEnum for each lookup, which is costly; we should be able to share it somehow, but I really don't like using a ThreadLocal. Maybe we need a Lookup object or something like that, but I'm not sure yet.
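One shape the "Lookup object" alternative could take, sketched with stdlib stand-ins (SortedTermsSource, OrdLookup, and newOrdLookup are hypothetical names, not Lucene API): each consumer asks the source for its own lookup once and reuses it, so there is no ThreadLocal and no per-call allocation.

```java
import java.util.*;

// Hypothetical sketch: the source hands out a stateful lookup object that
// the caller creates once (per search thread / per consumer) and reuses.
class SortedTermsSource {
    private final String[] sortedTerms;

    SortedTermsSource(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    // One of these per consumer; it owns the reusable scratch state that a
    // per-call FSTEnum would otherwise reallocate on every lookup.
    class OrdLookup {
        private final StringBuilder scratch = new StringBuilder(); // reused buffer

        int getOrdByValue(String value) {
            scratch.setLength(0);      // reuse scratch instead of allocating
            scratch.append(value);
            return Arrays.binarySearch(sortedTerms, scratch.toString());
        }
    }

    OrdLookup newOrdLookup() {
        return new OrdLookup();
    }
}
```

A caller would do `OrdLookup lookup = source.newOrdLookup();` once up front and then call `lookup.getOrdByValue(...)` in its hot loop, keeping lookup state scoped to the consumer rather than to a thread.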

        Michael McCandless added a comment -

        Very cool!

        Why did you need a custom subclass of FSTEnum (instead of
        BytesRefFSTEnum)? Is it for getOutput? Couldn't you
        getCurrent().output instead?

        We could probably make a static Util.getCeil (like Util.get, but
        finds ceiling term if there is no exact match) ... would likely be
        faster than re-using an FSTEnum ... but sorta hairy to write.

        I'm curious how DV sort perf compares ... hopefully it's little loss
        (going on my previous tests here...).

        Awesome to fix seekCeil to return enum result ... maybe factor out the
        same enum from TermsEnum so we can share it? And you can remove the
        two TODOs from FSTEnum.java...

        Can ToFSTBytesRefConsumer be moved to FSTSortedBytesImpl.java?

        David Smiley added a comment -

        See LUCENE-4562 which should improve sorting performance across segments.

        I also wonder if the ords might be accessible to the higher level APIs so that those APIs could work off of ords (int) instead of terms (BytesRef). Maybe this is too radical, or I'm out of my league on understanding this stuff and I'm making no sense.

        Michael McCandless added a comment -

        I also wonder if the ords might be accessible to the higher level APIs so that those APIs could work off of ords (int) instead of terms (BytesRef).

        The challenge is the ords are not "global" (ie not a total order): they are only ordered for the terms within one segment, unless your index is single segment, or you're working with a top-level reader.
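A minimal sketch of what "global" ords would require, assuming the per-segment term lists: merge them into one total order, then remap each (segment, segmentOrd) pair onto it. GlobalOrds is a hypothetical helper for illustration, not Lucene API.

```java
import java.util.*;

// Per-segment ords are only a total order within one segment; to compare
// ords across segments you first have to merge the segments' sorted term
// lists and remap: map[segment][segmentOrd] -> globalOrd.
class GlobalOrds {
    static int[][] build(String[][] segmentTerms) { // each inner array sorted
        TreeSet<String> merged = new TreeSet<>();
        for (String[] seg : segmentTerms) {
            merged.addAll(Arrays.asList(seg));
        }
        List<String> global = new ArrayList<>(merged); // global sorted order
        int[][] map = new int[segmentTerms.length][];
        for (int s = 0; s < segmentTerms.length; s++) {
            map[s] = new int[segmentTerms[s].length];
            for (int ord = 0; ord < segmentTerms[s].length; ord++) {
                map[s][ord] = Collections.binarySearch(global, segmentTerms[s][ord]);
            }
        }
        return map;
    }
}
```

With seg0 = {apple, pear} and seg1 = {banana, pear}, segment ord 1 means "pear" in both segments but segment ord 0 does not (apple vs banana); globally apple=0, banana=1, pear=2, which is exactly the remapping a multi-segment consumer would need.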

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 2
          • Watchers: 5
