Lucene - Core
LUCENE-2723

Speed up Lucene's low level bulk postings read API

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-1410.

      The flex DocsEnum has a simple bulk-read API that reads the next chunk
      of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR
      (from LUCENE-1410). This is not unlike sucking coffee through those
      tiny plastic coffee stirrers they hand out on airplanes that,
      surprisingly, also happen to function as a straw.

      As a result we see no perf gain from using FOR/PFOR.

      I had hacked up a fix for this, described in my blog post at
      http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html

      I'm opening this issue to get that work to a committable point.

      So... I've worked out a new bulk-read API to address this performance
      bottleneck. It has some big changes from the current bulk-read API:

      • You can now also bulk-read positions (but not payloads), but, I
        have yet to cutover positional queries.
      • The buffer contains doc deltas, not absolute values, for docIDs
        and positions (freqs are absolute).
      • Deleted docs are not filtered out.
      • The doc & freq buffers need not be "aligned". For fixed intblock
        codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16,
        Group varint, etc.) they won't be.

      It's still a work in progress...
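
      To make the shape of this concrete, here is a rough sketch of what such
      an API could look like (hypothetical names and signatures only, not the
      committed code): the enum hands out reusable int[] blocks of doc deltas
      and freqs, fill() reads the next block in place, and jump() uses the
      skip data to leap ahead.

        // Hedged sketch only -- illustrative, not the final API.
        import java.io.IOException;

        public abstract class BulkPostingsEnum {

          /** One stream of ints (doc deltas, freqs, or position deltas). */
          public static abstract class BlockReader {
            /** Returns the shared buffer; fill() overwrites it in place. */
            public abstract int[] getBuffer();
            /** Reads the next block, returning how many ints are now valid. */
            public abstract int fill() throws IOException;
            /** Offset of the first valid int after the last fill(). */
            public abstract int offset();
            /** End (exclusive) of the valid ints after the last fill(). */
            public abstract int end();
          }

          public abstract BlockReader getDocDeltasReader() throws IOException;
          public abstract BlockReader getFreqsReader() throws IOException;

          /** Where a skip landed: the absolute docID and how many docs precede it. */
          public static final class JumpResult {
            public int docID;
            public int count;
          }

          /** Skip to the first doc >= target, or return null if skipping won't help. */
          public abstract JumpResult jump(int target, int curCount) throws IOException;
        }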

      Attachments

      1. LUCENE-2723_bulkvint.patch
        10 kB
        Robert Muir
      2. LUCENE-2723_facetPerSeg.patch
        33 kB
        Yonik Seeley
      3. LUCENE-2723_facetPerSeg.patch
        8 kB
        Yonik Seeley
      4. LUCENE-2723_openEnum.patch
        0.8 kB
        Yonik Seeley
      5. LUCENE-2723_termscorer.patch
        16 kB
        Robert Muir
      6. LUCENE-2723_wastedint.patch
        23 kB
        Robert Muir
      7. LUCENE-2723.patch
        4 kB
        Simon Willnauer
      8. LUCENE-2723.patch
        153 kB
        Simon Willnauer
      9. LUCENE-2723.patch
        153 kB
        Robert Muir
      10. LUCENE-2723.patch
        170 kB
        Simon Willnauer
      11. LUCENE-2723.patch
        401 kB
        Michael McCandless
      12. LUCENE-2723.patch
        396 kB
        Michael McCandless
      13. LUCENE-2723-BulkEnumWrapper.patch
        13 kB
        Simon Willnauer
      14. LUCENE-2723-termscorer.patch
        20 kB
        Simon Willnauer
      15. LUCENE-2723-termscorer.patch
        20 kB
        Simon Willnauer
      16. LUCENE-2723-termscorer.patch
        21 kB
        Simon Willnauer

        Issue Links

          Activity

          Michael McCandless added a comment -

          Attached patch, which also includes FrameOfRef/PatchedFrameOfRef
          codecs modified some from LUCENE-1410.

          All tests pass (at least once, heh) for all core/test codecs, but,
          there are still many nocommits to work through...

          I've cutover a few places to the new API (the old one is still there
          but I plan to fully cutover and remove it before committing).

          Michael McCandless added a comment -

          New patch, sync'd to trunk.

          Simon Willnauer added a comment -

          New patch, sync'd to trunk.

          Amazing, Mike - what a huge patch. Unfortunately there is a file missing from the patch - benchmark cannot find GatherFieldValuesTask - and o.a.l.u.pfor.Pfor.java uses the Java 6 Arrays.copyOfRange(..)

          I will hack around and run some tests

          Michael McCandless added a comment -

          Ugh, sorry – that GatherFieldValues isn't going to be committed; you can discard all mods under contrib/benchmark in the patch.

          And we'll have to fix the copyOfRange before committing. I'll stick a nocommit on it.

          Uwe Schindler added a comment -

          And we'll have to fix the copyOfRange before committing. I'll stick a nocommit on it.

          The copyOfRange methods in Java 6 are 3-liners in src.zip/harmony. We can add the harmony versions to ArrayUtil and we are fine. For native types it is veeeeery easy, and for non-native types (? extends Object) it uses the Array reflection class to generate a target array of the same type (String[] -> String[]).

          I could supply a patch. Which variants (datatypes) do you need?
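
          For example, the int[] variant is just the usual few lines around
          System.arraycopy (a sketch; the exact placement and naming in ArrayUtil
          is open):

            // Hypothetical Java 5-compatible equivalent of Arrays.copyOfRange(int[], int, int).
            public static int[] copyOfRange(int[] original, int from, int to) {
              final int newLength = to - from;
              if (newLength < 0) {
                throw new IllegalArgumentException(from + " > " + to);
              }
              final int[] copy = new int[newLength];
              System.arraycopy(original, from, copy, 0,
                               Math.min(original.length - from, newLength));
              return copy;
            }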

          Simon Willnauer added a comment -

          Mike, this patch is so impressive that I didn't even make it to the bottom. I think it is bigger than the docValues branch changes, so here is my take on it.... I think this is extremely valuable but we really should try to split it up into multiple patches. The easiest approach would be moving PFoR and FoR out of it to make it at least reviewable.
          I would volunteer to do so if we agree on it. I am actually close to proposing a branch for this, since it's truly huge and it would be easier to keep up with trunk. Either way, let me know and I'll get it started....

          simon

          Michael McCandless added a comment -

          Definitely, let's move FOR/PFOR out onto a separate issue – they don't need to be included in the cutover. I just use them for testing the bulk API.

          Simon Willnauer added a comment -

          here is a patch without all the FoR and PFoR stuff - ~150 kB sounds more reasonable. All tests pass on trunk.

          we have 55 nocommits left in that patch but it actually seems quite close to me. I will start iterating next week I guess - Mike, if you feel like it, go ahead and start iterating.....

          simon

          Robert Muir added a comment -

          here's an updated patch... i did the easy stuff, 41 nocommits left (i added one).

          Some notes:

          • we should remove the old API everywhere... contrib/misc/HighFreqTerms and solr/faceting/UninvertedField are the only users left?
          • TermScorer is very scary, we gotta simplify. I don't like not being able to understand it.
          • For MultiTermQueries, TermQuery shouldn't calculate that per-segment docFreq... MTQ "knows" this docFreq, so maybe we need to solve the TermState first. i added a nocommit here as it will hurt performance.
          • i don't think exposing the MMap buffer is helpful, and it's dangerous. The Pfor/For should work on DataInput.readInt() instead of wrapping bytebuffers as intbuffers? anyway this can be dealt with on the pfor/for issue.
          Simon Willnauer added a comment -

          here is a first try to clean up the TermScorer a little bit. I introduced a helper class BulkReaderConsumer.java that generalizes some of the operations. Before we shoot for something like this super tough TermScorer, I would really be interested in whether it isn't premature optimization and whether hotspot could do a good job with small and simple methods.

          mike can you give it a try on beast?

          Michael McCandless added a comment -

          Simon, the benchmarker barfs on conjunction queries w/ the improvement to TermScorer (eg +nebraska +states gets different results).

          Michael McCandless added a comment -

          I disabled the conjunction queries and then ran perf test w/ the simplified TermScorer:

          Query QPS base QPS simon Pct diff
          state 32.53 29.43 -9.5%
          unit~2.0 16.88 15.83 -6.2%
          unit~1.0 17.17 16.21 -5.6%
          unit state 9.64 9.12 -5.4%
          spanFirst(unit, 5) 17.09 16.79 -1.7%
          united~1.0 62.01 60.95 -1.7%
          united~2.0 15.86 15.70 -1.0%
          u*d 31.09 30.77 -1.0%
          spanNear([unit, state], 10, true) 4.17 4.13 -1.0%
          uni* 14.38 14.27 -0.7%
          unit* 28.45 28.24 -0.7%
          un*d 67.19 66.78 -0.6%
          "unit state" 8.59 8.56 -0.3%

          Unfortunately the extra abstraction (BlockReaderConsumer) made things a good amount slower...

          Michael McCandless added a comment -

          I think we should cut a branch starting w/ Robert's latest patch... I'll do that soon.

          Michael McCandless added a comment -

          OK I created the branch at https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings ... all tests pass now.

          Yonik Seeley added a comment -

          FYI, just so we don't duplicate work, I'm in the middle of converting some of the solr uses.

          Robert Muir added a comment -

          Here's a patch to make TermScorer more readable: advance() is still scary
          but the rest starts to look reasonable.

          I pulled out the omitTF case into a MatchOnlyTermScorer.
          Here's the benchmark with luceneutil.

                         Query  QPS branch   QPS patch  Pct diff
          spanNear([unit, state], 10, true)        2.91        2.87     -1.3%
                          uni*       11.36       11.31     -0.4%
                         unit*       20.89       20.81     -0.4%
                  "unit state"        6.14        6.13     -0.2%
                           u*d       17.30       17.28     -0.1%
                    unit state        7.47        7.46     -0.1%
                          un*d       55.42       55.69      0.5%
            spanFirst(unit, 5)       12.27       12.34      0.6%
                    united~2.0       13.51       13.61      0.7%
                    united~1.0       49.88       50.30      0.8%
                      unit~1.0       13.00       13.27      2.0%
                         state       27.67       28.32      2.4%
                      unit~2.0       12.46       12.79      2.6%
              +nebraska +state       75.91       79.97      5.3%
                  +unit +state        8.63        9.25      7.1%
          
          Robert Muir added a comment -

          by the way, i tried an experiment... i don't much trust the +nebraska +state results in general,
          but the +unit +state is more stable and I ran this one twice to be sure... i think we should investigate more

          in advance() I changed:

              // not found in current block, seek underlying stream
              BulkPostingsEnum.JumpResult jumpResult = docsEnum.jump(target, count);
              if (jumpResult != null) {
          

          to:

              // not found in current block, seek underlying stream
              BulkPostingsEnum.JumpResult jumpResult;
              if (target - doc > docDeltas.length*4 && (jumpResult = docsEnum.jump(target, count)) != null) {
          
          Query  QPS branch   QPS patch  Pct diff
              +nebraska +state       75.91       84.46     11.3%
                  +unit +state        8.67       10.01     15.4%
          
          Query  QPS branch   QPS patch  Pct diff
              +nebraska +state       73.23       80.27      9.6%
                  +unit +state        8.66       10.00     15.5%
          
          Robert Muir added a comment -

          ok after a lot of benchmarking with different "short jump distances" i found the best thing
          to do was the most conservative: just avoid useless jumps when the target has to be in the
          next fill()

          anything else was just in the wind: it only matters for high freq terms anyway.

          I committed this in r1049411, since it's always a win and independent of the refactoring.

          Michael McCandless added a comment -

          That's awesome speedup Robert!

          Michael McCandless added a comment -

          I think we should commit the prototype FOR/PFOR codec onto the branch so that we can run perf tests and properly stretch the new API...

          if these codecs aren't ready (but the core bulk API changes are) by the time we want to land this back on trunk, we can always drop them when merging back.

          So I'll go commit them onto the branch...

          Michael McCandless added a comment -

          OK I brought PFOR/FOR back to life on the branch; all tests pass w/ each codec.

          Simon Willnauer added a comment -

          I played around a little with recommended actions from the hotspot wiki and factored some methods out of TermScorer and MatchOnlyTermScorer. I merged my changes with the ones from Robert's latest patch.
          All tests pass, and I compared base (branch) and spec (branch + patch) on a 10M doc wikipedia index. I actually was surprised by the results though:

          
          
                         Query    QPS base    QPS spec  Pct diff
                  "unit state"        1.71        1.70     -0.6%
            spanFirst(unit, 5)        4.81        4.79     -0.5%
                          un*d       14.94       14.88     -0.4%
                    united~1.0        8.79        8.76     -0.4%
                         unit*        8.70        8.66     -0.4%
                          uni*        4.16        4.14     -0.3%
                           u*d        4.01        4.01     -0.1%
                    united~2.0        1.88        1.88      0.0%
          spanNear([unit, state], 10, true)        0.94        0.96      2.8%
                      unit~1.0        4.32        4.49      4.0%
                      unit~2.0        4.20        4.38      4.4%
              +nebraska +state       24.90       26.11      4.8%
                  +unit +state        4.60        4.97      8.0%
                    unit state        3.60        3.93      9.1%
                         state        9.83       10.98     11.7%
          

          I ran those twice with very similar results....
          3 iters, 40 iters per JVM and 2 threads on delmulti index

          here is my JVM

          java version "1.6.0_22"
          Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
          Java HotSpot(TM) Server VM (build 17.1-b03, mixed mode)
          

          started with:

            java -Xbatch -Xms2g -Xmx2g -server 
          
          Simon Willnauer added a comment -

          updated patch to work with Mike's simplification with pre-filled buffer on Jump

          Robert Muir added a comment -

          in the MatchOnly advance() we should avoid the useless jumps too?

          Robert Muir added a comment -

          fyi: i'll be working on removing the mmap hack (i think we shouldn't expose its buffer
          and create intbuffer views over it, and i think simply working on datainput in for/pfor will actually be faster).
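
          As a rough illustration of that direction (a hedged sketch with a made-up
          block layout, not the actual branch code), a FOR-style block decode working
          straight off DataInput could look like:

            import java.io.IOException;
            import org.apache.lucene.store.DataInput;

            // Sketch: assumes each block is a vint header (bits per value) followed by
            // the packed values; the real FrameOfRef format differs in its details.
            static void readBlock(DataInput in, int[] buffer, int blockSize) throws IOException {
              final int numBits = in.readVInt();       // bits per value for this block
              if (numBits == 32) {                     // uncompressed fallback
                for (int i = 0; i < blockSize; i++) {
                  buffer[i] = in.readInt();
                }
                return;
              }
              long pending = 0;                        // bit accumulator (MSB-first packing)
              int pendingBits = 0;
              for (int i = 0; i < blockSize; i++) {
                while (pendingBits < numBits) {        // pull in one more byte
                  pending = (pending << 8) | (in.readByte() & 0xFF);
                  pendingBits += 8;
                }
                pendingBits -= numBits;
                buffer[i] = (int) ((pending >>> pendingBits) & ((1L << numBits) - 1));
              }
            }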

          Robert Muir added a comment -

          before resolving this issue I think we are going to need at least basic unit tests for PFor/FOR,
          I don't think we should rely exclusively on randomly using PatchedFrameOfRef in unit tests.

          I looked at LUCENE-1410 but it seemed the unit tests there are really performance tests...

          Simon Willnauer added a comment -

          linking issues as we need to resolve the TermState patch before this can go to trunk

          Michael McCandless added a comment -

          Patch looks great Simon – much simpler! And faster

          Simon Willnauer added a comment -

          added Robert's prevent-useless-jump improvement to MatchOnlyScorer.
          I think this is ready - I will commit shortly

          Yonik Seeley added a comment - edited

          I did some faceting performance tests of solr trunk vs branch.
          This is a 10M doc index w/ standard codec. It looks like terms that match many docs got a little speedup, but terms that match few docs were slowed down.

          Trunk

          unique values in field   cache.minDf=0   cache.minDf=maxdoc
                              10             133                  208
                             100             176                  256
                            1000             214                  314
                           10000             433                  516
                          100000            2166                 2096
                        10000000           30974                21093

          Branch:

          unique values in field   cache.minDf=0   cache.minDf=maxdoc
                              10             132                  187
                             100             171                  237
                            1000             223                  292
                           10000             487                  547
                          100000            3190                 3123
                        10000000           45071                36504

          One thing trunk had was specialized per-segment code - it used MultiDocsEnum.getSubs(), so this is not necessarily a problem with the new bulk codec.

          update: to see if it was MultiFields related (i.e. that Solr's branch code does not yet work per-segment where trunk does) I optimized the index and reran the worst-case (10M unique values). result: trunk=1.4sec, branch=25sec.

          Yonik Seeley added a comment -

          I tested the optimized index with mike's latest patches (since that's "per segment" on both branch and trunk). Things are much more in line now... with the branch being anywhere from 2.3% to 5.4% slower, depending on the exact field tested.

          Robert Muir added a comment -

          patch with more refactoring of For/Pfor decompression:

          • The decompressors take DataInput, but still use the IntBuffer technique for now.
          • I removed the wasted int-per-block in For.
          Yonik Seeley added a comment -

          Here's a small patch that may be sufficient to enable dropping down to per-segment work while still using MultiTerms/MultiTermsEnum to traverse terms in order. It basically makes the TermsEnumWithSlice members public, and adds a bulkPostings member for reuse.

          Is this the right approach?

          Michael McCandless added a comment -

          Looks good Yonik!

          Yonik Seeley added a comment -

          Here's a patch that changes just one place in faceting to per-segment bulk.
          Same 10M doc index, testing w/ cache.minDf=maxdoc only (no use of filterCache since I haven't changed that to per-seg yet). Time in ms.

          unique values in field   trunk per-seg   branch per-seg   speedup
                              10             161              173       -7%
                             100             217              218       -0%
                            1000             267              262        2%
                           10000             465              325       43%
                          100000            2025              678      199%
                        10000000           21061             4393      379%

          Now, facet.method=enum wasn't even designed for many unique values in a field, but this more efficient per-segment bulk code certainly expands the range where it's feasible. My guess is that the speedup is due to us dropping to per-segment quicker with this patch (trunk gets a bulk enum, and then drops to per-segment).

          The drop in performance for the high df field (each value will match ~1M docs) is curious. Seems like this should be a more efficient inner loop, but I guess hotspot just optimized it differently.

          Based on these results, I'll re-convert the rest of the code to go per-segment too.

          Yonik Seeley added a comment -

          Here's an update to faceting per seg w/ the new bulk API.

          Robert Muir added a comment -

          attached is a silly codec, but i think useful when benchmarking:
          an intblock codec that encodes blocks with vint, with a single vint header
          containing the uncompressed length in bytes.

          I think this is useful because when comparing StandardCodec to FixedIntBlockCodecs such as FOR/PFOR, and also considering the bulk read API, there are a lot of variables in play... especially different I/O access patterns.

          so I think we should try to separate some of these variables out, and a FixedIntBlockCodec that uses Vint is useful for having a real "baseline" when looking at other compression algorithms, because FixedIntBlock is very different from standard already.
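
          For illustration, the block layout described above is roughly this on the
          write side (a hedged sketch with made-up helper names; the real codec plugs
          into the FixedIntBlock plumbing):

            import java.io.IOException;
            import org.apache.lucene.store.DataOutput;

            // Sketch: each fixed-size block is one vint header holding the block's
            // length in bytes, followed by the values as plain vints, so the reader
            // can pull the whole block in a single read before decoding.
            static void writeBlock(DataOutput out, int[] buffer, int blockSize) throws IOException {
              int numBytes = 0;
              for (int i = 0; i < blockSize; i++) {    // first pass: size of the encoded block
                numBytes += vIntLength(buffer[i]);
              }
              out.writeVInt(numBytes);                 // the single vint header
              for (int i = 0; i < blockSize; i++) {    // second pass: the values themselves
                out.writeVInt(buffer[i]);
              }
            }

            static int vIntLength(int value) {         // bytes writeVInt(value) will use
              int bytes = 1;
              while ((value & ~0x7F) != 0) {
                value >>>= 7;
                bytes++;
              }
              return bytes;
            }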

          here's a comparison between Standard (bulkpostings branch) and VintBlock (bulkpostings branch):
          I used mmap on windows (but also tried simplefs with similar results)

          anyway, i think it's interesting enough for future benchmarking that i'd like to commit it to the branch. we can
          later delete this codec or simply change MockFixedIntBlock to work like this one.

          Query QPS Standard QPS BulkVInt Pct diff
          u*d 12.43 10.55 -15.1%
          un*d 45.05 39.70 -11.9%
          "unit state"~3 3.75 3.43 -8.5%
          +nebraska +state 69.97 65.92 -5.8%
          spanNear([unit, state], 10, true) 3.01 2.92 -2.7%
          spanFirst(unit, 5) 11.39 11.35 -0.4%
          "unit state" 5.99 5.99 -0.1%
          united~1.0 9.27 9.70 4.6%
          united~2.0 1.95 2.04 4.9%
          unit~1.0 5.89 6.34 7.6%
          unit~2.0 5.76 6.20 7.8%
          unit state 7.00 8.05 15.0%
          +unit +state 9.11 11.09 21.7%
          unit* 23.28 29.50 26.7%
          uni* 13.29 16.93 27.4%
          state 21.43 28.79 34.3%
          Robert Muir added a comment -

          looks like the culprit for the slower low-freq terms (and for the larger index) is this in SepPostingsWriter:

                // nocommit -- only write if docFreq > skipInterval?
                docOut.writeVLong(skipOut.getFilePointer());
          

          I think this hurts the low-freq terms and is why we see slow wildcards etc.
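
          The fix the nocommit hints at would be roughly (a sketch; grounded only in
          the nocommit's own suggestion):

              // Only record a skip pointer for terms that actually have skip data,
              // so low-freq terms don't pay for the extra vlong in the .doc file.
              if (docFreq > skipInterval) {
                docOut.writeVLong(skipOut.getFilePointer());
              }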

          Robert Muir added a comment -

          also the payloads pointer... we need this out of the .doc!

          Yonik Seeley added a comment -

          Should we keep MultiBulkPostingsEnum?
          Even when someone writes their code to work per-segment, not all IndexReader implementations may be able to provide segment-level readers. ParallelReader is one that can't currently?

          Michael McCandless added a comment -

          Should we keep MultiBulkPostingsEnum?

          I think we have to keep it. EG if someone makes a SlowMultiReaderWrapper and then runs searches on it...

          ParallelReader is one that can't currently?

          ParallelReader is a tricky one.

          If your ParallelReader only contains SegmentReaders (and eg you make a MultiReader on top), then everything's great, because ParallelReader dispatches by field to a unique SegmentReader.

          But if instead you make a ParallelReader whose child readers are themselves MultiReaders, then, yes it's basically the same as wrapping all of these subs in a SlowMultiReaderWrapper.

          Simon Willnauer added a comment -

          the blocker has been committed - we should merge though!

          Robert Muir added a comment -

          I merged us up to yesterday (1052991:1057836), but stopped at the Pulsing codec rewrite

          Mike can you assist in merging r1057897? Besides requiring a lot of beer there is a
          danger of screwing it up, since we have to re-implement its bulk postings enum.

          Michael McCandless added a comment -

          I merged us up to yesterday (1052991:1057836),

          Awesome, thanks!!

          Mike can you assist in merging r1057897?

          Will do.

          Robert Muir added a comment -

          Ok, we are caught up to trunk... but we need to integrate getBulkPostingsEnum with termstate to fix the nocommits in TermQuery.

          This should also finally allow us to fix the cost of that extra per-segment docFreq.

          Simon Willnauer added a comment -

          here is a fix for the nocommit Robert put into TermQuery. All tests pass, i will commit in a bit

          Simon Willnauer added a comment -

          This patch adds a BulkPostingsEnumWrapper that implements DocsEnumAndPositions by using the bulk postings. I first just added this as a class to ease testing for PositionDeltaBulks, but it seems that this could be useful for more than just testing. Codecs that don't want to implement the DocsEnumAndPositions API can just use this wrapper to provide the functionality.
          I also added a testcase for MemoryIndex that uses this wrapper.
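
          For context, the doc-iteration side of such a wrapper is roughly the
          following (a hedged sketch with made-up names; the real wrapper also has
          to handle freqs, positions and deleted docs, which the bulk API does not
          filter out):

            import java.io.IOException;

            // Drives a DocsEnum-style nextDoc() from a bulk doc-delta BlockReader.
            final class SketchDocsWrapper {
              static final int NO_MORE_DOCS = Integer.MAX_VALUE;

              private final BulkPostingsEnum.BlockReader docDeltasReader;
              private final int[] docDeltas;   // shared buffer, refilled in place
              private int pos, limit;          // current window into the buffer
              private int docsLeft;            // docs remaining for this term
              private int doc = -1;

              SketchDocsWrapper(BulkPostingsEnum.BlockReader docDeltasReader, int docFreq)
                  throws IOException {
                this.docDeltasReader = docDeltasReader;
                this.docDeltas = docDeltasReader.getBuffer();
                this.pos = docDeltasReader.offset();
                this.limit = docDeltasReader.end();
                this.docsLeft = docFreq;
              }

              int nextDoc() throws IOException {
                if (docsLeft == 0) {
                  return doc = NO_MORE_DOCS;
                }
                if (pos >= limit) {            // drained this block: refill it
                  docDeltasReader.fill();
                  pos = docDeltasReader.offset();
                  limit = docDeltasReader.end();
                }
                docsLeft--;
                doc += docDeltas[pos++];       // deltas accumulate into absolute docIDs
                return doc;
              }
            }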

          Robert Muir added a comment -

          Simon, just took a quick glance (not a serious review, all the bulkpostings stuff is heavy).

          I agree with the idea that Codecs should only need to implement the bulk api at a minimum:
          if all serious stuff (queries) is using these bulk apis, then the "friendly" iterator methods
          can simply be a wrapper over it.

          but separately, i know there are some performance degradations with the bulk APIs today
          versus trunk... (with the same index). I know if i use other fixed-int codecs i see these same
          problems, so i don't think it's just Standard's implementation: pretty sure the issue is somewhere
          with advance()/jump().

          I really wish we could debug whatever this performance problem is, just in case the bulk APIs
          themselves need changing... a little concerned about them at the moment, that's all...
          not sure it should stand in the way of your patch, just saying I don't like the performance
          regression.

          Simon Willnauer added a comment -

          I really wish we could debug whatever this performance problem is, just in case the bulk APIs
          themselves need changing... a little concerned about them at the moment, that's all...
          not sure it should stand in the way of your patch, just saying I don't like the performance
          regression

          yeah I agree - I think we should open a separate issue to figure out what the problem is here. Unrelated to this, the wrapper gives me the ability to test the bulk APIs easily together with the enum API, which is valuable in any case. I am opening a new issue and will commit the latest patch to the branch with the wrapper moved to /src/test. We can still move it to /src/java later though.

          Michael McCandless added a comment -

          Simon, the bulk enum wrapper looks great! But maybe we should merge in the BulkPayload additions (from LUCENE-2878), and that way the wrapper is fully functional?

          Michael McCandless added a comment -

          We nuked the low level bulk postings API... and BlockPostingsFormat now does bulk reads under the hood and gives great performance ...

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved:
