Lucene - Core
LUCENE-3069

Lucene should have an entirely memory resident term dictionary

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.7
    • Component/s: core/index, core/search
    • Labels:
    • Lucene Fields:
      New

      Description

      The FST-based term dictionary has been a great improvement, yet it still uses a delta-coded file for scanning to terms. Some environments have enough memory available to keep the entire FST-based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Outputs) and builds the FST from the entire term, not just the delta.
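The proposal is essentially a keyed lookup from the whole term to its metadata, instead of a delta-coded scan. A minimal sketch of that idea, using a plain sorted map as a stand-in for the FST (the real implementation would use org.apache.lucene.util.fst with custom outputs; all names below are hypothetical illustrations, not patch code):

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the proposed design: the patch builds an FST whose custom
// outputs carry this metadata, but a sorted map shows the same idea - the
// whole term (not a delta) is the key, and everything needed to start
// reading postings is the value, all resident in memory.
class InMemoryTermDict {
    // Per-term metadata the FST outputs would encode.
    static final class TermMeta {
        final int docFreq;          // df
        final long totalTermFreq;   // ttf
        final long postingsPointer; // file pointer into the postings
        TermMeta(int df, long ttf, long fp) {
            docFreq = df; totalTermFreq = ttf; postingsPointer = fp;
        }
    }

    private final TreeMap<String, TermMeta> terms = new TreeMap<>();

    void add(String term, int df, long ttf, long fp) {
        terms.put(term, new TermMeta(df, ttf, fp));
    }

    // Exact lookup: no on-disk term scanning needed.
    TermMeta seekExact(String term) {
        return terms.get(term);
    }

    // Ceiling lookup, analogous to TermsEnum.seekCeil.
    String seekCeil(String term) {
        Map.Entry<String, TermMeta> e = terms.ceilingEntry(term);
        return e == null ? null : e.getKey();
    }
}
```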

      1. LUCENE-3069.patch
        40 kB
        Han Jiang
      2. LUCENE-3069.patch
        4 kB
        Han Jiang
      3. LUCENE-3069.patch
        4 kB
        Han Jiang
      4. LUCENE-3069.patch
        33 kB
        Han Jiang
      5. LUCENE-3069.patch
        33 kB
        Han Jiang
      6. LUCENE-3069.patch
        34 kB
        Han Jiang
      7. LUCENE-3069.patch
        12 kB
        Han Jiang
      8. LUCENE-3069.patch
        15 kB
        Han Jiang
      9. LUCENE-3069.patch
        9 kB
        Han Jiang
      10. LUCENE-3069.patch
        47 kB
        Han Jiang
      11. LUCENE-3069.patch
        101 kB
        Han Jiang
      12. LUCENE-3069.patch
        14 kB
        Han Jiang
      13. LUCENE-3069.patch
        282 kB
        Han Jiang
      14. LUCENE-3069.patch
        283 kB
        Han Jiang
      15. example.png
        458 kB
        Han Jiang
      16. df-ttf-estimate.txt
        8 kB
        Han Jiang

        Issue Links

          Activity

          Michael McCandless added a comment -

          LUCENE-3209 is a new codec that puts everything (terms + postings) in RAM. For this issue I think we should make this controllable, ie so terms can be in RAM but postings remain in the Directory.

          David Smiley added a comment -

          I'd love to see this come to pass. I've been thinking about what goes on a layer beneath TermsEnum (i.e. how it is implemented) as I work on spatial stuff. Geohash prefixes are a natural fit for FSTs; it should compress ridiculously well. There is an approach to building a heatmap (spatial grid faceting) that I'm thinking of that would do 2500 seek()'s for a 50x50 grid; I'd like those seek's to be as fast as possible. I have another approach in mind requiring a slightly different encoding, but it would do 2500 next()'s which should be faster. Nonetheless; it's a lot – ideally the terms dict would be entirely memory resident.

          Han Jiang added a comment -

          This project is quite interesting!

          Since we already have an entirely memory-resident PF, the targets of this project seem to be:
          1. implement a simplified version of BlockTreeTerms*;
          2. change the API of the current PostingsBaseFormat, so that non-block-based term dicts can be plugged into it (ideally, MemoryPF should work with this).

          Han Jiang added a comment - edited

          This is my initial proposal for this project: https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/34001
          I'm looking forward to your feedback.

          Michael McCandless added a comment -

          Han would like to tackle this for GSoC 2013...

          Han Jiang added a comment -

          Detailed ideas/wild thoughts will be put here: https://gist.github.com/sleepsort/5642021

          Commit Tag Bot added a comment -

          [lucene3069 commit] mikemccand
          http://svn.apache.org/viewvc?view=revision&revision=1493493

          LUCENE-3069: merge trunk changes over

          Commit Tag Bot added a comment -

          [lucene3069 commit] mikemccand
          http://svn.apache.org/viewvc?view=revision&revision=1493516

          LUCENE-3069: add nocommit/TODO

          Commit Tag Bot added a comment -

          [lucene3069 commit] han
          http://svn.apache.org/viewvc?view=revision&revision=1493517

          LUCENE-3069: setField now expose per-field info to term dict

          ASF subversion and git services added a comment -

          Commit 1499744 from Han Jiang
          [ https://svn.apache.org/r1499744 ]

          LUCENE-3069: writer part

          ASF subversion and git services added a comment -

          Commit 1500814 from Han Jiang
          [ https://svn.apache.org/r1500814 ]

          LUCENE-3069: reader part, update logic in outputs

          ASF subversion and git services added a comment -

          Commit 1501811 from Han Jiang
          [ https://svn.apache.org/r1501811 ]

          LUCENE-3069: use more compact outputs i/o

          ASF subversion and git services added a comment -

          Commit 1502152 from Han Jiang
          [ https://svn.apache.org/r1502152 ]

          LUCENE-3069: steal bit to encode TTF

          Han Jiang added a comment - edited

          Uploaded patch; it is the main part of the changes I committed to branch3069.

          The picture shows the current impl of the outputs (fetched from one field in wikimedium5k).

          • long[] (sortable metadata)
          • byte[] (unsortable, generic metadata)
          • df, ttf (term stats)

          A single flag byte is used to indicate which fields the current outputs maintain;
          for a PBF with a short byte[], this should be enough. Also, for long-tail terms, the totalTermFreq
          can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of terms have df == ttf).
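The bit-stealing idea above can be sketched as follows, with a toy vInt writer standing in for Lucene's DataOutput (hypothetical class and method names, not the patch's actual code): the low bit of the encoded docFreq says whether totalTermFreq equals docFreq, so the common df == ttf case needs no second value.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the "inline ttf into df" trick: steal the low bit of the
// vInt-encoded docFreq to say whether totalTermFreq == docFreq.
class StatsCodec {
    // Minimal vLong writer: 7 payload bits per byte, high bit = continuation.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static byte[] encode(int df, long ttf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (df == ttf) {
            writeVLong(out, ((long) df << 1) | 1); // stolen bit set: ttf implied
        } else {
            writeVLong(out, (long) df << 1);       // stolen bit clear...
            writeVLong(out, ttf - df);             // ...so the delta follows
        }
        return out.toByteArray();
    }
}
```

For a long-tail term with df == ttf == 1 this writes a single byte instead of two, which is where the savings quoted above come from.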

          Since the TermsEnum is entirely based on FSTEnum, the performance of the term dict should be similar to
          MemoryPF. However, for PK tasks, we have to pull the docsEnum from MMap, which hurts.

          Following is the performance comparison:

          pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall
          
                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                           Respell       48.13      (4.4%)       15.38      (1.0%)  -68.0% ( -70% -  -65%)
                            Fuzzy2       51.30      (5.3%)       17.47      (1.3%)  -65.9% ( -68% -  -62%)
                            Fuzzy1       52.24      (4.0%)       18.50      (1.2%)  -64.6% ( -67% -  -61%)
                          Wildcard        9.31      (1.7%)        6.16      (2.2%)  -33.8% ( -37% -  -30%)
                           Prefix3       23.25      (1.8%)       19.00      (2.2%)  -18.3% ( -21% -  -14%)
                          PKLookup      244.92      (3.6%)      225.42      (2.3%)   -8.0% ( -13% -   -2%)
                           LowTerm      295.88      (5.5%)      293.27      (4.8%)   -0.9% ( -10% -    9%)
                        HighPhrase       13.62      (6.5%)       13.54      (7.4%)   -0.6% ( -13% -   14%)
                           MedTerm       99.51      (7.8%)       99.19      (7.7%)   -0.3% ( -14% -   16%)
                         MedPhrase      154.63      (9.4%)      154.38     (10.1%)   -0.2% ( -17% -   21%)
                          HighTerm       28.25     (10.7%)       28.25     (10.0%)   -0.0% ( -18% -   23%)
                        OrHighHigh       16.83     (13.3%)       16.86     (13.1%)    0.2% ( -23% -   30%)
                  HighSloppyPhrase        9.02      (4.4%)        9.03      (4.5%)    0.2% (  -8% -    9%)
                         LowPhrase        6.26      (3.4%)        6.27      (4.1%)    0.2% (  -7% -    8%)
                         OrHighMed       13.73     (13.2%)       13.77     (12.8%)    0.3% ( -22% -   30%)
                         OrHighLow       25.65     (13.2%)       25.73     (13.0%)    0.3% ( -22% -   30%)
                   MedSloppyPhrase        6.63      (2.7%)        6.66      (2.7%)    0.5% (  -4% -    6%)
                        AndHighMed       42.77      (1.8%)       43.13      (1.5%)    0.8% (  -2% -    4%)
                   LowSloppyPhrase       32.68      (3.0%)       32.96      (2.8%)    0.8% (  -4% -    6%)
                       AndHighHigh       22.90      (1.2%)       23.18      (0.7%)    1.2% (   0% -    3%)
                       LowSpanNear       29.30      (2.0%)       29.83      (2.2%)    1.8% (  -2% -    6%)
                       MedSpanNear        8.39      (2.7%)        8.56      (2.9%)    2.0% (  -3% -    7%)
                            IntNRQ        3.12      (1.9%)        3.18      (6.7%)    2.1% (  -6% -   10%)
                        AndHighLow      507.01      (2.4%)      522.10      (2.8%)    3.0% (  -2% -    8%)
                      HighSpanNear        5.43      (1.8%)        5.60      (2.6%)    3.1% (  -1% -    7%)
          
          pure TempFST vs. pure Lucene41, on wikimediumall
          
                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                           Respell       49.24      (2.7%)       15.51      (1.0%)  -68.5% ( -70% -  -66%)
                            Fuzzy2       52.01      (4.8%)       17.61      (1.4%)  -66.1% ( -68% -  -63%)
                            Fuzzy1       53.00      (4.0%)       18.62      (1.3%)  -64.9% ( -67% -  -62%)
                          Wildcard        9.37      (1.3%)        6.15      (2.1%)  -34.4% ( -37% -  -31%)
                           Prefix3       23.36      (0.8%)       18.96      (2.1%)  -18.8% ( -21% -  -16%)
                         MedPhrase      155.86      (9.8%)      152.34      (9.7%)   -2.3% ( -19% -   19%)
                         LowPhrase        6.33      (3.7%)        6.23      (4.0%)   -1.6% (  -8% -    6%)
                        HighPhrase       13.68      (7.2%)       13.49      (6.8%)   -1.4% ( -14% -   13%)
                         OrHighMed       13.78     (13.0%)       13.68     (12.7%)   -0.8% ( -23% -   28%)
                  HighSloppyPhrase        9.14      (5.2%)        9.07      (3.7%)   -0.7% (  -9% -    8%)
                        OrHighHigh       16.87     (13.3%)       16.76     (12.9%)   -0.6% ( -23% -   29%)
                         OrHighLow       25.71     (13.1%)       25.58     (12.8%)   -0.5% ( -23% -   29%)
                   MedSloppyPhrase        6.69      (2.7%)        6.67      (2.4%)   -0.3% (  -5% -    4%)
                   LowSloppyPhrase       33.01      (3.2%)       32.99      (2.6%)   -0.1% (  -5% -    5%)
                           MedTerm       99.64      (8.0%)       99.67     (10.9%)    0.0% ( -17% -   20%)
                           LowTerm      294.52      (5.5%)      295.72      (7.2%)    0.4% ( -11% -   13%)
                       LowSpanNear       29.61      (2.6%)       29.76      (2.7%)    0.5% (  -4% -    5%)
                            IntNRQ        3.13      (1.8%)        3.16      (7.8%)    0.8% (  -8% -   10%)
                       MedSpanNear        8.49      (3.0%)        8.57      (3.4%)    0.9% (  -5% -    7%)
                        AndHighMed       42.86      (1.4%)       43.35      (1.4%)    1.1% (  -1% -    3%)
                       AndHighHigh       22.98      (0.6%)       23.26      (0.5%)    1.2% (   0% -    2%)
                      HighSpanNear        5.51      (3.4%)        5.58      (3.4%)    1.3% (  -5% -    8%)
                          HighTerm       28.32     (10.5%)       28.76     (15.0%)    1.6% ( -21% -   30%)
                        AndHighLow      509.60      (2.2%)      526.17      (1.9%)    3.3% (   0% -    7%)
                          PKLookup      156.59      (2.2%)      225.47      (2.8%)   44.0% (  38% -   50%)
          

          To revive the performance on automaton queries, intersect methods should be implemented.

          And index size comparison:
          (actually, after LUCENE-5029, TempBlock has a slightly larger (5%) index size than Lucene41)

                    wikimedium1m         wikimediumall
          Memory       2,212,352            /
          Lucene41       448,164            12,104,520        
          TempFST        525,888            12,770,700
          

          And the term dict size:

                               wikimedium1m         wikimediumall
          Lucene41(.tim+.tip)  157776               2059744
          TempFST(.tmp)        233636               2779784
                               48%                  35%
          

          Some unresolved problems:

          • (struck through) Currently, TempFST uses the default option to build the FST (i.e. doPacked = false); when this option is switched on, the index size on wikimedium1m becomes smaller, but on wikimediumall it becomes larger?

          Robert Muir added a comment -

          Also, for long-tail terms, the totalTermFreq
          can safely be inlined into docFreq (for body field in wikimedium1m, 85.8% terms have df == ttf).

          Cool idea! I wonder how many of those are df == ttf == 1?

          We would currently waste a byte in this case (because we write a vInt for docFreq of 1, and then a vInt of totalTermFreq - docFreq of 0).

          Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?
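The suggestion can be sketched like this (hypothetical helper, not Lucene code): since a real docFreq is always >= 1, the written value 0 is free to signal df == ttf == 1, saving the second vInt for singleton terms.

```java
// Sketch of the "vInt of 0" idea: docFreq 0 never occurs for a real term,
// so it can be repurposed to mean df == ttf == 1.
class SingletonCodec {
    // Returns the vInt values to write: {0} for a singleton term,
    // {df, ttf - df} otherwise.
    static long[] encode(int df, long ttf) {
        if (df == 1 && ttf == 1) {
            return new long[] { 0 };
        }
        return new long[] { df, ttf - df };
    }

    // Inverse mapping: returns {df, ttf}.
    static long[] decode(long[] written) {
        if (written[0] == 0) {
            return new long[] { 1, 1 };
        }
        return new long[] { written[0], written[0] + written[1] };
    }
}
```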

          Han Jiang added a comment -

          Cool idea! I wonder how many of those are df == ttf == 1?

          I didn't try a very precise estimation, but the percentage will be large:

          For the index of wikimedium1m, the largest segment has a 'body' field with:

          bitwidth/7  df==ttf   df
          1           1324400 / 1542987
          2           110     / 18951
          3           0       / 175
          4           0       / 0
          5           0       / 0
          

          That is where the 85.8% comes from. 'bitwidth/7' means 'ceil(bitwidth of df / 7)', since we're using vInt encoding.
          So, for this field, we can save (1324400 + 110*2) bytes by stealing one bit.

          Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?

          Yes, that may help! I'll try to test the percentage. But we should still note that df is a small part of the term dict data.

          Han Jiang added a comment - edited

          I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

          Here is the bit width summary for "body" field:

          bit #(df==ttf) #df #ttf
          1 43532656 48860170 43532656
          2 10328824 13979539 16200377
          3 2682453 5032450 6532755
          4 836109 2471794 3134437
          5 262696 1324704 1718862
          6 86487 755797 990563
          7 29276 442974 571996
          8 11257 263874 339382
          9 4627 161402 205662
          10 2060 102198 128034
          11 979 63955 79531
          12 386 39377 48805
          13 170 24321 30113
          14 65 14686 18437
          15 10 9055 10918
          16 2 5229 6821
          17 0 2669 3595
          18 0 1312 1897
          19 0 696 914
          20 0 209 509
          21 0 44 148
          22 0 4 38
          23 0 0 8
          24 0 0 1
          ... 0 0 0
          tot 57778057 73556459 73556459

          So we have 66.4% of docFreqs with df==1, and 78.5% with df==ttf.
          Using the following estimation, the old size for (df+ttf) here is 148.7MB.

          When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
          When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!

          old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
          new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
          opt_size = col[2] * vIntByteSize(rownumber)   + ((rownumber == 1) ? 0 : col[3] * vIntByteSize(rownumber))
          
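In the estimation above, vIntByteSize(rownumber) is simply rownumber bytes, since the table rows are already bucketed by ceil(bitwidth/7). Computed from a raw value rather than a bucket, it would look like this (hypothetical helper name, a sketch rather than Lucene's own code):

```java
// A vInt/vLong stores 7 payload bits per byte, so a value occupies
// ceil(bitwidth / 7) bytes - which is exactly the 'bitwidth/7' bucket
// (the row number) used in the tables above.
class VIntSize {
    static int vIntByteSize(long v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) {
            bytes++;
        }
        return bytes;
    }
}
```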

          By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
          When we know that df==ttf, we can be sure that every in-doc freq==1. So, for example,
          when the bit width ranges from 2 to 8 (inclusive), since df is not large enough to create ForBlocks,
          we have to vInt-encode each in-doc freq. For this 'body' field, I think the index size we can reduce is about 67.5MB
          (here I only consider the vInt block, since a 1-bit ForBlock is usually small). (Ah, I forgot we already steal a bit for this case in Lucene41PBF.)

          I'll test this later.

          Han Jiang added a comment - edited

          Uploaded detailed data for wikimediumall.

          Oh, sorry, there was an error when I calculated the index size for the df==0 trick; it should be 105MB instead of 70MB.

          But the real test is still beyond the estimation (weird...); the df==0 trick gains similar compression.

          Index sizes are below (KB):

          v0:                   13195304
          v1 = v0 + flag byte:  12847172
          v2 = v1 + steal bit:  12770700
          v3 = v1 + zero df:    12780884
          

          Another thing that surprised me: with the same code/conf, luceneutil creates different index sizes? I tested the df==0 trick several times on wikimedium1m; the index size varies from 514M to 522M... Does multi-threading affect much here?

          ASF subversion and git services added a comment -

          Commit 1502991 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1502991 ]

          LUCENE-3069: remove redundant info for fields without payload

          Michael McCandless added a comment -

          Another thing that surprised me: with the same code/conf, luceneutil creates different index sizes? I tested the df==0 trick several times on wikimedium1m; the index size varies from 514M to 522M... Does multi-threading affect much here?

          Using threads means the docs are assigned to different segments each time you run ... it's interesting this can cause such variance in the index size though.

          It is known that e.g. sorting docs by web site (if you are indexing content from different sites) can give good compression; maybe that's the effect we're seeing here?

          Michael McCandless added a comment -

          The new code on the branch looks great! I can't wait to see perf results after we
          implement .intersect()...

          Some small stuff in TempFSTTermsReader.java:

          • In next(), when we handle seekPending=true, I think we should assert
            that the seekCeil returned SeekStatus.FOUND? Ie, it's not
            possible to seekExact(TermState) to a term that doesn't exist.
          • useCache is an ancient option from back when we had a terms dict
            cache; we long ago removed it ... I think we should remove
            useCache parameter too?
          • It's silly that fstEnum.seekCeil doesn't return a status, ie that
            we must re-compare the term we got to differentiate FOUND vs
            NOT_FOUND ... so we lose some perf here. But this is just a
            future TODO ...
          • "nocommit: this method doesn't act as 'seekExact' right?" – not
            sure why this is here; seekExact is working as it should I think.
          • Maybe instead of term and meta members, we could just hold the
            current pair?

          In TempTermOutputs.java:

          • longsSize, hasPos can be final? (Same with TempMetaData's fields)
          • TempMetaData.hashCode() doesn't mix in docFreq/tTF?
          • It doesn't impl equals (must it really impl hashCode?)
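The equals/hashCode concern in the review above (TempMetaData's hashCode not mixing in docFreq/tTF, and equals being required so NodeHash can confirm that two nodes really match) can be illustrated with a toy stand-in class. This is not the branch's actual TempMetaData; the class name and fields are only modeled on the long[]/byte[]/docFreq/totalTermFreq metadata discussed in the thread:

```java
import java.util.Arrays;

// Hypothetical stand-in for an FST output value carrying per-term metadata.
// The point: hashCode() must mix in docFreq/totalTermFreq, and equals() must
// agree with hashCode(), or node sharing in the FST builder can degrade.
final class TermMetaData {
    final long[] longs;        // e.g. file pointers
    final byte[] bytes;        // opaque postings metadata
    final int docFreq;
    final long totalTermFreq;

    TermMetaData(long[] longs, byte[] bytes, int docFreq, long totalTermFreq) {
        this.longs = longs;
        this.bytes = bytes;
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    @Override
    public int hashCode() {
        int h = Arrays.hashCode(longs);
        h = 31 * h + Arrays.hashCode(bytes);
        h = 31 * h + docFreq;                  // mix in the stats, per the review comment
        h = 31 * h + Long.hashCode(totalTermFreq);
        return h;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof TermMetaData)) return false;
        TermMetaData o = (TermMetaData) other;
        return docFreq == o.docFreq
            && totalTermFreq == o.totalTermFreq
            && Arrays.equals(longs, o.longs)
            && Arrays.equals(bytes, o.bytes);
    }
}
```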
          Han Jiang added a comment - - edited

          I think we should assert that the seekCeil returned SeekStatus.FOUND?

          Ok! I'll commit that.

          useCache is an ancient option from back when we had a terms dict cache

          Yes, I suppose it is not 'clear' to have this parameter.

          seekExact is working as it should I think.

          Currently, I think those 'seek' methods are supposed to change the enum pointer based on
          input term string, and fetch related metadata from term dict.

          However, seekExact(BytesRef, TermState) simply 'copies' the value of termState into the enum, which
          doesn't actually perform a 'seek' on the dictionary.

          Maybe instead of term and meta members, we could just hold the current pair?

          Oh, yes, I once thought about this, but I'm not sure: like, can the callee always make sure that,
          when 'term()' is called, it will always return a valid term?
          The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is it safe?

          TempMetaData.hashCode() doesn't mix in docFreq/tTF?

          Oops! thanks, nice catch!

          It doesn't impl equals (must it really impl hashCode?)

          Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two fst nodes can be 'merged'.
          Oops, I forgot it still relies on equals to make sure two instances really match, ok, I'll add that.

          By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they contain the same metadata + stats,
          it is very seldom that their arcs can be identical in the FST (the index increases by less than 1MB for wikimedium1m if
          equals always returns false for non-singleton arguments). Therefore... yes, hashCode() isn't necessary here.

          Han Jiang added a comment - - edited

          Patch according to previous comments.

          We still somewhat need the existence of
          hashCode(), because NodeHash will
          check whether the frozen node has the same
          hashCode as the uncompiled node (NodeHash.java:128).

          Although later, for nodes with outputs, it will hardly
          ever find a matching node in the hashtable.

          Han Jiang added a comment -

          Patch: revert hashCode()

          Michael McCandless added a comment -

          However, seekExact(BytesRef, TermState) simply 'copies' the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

          This is normal / by design. It's so that the case of seekExact(TermState) followed by .docs or .docsAndPositions is fast. We only need to re-load the metadata if the caller then tries to do .next()

          Maybe instead of term and meta members, we could just hold the current pair?

          Oh, yes, I once thought about this, but I'm not sure: like, can the callee always make sure that,
          when 'term()' is called, it will always return a valid term?
          The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is it safe?

          We can't guarantee that, but I think we can just check if pair == null and return null from term()?

          By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they contain the same metadata + stats,
          it is very seldom that their arcs can be identical in the FST (the index increases by less than 1MB for wikimedium1m if
          equals always returns false for non-singleton arguments). Therefore... yes, hashCode() isn't necessary here.

          Hmm, but it seems like we should implement it? Ie we do get a smaller FST when implementing it?

          ASF subversion and git services added a comment -

          Commit 1503781 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1503781 ]

          LUCENE-3069: remove some nocommits, update hashCode() & equal()

          ASF subversion and git services added a comment -

          Commit 1503797 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1503797 ]

          LUCENE-3069: merge trunk changes over

          Han Jiang added a comment -

          Uploaded patch: implemented IntersectEnum.next() & seekCeil().
          Lots of nocommits, but all tests pass.

          The main idea is to run a DFS on the FST, and backtrack as early as
          possible (i.e. as soon as we see that a label is rejected by the automaton).

          For this version, there is one explicit perf overhead: I use a
          real stack here, which can be replaced by a Frame[] to reuse objects.

          There're several aspects I didn't dig deep:

          • currently, CompiledAutomaton provides a commonSuffixRef, but how
            can we make use of it in FST?
          • the DFS is somewhat a 'goto' version, i.e. we could make the code
            cleaner with a single while-loop similar to a BFS search.
            However, since the FST doesn't always tell us how many arcs are leaving
            the current node, we have trouble dealing with this...
          • when the FST is large enough, the next() operation will take much time
            doing the linear arc read; maybe we should make use of
            CompiledAutomaton.sortedTransition[] when leaving arcs are heavy.
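The DFS-with-early-backtracking idea described above can be sketched on a toy trie (standing in for the FST) with a minimal automaton interface. Everything below is illustrative, not the branch's actual IntersectEnum code; class and method names are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrieIntersect {
    // Minimal trie node: sorted child labels -> nodes, plus an end-of-term flag.
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    // Reduced automaton: a step function over int states; -1 means rejected.
    interface Automaton {
        int step(int state, char label);
        boolean isAccept(int state);
    }

    static Node build(String... terms) {
        Node root = new Node();
        for (String t : terms) {
            Node n = root;
            for (char c : t.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isTerm = true;
        }
        return root;
    }

    // DFS: collect all terms accepted by the automaton. The key point is the
    // early backtrack: a rejected label prunes its entire subtree at once.
    static void intersect(Node node, Automaton a, int state,
                          StringBuilder prefix, List<String> out) {
        if (node.isTerm && a.isAccept(state)) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            int next = a.step(state, e.getKey());
            if (next < 0) continue;            // label rejected: skip whole subtree
            prefix.append(e.getKey());
            intersect(e.getValue(), a, next, prefix, out);
            prefix.setLength(prefix.length() - 1);  // backtrack
        }
    }
}
```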
          Michael McCandless added a comment -

          Patch looks great! Wonderful how you were able to share some code in
          BaseTermsEnum...

          It looks like you impl'd seekCeil in general for the IntersectEnum? Wild

          You should not need to .getPosition / .setPosition on the fstReader:
          the FST APIs do this under-the-hood.

          currently, CompiledAutomaton provides a commonSuffixRef, but how can we make use of it in FST?

          I think we can't really make use of it, which is fine (it's an
          optional optimization).

          when the FST is large enough, the next() operation will take much time
          doing the linear arc read; maybe we should make use of
          CompiledAutomaton.sortedTransition[] when leaving arcs are heavy.

          Interesting ... you mean e.g. if the Automaton is very restrictive
          compared to the FST, then we can do a binary search. But this can
          only be done if that FST node's arcs are array'd right?
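The binary-search idea above (possible only when a node's arcs are stored as a sorted array) is simple to sketch. This is a toy illustration, not Lucene's arc-reading code; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class ArcSearch {
    // Given a node's outgoing arc labels stored as a sorted byte array, find the
    // index of the first label >= target in O(log n), or -1 if none exists.
    // This replaces the linear arc scan when arcs are array'd.
    static int ceilArc(byte[] sortedLabels, byte target) {
        int idx = Arrays.binarySearch(sortedLabels, target);
        if (idx < 0) {
            idx = -idx - 1;  // binarySearch returns -(insertionPoint) - 1 on miss
        }
        return idx < sortedLabels.length ? idx : -1;
    }
}
```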

          Separately, supporting ord w/ FST terms dict should in theory be not
          so hard; you'd need to use getByOutput to seek by ord. Maybe (later,
          eventually) we can make this a write-time option. We should open a
          separate issue ...

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Han Jiang added a comment -

          You should not need to .getPosition / .setPosition on the fstReader:

          Oh, yes! I'll fix.

          I think we can't really make use of it, which is fine (it's an optional optimization).

          OK, actually I was quite curious why we don't make use of commonPrefixRef
          in CompiledAutomaton. Maybe we can determinize the input Automaton first, then
          get commonPrefixRef via SpecialOperation? Is it too slow, or is the prefix not
          always long enough to be worth it?

          But this can only be done if that FST node's arcs are array'd right?

          Yes, array arcs only, and we might need methods like advance(label) to do the search,
          and here gossip search might work better than traditional binary search.

          Separately, supporting ord w/ FST terms dict should in theory be not
          so hard; you'd need to use getByOutput to seek by ord. Maybe (later,
          eventually) we can make this a write-time option. We should open a
          separate issue ...

          Ah, yes, but it seems that getByOutput doesn't rewind/reuse previous state?
          We always have to start from the first arc on every seek. However, I'm
          not sure in what kind of use case we need the ord information.

          I'll commit current version first, so we can iterate.

          ASF subversion and git services added a comment -

          Commit 1506385 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1506385 ]

          LUCENE-3069: support intersect operations

          ASF subversion and git services added a comment -

          Commit 1506389 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1506389 ]

          LUCENE-3069: no need to reseek FSTReader, update nocommits

          ASF subversion and git services added a comment -

          Commit 1506439 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1506439 ]

          LUCENE-3069: stack reuses objects during DFS

          ASF subversion and git services added a comment -

          Commit 1506612 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1506612 ]

          LUCENE-3069: accumulate metadata lazily

          Han Jiang added a comment -

          The previous design put much stress on the decoding of Outputs.
          This becomes a disaster for wildcard queries: e.g. for f*nd,
          we usually have to walk to the last character in the FST, only to
          find that it is not 'd' and the automaton doesn't accept it.
          In this case, TempFST is actually iterating over all the results
          of f*, decoding all their metadata...

          So I'm trying another approach; the main idea is to load
          metadata & stats as lazily as possible.
          Here I use an FST<Long> as the term index, and leave all other stuff
          in a single term block. The term index FST holds the mapping
          <Term, Ord>, and in the term block we can maintain a skip list
          to find the related metadata & stats.

          It is a little similar to BTTR now, and we can someday control how much
          data to keep memory resident (e.g. keep stats in memory but metadata on
          disk; however that should be another issue).
          Another good part is that it naturally supports seek by ord (ah,
          actually I don't understand where it is used).

          Tests pass, though intersect is not implemented yet.
          Perf based on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          PKLookup      373.80      (0.0%)      320.30      (0.0%)  -14.3% ( -14% -  -14%)
                            Fuzzy1       43.82      (0.0%)       47.10      (0.0%)    7.5% (   7% -    7%)
                           Prefix3      399.62      (0.0%)      433.95      (0.0%)    8.6% (   8% -    8%)
                            Fuzzy2       14.26      (0.0%)       15.95      (0.0%)   11.9% (  11% -   11%)
                           Respell       40.69      (0.0%)       46.29      (0.0%)   13.8% (  13% -   13%)
                          Wildcard       83.44      (0.0%)       96.54      (0.0%)   15.7% (  15% -   15%)
          

          The perf hit on PKLookup should be sane, since I haven't optimized the skip list.

          I'll update intersect() later, and then we'll cut over to
          PagedBytes & PackedLongBuffer.

          Michael McCandless added a comment -

          Wow, those are nice perf results, without implementing intersect!

          Intersect really is an optional operation, so we could stop here/now and button everything up

          I like this approach: you moved all the metadata (docFreq, totalTermFreq, long[] and byte[] from the PostingsFormatBase) into blocks, and then when we really need a term's metadata we go to its block and scan for it (like block tree).

          I wonder if we could use MonotonicAppendingLongBuffer instead of long[] for the in-memory skip data? Right now it's I think 48 bytes per block (block = 128 terms), so I guess that's fairly small (.375 bytes per term).

          It is a little similar to BTTR now, and we can someday control how much
          data to keep memory resident (e.g. keep stats in memory but metadata on
          disk, however this should be another issue).

          That's a nice (future) plus; this way the app can keep "only" the terms+ords in RAM, and leave all term metadata on disk. But this is definitely optional for the project and we should separately explore it ...

          Another good part is, it naturally supports seek by ord.(ah,
          actually I don't understand where it is used).

          This is also a nice side-effect!

          David Smiley added a comment -

          Nice work! The spatial prefix trees will have even more awesome performance with all terms in RAM. It'd be nice if I could configure the docFreq to be memory resident but, as Mike said, adding options like that can be explored later.

          ASF subversion and git services added a comment -

          Commit 1508705 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1508705 ]

          LUCENE-3069: add TempFSTOrd, with FST index + specialized block

          Han Jiang added a comment -

          Patch, revive IntersectTermsEnum in TempFSTOrd.

          Mike, since we already have an intersect() impl, maybe we can still keep this? By the way, it is easy to migrate from TempFST to TempFSTOrd.

          Han Jiang added a comment -

          Performance results after the last patch (intersect) is applied.

          On wiki 33M data, between TempFST(with intersect) and TempFSTOrd(with intersect):

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          PKLookup      232.47      (1.0%)      205.28      (2.0%)  -11.7% ( -14% -   -8%)
                           Prefix3       26.93      (1.2%)       28.40      (1.4%)    5.5% (   2% -    8%)
                          Wildcard        6.75      (2.1%)        7.37      (1.5%)    9.2% (   5% -   13%)
                            Fuzzy1       29.86      (1.8%)       51.87      (3.7%)   73.7% (  67% -   80%)
                            Fuzzy2       30.82      (1.6%)       53.82      (2.7%)   74.7% (  69% -   80%)
                           Respell       27.30      (1.2%)       49.55      (2.6%)   81.5% (  76% -   86%)
          

          So the decoding of outputs is really the main cost.

          And now we should start to compare it with trunk (base=Lucene41, comp=TempFSTOrd).
          Hmm, I must have done something wrong with the wildcard query here:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          Wildcard       19.21      (2.1%)        7.30      (0.3%)  -62.0% ( -63% -  -60%)
                           Prefix3       33.69      (1.2%)       28.18      (0.9%)  -16.4% ( -18% -  -14%)
                            Fuzzy1       61.59      (2.1%)       52.36      (0.8%)  -15.0% ( -17% -  -12%)
                            Fuzzy2       60.94      (1.0%)       54.15      (1.3%)  -11.1% ( -13% -   -8%)
                           Respell       54.21      (2.8%)       49.54      (1.2%)   -8.6% ( -12% -   -4%)
                          PKLookup      148.40      (1.0%)      208.07      (3.6%)   40.2% (  35% -   45%)
          

          I'll commit current version so we can iterate on it.

          ASF subversion and git services added a comment -

          Commit 1508744 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1508744 ]

          LUCENE-3069: introduce intersect() to TempFSTOrd

          Michael McCandless added a comment -

          Mike, since we already have an intersect() impl, maybe we can still keep this?

          +1

          It's odd that WildcardQuery is so angry; I wonder if it's because we can't use the commonSuffix opto. Maybe try testing on a different wildcard query, e.g. something like a*b* (that does not have a commonSuffix)?

          Han Jiang added a comment -

          Maybe try testing on a different wildcard query, e.g. something like a*b* (that does not have a commonSuffix)?

          I replaced all the ab*c in the tasks file with ab*c*, but the performance hit is still heavy:

          33M wikidata, Lucene41 vs. TempFSTOrd

          Wildcard        7.40      (1.9%)        4.63      (1.2%)  -37.5% ( -39% -  -34%)
          
          Han Jiang added a comment -

          Uploaded patch.

          It is optimized for wildcard query, and I did a quick test on 1M wiki data:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          PKLookup      314.63      (1.5%)      314.64      (1.2%)    0.0% (  -2% -    2%)
                            Fuzzy1       91.32      (3.7%)       92.50      (1.6%)    1.3% (  -3% -    6%)
                           Respell      104.54      (3.9%)      106.97      (1.6%)    2.3% (  -2% -    8%)
                            Fuzzy2       38.22      (4.1%)       39.16      (1.2%)    2.5% (  -2% -    8%)
                          Wildcard      109.56      (3.1%)      273.42      (5.0%)  149.6% ( 137% -  162%)
          

          and TempFSTOrd vs. Lucene41, on 1M data:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                           Respell      134.85      (3.7%)      106.30      (0.6%)  -21.2% ( -24% -  -17%)
                            Fuzzy2       47.78      (4.1%)       39.03      (0.9%)  -18.3% ( -22% -  -13%)
                            Fuzzy1      112.02      (3.0%)       91.95      (0.6%)  -17.9% ( -20% -  -14%)
                          Wildcard      326.68      (3.5%)      273.41      (1.9%)  -16.3% ( -20% -  -11%)
                          PKLookup      194.61      (1.8%)      314.24      (0.7%)   61.5% (  57% -   65%)
          

          But I'm not happy with it: the hack here is to consume another big block that stores the last byte of each term. So for a wildcard query like ab*c, we have external information that tells us the ord of the nearest term ending in c. Knowing the ord, we can use an approach similar to getByOutput to jump to the next target term.

          Previously, we had to walk the FST to the stop node to find out whether the last byte is 'c', so this optimization is a big win.
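          The hack can be illustrated with a toy model (plain Java, not the actual patch code; LastByteIndex and its method names are made up for illustration): keep the last byte of each term in a parallel, ord-indexed array, so a query like ab*c can skip directly to the next ord whose term ends with 'c' instead of walking the FST to a stop node for every candidate.

          ```java
          import java.util.List;

          // Toy sketch: alongside the sorted term list (ords 0..N-1), store the
          // last byte of each term in a parallel array. For a pattern like
          // "ab*c" we scan this small array to find the next candidate ord whose
          // term ends with 'c'.
          class LastByteIndex {
              private final byte[] lastBytes; // lastBytes[ord] = last byte of term #ord

              LastByteIndex(List<String> sortedTerms) {
                  lastBytes = new byte[sortedTerms.size()];
                  for (int ord = 0; ord < sortedTerms.size(); ord++) {
                      String t = sortedTerms.get(ord);
                      lastBytes[ord] = (byte) t.charAt(t.length() - 1);
                  }
              }

              /** Returns the first ord >= fromOrd whose term ends with wanted, or -1. */
              int nextEndingWith(int fromOrd, byte wanted) {
                  for (int ord = Math.max(fromOrd, 0); ord < lastBytes.length; ord++) {
                      if (lastBytes[ord] == wanted) return ord;
                  }
                  return -1;
              }
          }
          ```
          
          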

          However, I don't really like this patch: it increases the index size (521M => 530M), and the code becomes messy, since we always have to look ahead to the next arc on the current stack.

          ASF subversion and git services added a comment -

          Commit 1513336 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1513336 ]

          LUCENE-3069: merge trunk changes over

          Han Jiang added a comment -

          Hi, currently we have a problem when migrating the code to trunk:

          The API refactoring on PostingsReader/WriterBase now splits term metadata into two parts:
          a monotonic long[] and a generic byte[]; the former is known to the term dictionary so it
          can do better d-gap encoding.

          So we need a 'longsSize' in the field summary to tell the reader the fixed length of this
          monotonic long[]. However, this API change actually breaks backward compatibility: the old
          4.x indices didn't support this, and for some codecs like Lucene40, since their writer
          parts are already deprecated, their tests won't pass.

          It seems like we could put all the metadata in the generic byte[] and let the PBF do its own
          buffering (like we did in the old API: nextTerm()), but then we'd have to add logic for this in every PBF.

          So... can we solve this problem more elegantly?
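          The split can be sketched with a toy model (MetadataEncoder is a hypothetical name, not Lucene's actual API): the postings writer declares longsSize up front and emits, per term, a monotonic long[] (e.g. file pointers) plus an opaque byte[]; the term dictionary then delta-encodes (d-gaps) the longs itself and appends the bytes untouched.

          ```java
          // Toy sketch of the metadata split discussed above: the term dictionary
          // owns the d-gap encoding of the monotonic long[], because it knows the
          // fixed longsSize per field; the opaque byte[] would be copied as-is.
          class MetadataEncoder {
              private final long[] last; // previous absolute values, for d-gap encoding

              MetadataEncoder(int longsSize) {
                  last = new long[longsSize]; // longsSize is fixed per field
              }

              /** Returns the delta-encoded longs for one term. */
              long[] encode(long[] absolute) {
                  long[] deltas = new long[last.length];
                  for (int i = 0; i < last.length; i++) {
                      deltas[i] = absolute[i] - last[i]; // monotonic => non-negative
                      last[i] = absolute[i];
                  }
                  return deltas;
              }
          }
          ```
          
          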

          Han Jiang added a comment -

          Patch with backward compatibility fix on Lucene41PBF (TempPostingsReader is actually a fork of Lucene41PostingsReader).

          Han Jiang added a comment -

          Patch, update BlockTerms dict so that it follows refactored API.

          ASF subversion and git services added a comment -

          Commit 1514253 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1514253 ]

          LUCENE-3069: API refactoring on BlockTerms dict

          ASF subversion and git services added a comment -

          Commit 1515469 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1515469 ]

          LUCENE-3069: API refactoring on Pulsing PF

          ASF subversion and git services added a comment -

          Commit 1516365 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1516365 ]

          LUCENE-3069: API refactoring on Sep/IntBlock PF

          ASF subversion and git services added a comment -

          Commit 1516677 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1516677 ]

          LUCENE-3069: API refactoring on MockRandom, revert supress codec in compatibility test

          ASF subversion and git services added a comment -

          Commit 1516742 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1516742 ]

          LUCENE-3069: API refactoring on Lucene40RW

          Han Jiang added a comment -

          Patch, it will show how current codecs (Block/BlockTree + Lucene4X/Pulsing/Mock*) are changed according to our API refactoring. TestBackwardsCompatibility still fails, and I'll work on the impersonation later.

          Michael McCandless added a comment -

          Patch looks great on quick look! I'll look more when I'm back
          online...

          One thing: I think e.g. BlockTreeTermsReader needs some back-compat
          code, so it won't try to read longsSize on old indices?

          ASF subversion and git services added a comment -

          Commit 1516860 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1516860 ]

          LUCENE-3069: merge 'temp' codes back

          Han Jiang added a comment -

          Patch looks great on quick look! I'll look more when I'm back

          online...

          OK! I committed it so that we can see later changes.

          One thing: I think e.g. BlockTreeTermsReader needs some back-compat

          code, so it won't try to read longsSize on old indices?

          Yes, both Block* term dicts will have a new VERSION constant to mark the
          change, and if the codec header shows a previous version, they will not read
          that longsSize VInt.
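          The version gate can be sketched like this (a minimal sketch with hypothetical names and constants, not the actual BlockTreeTermsReader code): only indices written at or after the metadata-refactor version carry the longsSize VInt, so old 4.x indices simply skip the read.

          ```java
          // Sketch of the back-compat version check described above. The "stream"
          // is modeled as an iterator of ints for simplicity.
          class FieldSummaryReader {
              static final int VERSION_META_ARRAY = 1; // hypothetical version constant

              /** Returns longsSize: read it only when the index version supports it. */
              static int readLongsSize(int indexVersion, java.util.Iterator<Integer> in) {
                  if (indexVersion >= VERSION_META_ARRAY) {
                      return in.next(); // new indices store longsSize per field
                  }
                  return 0; // old 4.x indices: no monotonic longs, all in byte[]
              }
          }
          ```
          
          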

          ASF subversion and git services added a comment -

          Commit 1517792 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1517792 ]

          LUCENE-3069: add version check for impersonation

          Han Jiang added a comment -

          Patch, to show the impersonation hack for Pulsing format.

          We cannot perfectly impersonate the old pulsing format yet: the old format divided the metadata block into inlined bytes and wrapped bytes, so when the term dict reader reads the length of the metadata block, it is actually the length of the 'inlined block'... And the 'wrapped block' won't be loaded for the wrapped PF.

          However, introducing a new method in PostingsReaderBase doesn't seem like a good way to do it...

          Michael McCandless added a comment -

          PostingsReaderBase.pulsed is quite crazy ... really the terms dict
          should not need this information, ideally.

          Pulsing has no back-compat guarantees, so it's fine to only support
          writing the "new" format and being able to read it. Ie, if this
          change is only for impersonation then we shouldn't need to do it, I
          think?

          Also, this is spooky:

          int start = (int)in.getFilePointer();
          

          Isn't that unsafe in general? Ie it could overflow int...
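          The concern is standard Java narrowing: casting a long file pointer past 2GB (Integer.MAX_VALUE) to int silently keeps only the low 32 bits and wraps negative. A minimal demonstration (FilePointerCast is a made-up name for illustration):

          ```java
          // Demonstrates why (int) in.getFilePointer() is unsafe: the narrowing
          // conversion truncates the long to its low 32 bits, so pointers past
          // 2GB wrap to negative values.
          class FilePointerCast {
              static int unsafeCast(long filePointer) {
                  return (int) filePointer; // truncates to low 32 bits
              }
          }
          ```
          
          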

          ASF subversion and git services added a comment -

          Commit 1518989 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1518989 ]

          LUCENE-3069: merge trunk changes

          ASF subversion and git services added a comment -

          Commit 1519542 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1519542 ]

          LUCENE-3069: update javadocs, fix impersonator bug

          Han Jiang added a comment -

          The uploaded patch should show all the changes against trunk: I added two different implementations of term dict, and refactored the PostingsBaseFormat to plug in non-block based term dicts.

          I'm still working on the javadocs, and maybe we should rename that 'temp' package, like 'fstterms'?

          ASF subversion and git services added a comment -

          Commit 1519909 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1519909 ]

          LUCENE-3069: javadocs

          Michael McCandless added a comment -

          Thanks for uploading the diffs against trunk, Han; I'll review this.

          Can you explain the two new terms dict impls? And maybe write up a brief summary of all the changes (to help others understand the patch)?

          Maybe we can put the new "all in memory" terms dict impls under oal.codecs.memory? FSTTerms* seems like a good name? (Just because in the future maybe we have other impls of "all in memory" terms dicts)...

          Han Jiang added a comment -

          OK! These two term dicts are both FST-based:

          • FST term dict directly uses FST to map term to its metadata & stats (FST<TermData>)
          • FSTOrd term dict uses FST to map term to its ordinal number (FST<Long>), and the ordinal is then used to seek metadata from another big chunk.

          I prefer the second impl since it puts much less stress on FST.

          I have updated the detailed format explanation in the last commit. Hmm, I'll create another patch for this...
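          The two designs can be contrasted with a toy model (a TreeMap stands in for the FST here; the real impls use FST&lt;TermData&gt; and FST&lt;Long&gt;, and OrdTermDict is a made-up name): the FSTOrd variant keeps the FST outputs tiny (just ordinals) and moves all metadata into a separate ord-indexed chunk.

          ```java
          import java.util.TreeMap;

          // Toy model of the FSTOrd design: term -> ord via the "FST" stand-in,
          // then ord indexes into a separate metadata block, which in the real
          // impl is a big chunk decoded lazily.
          class OrdTermDict {
              private final TreeMap<String, Long> termToOrd = new TreeMap<>(); // "FST<Long>"
              private final long[] metadataBlock; // ord-indexed metadata chunk

              OrdTermDict(String[] sortedTerms, long[] metadata) {
                  for (int ord = 0; ord < sortedTerms.length; ord++) {
                      termToOrd.put(sortedTerms[ord], (long) ord);
                  }
                  metadataBlock = metadata;
              }

              /** Two-step lookup: term -> ord, then ord -> metadata; -1 if absent. */
              long lookup(String term) {
                  Long ord = termToOrd.get(term);
                  return ord == null ? -1 : metadataBlock[ord.intValue()];
              }
          }
          ```
          
          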

          David Smiley added a comment -

          I like FSTOrd as well. Presumably this one also exposes it via TermsEnum.ord()?

          Han Jiang added a comment -

          Yes, with slight changes, it can support seek by ord. (With FST.getByOutput).

          ASF subversion and git services added a comment -

          Commit 1520034 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1520034 ]

          LUCENE-3069: move TermDict impls to package 'memory', nuke all 'Temp' symbols

          Han Jiang added a comment -

          Patch from last commit, and summary:

          Previously our term dictionaries were both block-based:

          • BlockTerms dict breaks terms list into several blocks, as a linear
            structure with skip points.
          • BlockTreeTerms dict uses a trie-like structure to decide how terms are
            assigned to different blocks, and uses an FST index to optimize seeking
            performance.

          However, those two kinds of term dictionary don't hold all the term
          data in memory. In the worst case there will be at least two seeks:
          one into the index in memory, another into the file on disk. And we already
          have many complicated optimizations for this...

          If by design a term dictionary can be memory resident, the data structure
          will be simpler (after all, we don't need to maintain extra file pointers for
          a second seek, and we don't have to decide heuristics for how terms
          are clustered). And this is why those two FST-based implementations are
          introduced.

          Another big change in the code: since our term dictionaries were both
          block-based, the previous API was also limited. It was the postings writer
          that collected term metadata, and the term dictionary that told the postings
          writer the range of terms it should flush to a block. However, the encoding
          of term data should be decided by the term dictionary, since the postings
          writer doesn't always know how terms are structured in the term dictionary...
          The previous API had some tricky code for this, e.g. PulsingPostingsWriter had
          to use a term's ordinal within a block to decide how to write metadata, which is
          unnecessary.

          To make the API between term dict and postings list more 'pluggable' and
          'general', I refactored PostingsReader/WriterBase. For example, the
          postings writer now provides some information to the term dictionary, such as
          how many metadata values are strictly monotonic, so that the term dictionary
          can optimize the delta-encoding itself. And since the term dictionary now fully
          decides how metadata are written, it gains the ability to use
          intblock-based metadata encoding.

          Now the two implementations of the term dictionary can easily be plugged into
          current postings formats, like:

          • FST41 = FSTTermdict + Lucene41PostingsBaseFormat
          • FSTOrd41 = FSTOrdTermdict + Lucene41PostingsBaseFormat
          • FSTOrdPulsing41 = FSTOrdTermsdict + PulsingPostingsWrapper + Lucene41PostingsFormat

          About performance: as shown before, those two term dicts improve primary
          key lookup, but still have overhead on wildcard queries (both term dicts
          keep only prefix information, and the term dictionary cannot work well with
          this...). I'll try to hack on this later.

          ASF subversion and git services added a comment -

          Commit 1520422 from Michael McCandless in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1520422 ]

          LUCENE-3069: small javadoc fixes

          Michael McCandless added a comment -

          Patch looks great. It's nice how postings writers no longer need
          their own redundant PendingTerm instances to track the term's metadata
          / blocking; just use their existing TermState class instead. And how
          postings readers don't have to deal w/ blocking either.

          In general, couldn't the writer re-use the reader's TermState?
          E.g. Lucene40PostingsWriter just use Lucene40PostingsReader's
          StandardTermState, rather than make its own? (And same for
          Lucene41PostingsWriter/Reader).

          Have you run "first do no harm" perf tests? Ie, compare current trunk
          w/ default Codec to branch w/ default Codec? Just to make sure there
          are no surprises...

          Why does Lucene41PostingsWriter have "impersonation" code? Was that
          just for debugging during dev? Can we remove it (it should always
          write the current format)? The reader needs it of course ... but it
          shouldn't be commented as "impersonation" but as back-compat?

          In the javadocs for encodeTerm, don't we require that the long[] are
          always monotonic? It's not "optional"? Also, "monotonical" should be
          "monotonic" there.

          Maybe we should add a "reset" method to each PF's TermState, so
          instead of doing newTermState() when absolute, we can .reset(), and
          likewise in the reader.

          I forget: why does the postings reader/writer need to handle delta
          coding again (take an absolute boolean argument)? Was it because of
          pulsing or sep? It's fine for now (progress not perfection) ... but
          not clean, since "delta coding" is really an encoding detail so in
          theory the terms dict should "own" that ...

          "monotonical" appears several times but I think it should instead be
          "monotonic".

          The new .smy file for Pulsing is sort of strange ... but necessary
          since it always uses 0 longs, so we have to store this somewhere
          ... you could put it into FieldInfo attributes instead?

          It's nice how small the FST terms dicts are! Much simpler than the
          hairy BlockTree code...

          Should we backport this to 4.x? In theory this should not be so hard
          ... 3.x indices already have their own PF impls, and the change is
          back-compatible to current 4.x indices ...

          Han Jiang added a comment -

          Mike, thanks for the review!

          In general, couldn't the writer re-use the reader's TermState?

          I'm afraid this somewhat makes the code longer? I'll make a patch to see.

          Have you run "first do no harm" perf tests? Ie, compare current trunk
          w/ default Codec to branch w/ default Codec? Just to make sure there
          are no surprises...

          Yes, no surprise yet.

          Why does Lucene41PostingsWriter have "impersonation" code?

          Yeah, these should be removed.

          I forget: why does the postings reader/writer need to handle delta
          coding again (take an absolute boolean argument)? Was it because of
          pulsing or sep? It's fine for now (progress not perfection) ... but
          not clean, since "delta coding" is really an encoding detail so in
          theory the terms dict should "own" that ...

          Ah, yes, because of pulsing.

          This is because... PulsingPostingsBase is more than a PostingsBaseFormat.
          It somewhat acts like a term dict, e.g. it needs to understand how terms are
          structured in one block (term No.1 uses an absolute value, term No.x uses a delta value),
          then decide how to restructure the inlined and wrapped blocks (No.1 still uses an absolute value,
          but the first non-pulsed term will need absolute encoding as well).

          Without the 'absolute' argument, the real term dictionary would do the delta encoding itself;
          then PulsingPostingsBase would be confused, and every wrapped PostingsBase would have to encode
          metadata values without delta encoding.

          The new .smy file for Pulsing is sort of strange ... but necessary
          since it always uses 0 longs, so we have to store this somewhere
          ... you could put it into FieldInfo attributes instead?

          Yeah, it is another hairy thing... the reason is, we don't have a 'PostingsTrailer'
          for PostingsBaseFormat. Pulsing will not know the longs size for each field until
          all the fields are consumed... and it should not write those longsSizes to termsOut in close(),
          since the term dictionary uses the DirTrailer hack there. (Maybe every term dictionary
          should close postingsWriter first, then write the field summary and close itself? I'm not sure
          though.)

          Should we backport this to 4.x?

          Yeah, OK!

          ASF subversion and git services added a comment -

          Commit 1520592 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1520592 ]

          LUCENE-3069: remove impersonate codes, fix typo

          Show
          ASF subversion and git services added a comment - Commit 1520592 from Han Jiang in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520592 ] LUCENE-3069 : remove impersonate codes, fix typo
          Hide
          ASF subversion and git services added a comment -

          Commit 1520618 from Han Jiang in branch 'dev/branches/lucene3069'
          [ https://svn.apache.org/r1520618 ]

          LUCENE-3069: reuse customized TermState in PBF

          Show
          ASF subversion and git services added a comment - Commit 1520618 from Han Jiang in branch 'dev/branches/lucene3069' [ https://svn.apache.org/r1520618 ] LUCENE-3069 : reuse customized TermState in PBF
          Hide
          Han Jiang added a comment -

          I think this is ready to commit to trunk now, and I'll wait for a day or two before committing it.

          Michael McCandless added a comment -

          Thanks Han. I think we can just leave the .smy as is for now, and keep passing "boolean absolute" down. We can later improve these ...

          I think we should first land this on trunk and let jenkins chew on it for a while ... and if all seems good, then back port.

          ASF subversion and git services added a comment -

          Commit 1521173 from Han Jiang in branch 'dev/trunk'
          [ https://svn.apache.org/r1521173 ]

          LUCENE-3069: Lucene should have an entirely memory resident term dictionary

          Simon Willnauer added a comment -

          Nice one! I am happy that this one made it in, 2.5 years after opening! Great work Han!!

          ASF subversion and git services added a comment -

          Commit 1530520 from Han Jiang in branch 'dev/trunk'
          [ https://svn.apache.org/r1530520 ]

          LUCENE-3069: add CHANGES, move new postingsformats to oal.codecs

          Michael McCandless added a comment -

          Note that all the commit messages at the end of this issue (generated by Jira's FishEye plugin I think) incorrectly state that "Han Lee" committed changes here.

          This is due to an issue in FishEye with username collision ... Han Jiang's (who really committed here) apache username is "han", but in Jira that user name belongs to Han Lee, which leads to this mis-labeling.

          Here's the INFRA issue: https://issues.apache.org/jira/browse/INFRA-3243 but it's currently WONTFIX unfortunately ...

          Mark Miller added a comment -

          I think it's actually two different things - our commit messages are generated by a script that subscribes to svnpubsub, and it does some lookups to figure out the right user name. The FishEye stuff is what you see if you look at the Source tab, I think. So it might be easier to fix, since I think it's in our control (INFRA's anyway).

          https://svn.apache.org/repos/infra/infrastructure/trunk/projects/svngit2jira/

          Michael McCandless added a comment -

          Thanks Mark.

          Yes, you can see FishEye's comments under the Source tab and also the All tab.

          "Our" (svnpubsub) commit messages are correct here (they say "Han Jiang"), but the FishEye comments are incorrect (they say "Han Lee").

          Mark Miller added a comment -

          Ah, sorry. I'm usually in Comments view and I took "Note that all the commit messages at the end of this issue" as referring to the ASF subversion and git services commit tags. Given past experience, I don't trust the FishEye or similar integrations. We might wake up one day and they will just be gone along with their history...

          Han Jiang added a comment -

          Thanks for catching this Mike! I wasn't quick to get that username

          Michael McCandless added a comment -

          I'd like to commit this to 4.7 as well ... I'll backport & commit soon.

          ASF subversion and git services added a comment -

          Commit 1562497 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1562497 ]

          LUCENE-3069: port fully RAM-resident terms FST dictionary implementations to 4.x

          ASF subversion and git services added a comment -

          Commit 1562498 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1562498 ]

          LUCENE-3069: move CHANGES entries under 4.7

          ASF subversion and git services added a comment -

          Commit 1562506 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1562506 ]

          LUCENE-3069: merge the back-compat indices from 4.x

          Han Jiang added a comment -

          Thanks Mike!

          ASF subversion and git services added a comment -

          Commit 1562771 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1562771 ]

          LUCENE-3069: also exclude MockRandom from this test case

          Michael McCandless added a comment -

          Woops, thanks for fixing the gsoc label Han!

          Han Jiang added a comment -

          Had to reopen it because Jira doesn't permit label changes


            People

            • Assignee:
              Han Jiang
              Reporter:
              Simon Willnauer
            • Votes:
              2
              Watchers:
              15