Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We have the pulsing codec, but currently it has some downsides:

      • it's very general, wrapping an arbitrary PostingsFormat and pulsing everything in the postings for an arbitrary docFreq/totalTermFreq cutoff
      • reuse is hairy: because it specializes its enums based on these cutoffs, when walking through terms (e.g. when merging) there is a lot of sophisticated logic to avoid the worst cases, where we clone IndexInputs for tons of terms.

      On the other hand, the way the 4.1 codec encodes "primary key" fields is pretty silly: we write the docStartFP vlong in the term dictionary metadata, which tells us where to seek in the .doc file to read our one lonely vint.

      I think it's worth investigating whether, in the DOCS_ONLY docFreq=1 case, we can just write the lone doc delta where we would otherwise write docStartFP.

      We can avoid the hairy reuse problem too, by just supporting this in refillDocs() in BlockDocsEnum instead of specializing.

      This would remove the additional seek for "primary key" fields without really any of the downsides of pulsing today.
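
A minimal sketch of the idea above, not taken from any of the attached patches: when docFreq is 1, the slot in the term metadata that would hold docStartFP holds the lone doc ID instead, so the reader never has to seek into the .doc file. The class `SingletonDocSketch`, the `TermMeta` holder and the `encode`/`decode` helpers are hypothetical names used only for illustration, and plain `java.io` streams stand in for Lucene's own I/O classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

final class SingletonDocSketch {
    // Variable-length encoding: 7 bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static long readVLong(ByteArrayInputStream in) {
        long v = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new IllegalStateException("unexpected end of input");
            }
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    /** Hypothetical decoded term metadata. */
    static final class TermMeta {
        long docStartFP = -1;     // -1: no seek into the .doc file is needed
        int singletonDocID = -1;  // -1: term is not inlined
    }

    // Writer side: one slot, whose meaning depends on docFreq.
    static void encode(ByteArrayOutputStream out, int docFreq,
                       long docStartFP, int singletonDocID) {
        if (docFreq == 1) {
            writeVLong(out, singletonDocID); // inline the lone doc
        } else {
            writeVLong(out, docStartFP);     // pointer into the .doc file
        }
    }

    // Reader side: docFreq is already known from the term dictionary.
    static TermMeta decode(ByteArrayInputStream in, int docFreq) {
        TermMeta meta = new TermMeta();
        long v = readVLong(in);
        if (docFreq == 1) {
            meta.singletonDocID = (int) v;
        } else {
            meta.docStartFP = v;
        }
        return meta;
    }
}
```

Because the reader already knows docFreq before it reads the slot, no extra flag byte is needed to tell the two cases apart.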

      1. LUCENE-4498_lazy.patch
        12 kB
        Robert Muir
      2. LUCENE-4498.patch
        13 kB
        Robert Muir
      3. LUCENE-4498.patch
        12 kB
        Robert Muir
      4. LUCENE-4498.patch
        10 kB
        Robert Muir
      5. LUCENE-4498.patch
        9 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        I will work on a patch after LUCENE-4497 has been reviewed... I've already conflicted myself with this PF today.

        Michael McCandless added a comment -

        +1

        Robert Muir added a comment -

        Actually, I think for the other cases (not just DOCS_ONLY) we can pulse when totalTermFreq=1, as the freq is implicit.
        We can just leave the positions and whatnot where they are.

        I'll see how ugly it is...

        Robert Muir added a comment -

        Initial patch (no file format docs yet, let's benchmark/measure first).

        All tests pass.

        Robert Muir added a comment -

        Duh, I forgot to actually not seek in the previous patch. Here's the updated patch.

        Robert Muir added a comment -

        Here is a patch with a lazy clone() of the docsenum, so that when someone isn't reusing the docsenum (e.g. when running TermQueries), they won't pay the price of NIOFS buffer reads etc. just for a primary key.
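
For illustration only (this is not the code from the attached patch), the lazy-clone idea might be sketched as below. `LazyDocsEnumSketch` is a hypothetical class, and a `Supplier<int[]>` stands in for cloning the .doc input and reading a term's postings; the point is just that a term with docFreq == 1 never triggers the clone.

```java
import java.util.function.Supplier;

final class LazyDocsEnumSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Stand-in for "clone the .doc input and read this term's postings".
    private final Supplier<int[]> cloneAndRead;
    private int[] docs = new int[0];
    private int upto;
    int clones; // how many times the expensive clone path was taken

    LazyDocsEnumSketch(Supplier<int[]> cloneAndRead) {
        this.cloneAndRead = cloneAndRead;
    }

    /** Positions the enum on a term; only non-singleton terms touch the input. */
    void reset(int docFreq, int singletonDocID) {
        upto = 0;
        if (docFreq == 1) {
            // The doc ID came from the term dictionary: no clone, no read.
            docs = new int[] { singletonDocID };
        } else {
            docs = cloneAndRead.get();
            clones++;
        }
    }

    int nextDoc() {
        return upto < docs.length ? docs[upto++] : NO_MORE_DOCS;
    }
}
```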

        Michael McCandless added a comment -

        Looks good:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell       86.70      (3.0%)       84.04      (2.6%)   -3.1% (  -8% -    2%)
                       OrHighMed       41.52      (5.8%)       40.44      (6.1%)   -2.6% ( -13% -    9%)
                       OrHighLow       25.43      (6.0%)       24.77      (6.4%)   -2.6% ( -14% -   10%)
                      OrHighHigh        9.38      (5.9%)        9.15      (6.4%)   -2.5% ( -14% -   10%)
                        Wildcard       93.94      (4.1%)       92.36      (2.0%)   -1.7% (  -7% -    4%)
                         MedTerm      211.10     (12.3%)      208.78     (13.4%)   -1.1% ( -23% -   27%)
                          IntNRQ       10.74     (11.3%)       10.62      (7.8%)   -1.1% ( -18% -   20%)
                        HighTerm       25.59     (14.0%)       25.35     (15.0%)   -1.0% ( -26% -   32%)
                     MedSpanNear       13.77      (2.3%)       13.68      (1.6%)   -0.7% (  -4% -    3%)
                HighSloppyPhrase        4.09      (5.4%)        4.07      (5.2%)   -0.5% ( -10% -   10%)
                    HighSpanNear        6.84      (2.9%)        6.81      (2.1%)   -0.4% (  -5% -    4%)
                         Prefix3       17.81      (5.7%)       17.74      (1.5%)   -0.4% (  -7% -    7%)
                          Fuzzy1       77.54      (2.5%)       77.25      (2.7%)   -0.4% (  -5% -    4%)
                      AndHighLow      719.17      (2.7%)      716.49      (2.3%)   -0.4% (  -5% -    4%)
                          Fuzzy2       68.94      (2.4%)       68.69      (2.8%)   -0.4% (  -5% -    5%)
                     LowSpanNear       12.89      (1.8%)       12.85      (1.3%)   -0.3% (  -3% -    2%)
                 MedSloppyPhrase       29.92      (3.4%)       29.85      (3.4%)   -0.2% (  -6% -    6%)
                         LowTerm      500.58      (5.9%)      500.52      (7.0%)   -0.0% ( -12% -   13%)
                 LowSloppyPhrase        9.57      (4.4%)        9.60      (4.3%)    0.4% (  -7% -    9%)
                       LowPhrase        9.64      (2.8%)        9.70      (3.0%)    0.7% (  -4% -    6%)
                      AndHighMed       86.68      (1.2%)       87.26      (1.2%)    0.7% (  -1% -    3%)
                       MedPhrase        7.07      (4.3%)        7.15      (4.6%)    1.1% (  -7% -   10%)
                      HighPhrase        4.79      (4.8%)        4.84      (5.6%)    1.1% (  -8% -   12%)
                     AndHighHigh       25.81      (1.7%)       26.20      (1.2%)    1.5% (  -1% -    4%)
                        PKLookup      193.31      (2.1%)      204.74      (1.6%)    5.9% (   2% -    9%)
        
        Robert Muir added a comment -

        This code can be simplified and generalized a bit. Basically, the condition just needs to be docFreq == 1. In that case totalTermFreq already gives the freq, so storing freq separately is redundant,
        and we can e.g. pulse a term that appears 5 times but only in one doc.

        I'll update the patch again.
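
A small sketch of that observation, with hypothetical names (`PulsedFreqSketch`, `canPulse`, `impliedFreq`) that do not come from the patch: if a term occurs in exactly one document, every occurrence counted by totalTermFreq is in that document, so the within-document freq never needs to be stored.

```java
final class PulsedFreqSketch {
    /** The generalized condition: inline whenever the term is in exactly one doc. */
    static boolean canPulse(int docFreq) {
        return docFreq == 1;
    }

    /** With a single document, all occurrences of the term are in that document. */
    static long impliedFreq(int docFreq, long totalTermFreq) {
        if (docFreq != 1) {
            throw new IllegalArgumentException("freq is only implied when docFreq == 1");
        }
        return totalTermFreq;
    }
}
```

For the example in the comment, a term that appears 5 times in a single doc has docFreq 1 and totalTermFreq 5, so `impliedFreq(1, 5)` gives a freq of 5.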

        Robert Muir added a comment -

        Here's the docFreq=1 patch. I like this a lot better. I don't think it really buys us much, but it makes the code simpler and easier to understand.

        Robert Muir added a comment -

        Patch with file format docs and comment fixes.

        I think this is ready to go.

        Michael McCandless added a comment -

        +1

        Very nice to fold pulsing into the default PF!

        Robert Muir added a comment -

        Committed to trunk. will give that flonkings builder some time...

        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1401421

        LUCENE-4498: pulse docFreq=1 in 4.1 codec


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            1
