Lucene - Core
  1. Lucene - Core
  2. LUCENE-5819

Add block tree postings format that supports term ords

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      BlockTree is our default terms dictionary today, but it doesn't
      support term ords, which is an optional API in the postings format to
      retrieve the ordinal for the currently seek'd term, and also later
      seek by that ordinal e.g. to lookup the term.

      This can possibly be useful for e.g. faceting, and maybe at some point
      we can share the postings terms dict with the one used by sorted/set
      DV for cases when app wants to invert and facet on a given field.

      The older (3.x) block terms dict can easily support ords, and we have
      a Lucene41OrdsPF in test-framework, but it's not as fast / compact as
      block-tree, and doesn't (can't easily) implement an optimized
      intersect, but it could be for fields we'd want to facet on, these
      tradeoffs don't matter. It's nice to have options...

      1. LUCENE-5819.patch
        231 kB
        Michael McCandless
      2. LUCENE-5819.patch
        231 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch.

        I put the new terms dict in lucene/codecs, tied it into MockRandomPF,
        and improved BasePostingsFormatTestCase to test terms ords when they
        are supported. Tests seem to be passing, even test-core with -Dtests.pf=OrdsLucene41.

        I also made a change to CheckIndex, which may be controversial, to
        have a "fail fast" option so it just throws the first exception it
        hits. I found this really useful when debugging because I could
        immediately see the exception causing a failure vs. the default
        behavior of CheckIndex to keep trying on the next segment. If people
        disagree w/ this I can separate it out / revert it..

        Also I realized no classes in the new IDVPF are in fact public
        (woops!); I'll commit that separately.

        Show
        Michael McCandless added a comment - Patch. I put the new terms dict in lucene/codecs, tied it into MockRandomPF, and improved BasePostingsFormatTestCase to test terms ords when they are supported. Tests seem to be passing, even test-core with -Dtests.pf=OrdsLucene41. I also made a change to CheckIndex, which may be controversial, to have a "fail fast" option so it just throws the first exception it hits. I found this really useful when debugging because I could immediately see the exception causing a failure vs. the default behavior of CheckIndex to keep trying on the next segment. If people disagree w/ this I can separate it out / revert it.. Also I realized no classes in the new IDVPF are in fact public (woops!); I'll commit that separately.
        Hide
        Michael McCandless added a comment -

        I ran a quick perf test of Lucene41 vs OrdsLucene41, on wikimediumall:

        Report after iter 19:
                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                        PKLookup      153.33      (8.7%)      131.17      (8.5%)  -14.4% ( -29% -    3%)
                         Respell       35.40      (5.4%)       31.41      (7.9%)  -11.3% ( -23% -    2%)
                      AndHighLow      241.05      (3.3%)      224.00     (14.7%)   -7.1% ( -24% -   11%)
                          Fuzzy2       69.73      (6.3%)       65.30      (5.5%)   -6.3% ( -17% -    5%)
                          Fuzzy1       44.32      (9.4%)       41.90     (11.8%)   -5.5% ( -24% -   17%)
                         LowTerm      313.68      (2.4%)      296.93     (10.8%)   -5.3% ( -18% -    8%)
                        Wildcard       39.40      (5.7%)       37.35      (9.7%)   -5.2% ( -19% -   10%)
                          IntNRQ        3.57      (9.3%)        3.41     (14.5%)   -4.6% ( -26% -   21%)
                 MedSloppyPhrase        4.98      (3.3%)        4.76     (12.7%)   -4.4% ( -19% -   12%)
                       MedPhrase        6.18      (3.8%)        5.95     (13.1%)   -3.7% ( -19% -   13%)
                        HighTerm       27.78      (5.8%)       26.75     (10.1%)   -3.7% ( -18% -   12%)
                     AndHighHigh       13.51      (2.0%)       13.02      (9.9%)   -3.6% ( -15% -    8%)
                 LowSloppyPhrase      134.71      (3.3%)      130.50     (12.1%)   -3.1% ( -17% -   12%)
                         Prefix3        8.88      (9.7%)        8.65     (15.6%)   -2.7% ( -25% -   25%)
                       LowPhrase       49.67      (3.1%)       48.38     (11.4%)   -2.6% ( -16% -   12%)
                         MedTerm      117.97      (4.5%)      115.01      (6.9%)   -2.5% ( -13% -    9%)
                      HighPhrase        7.87      (6.0%)        7.73     (13.3%)   -1.8% ( -19% -   18%)
                    HighSpanNear        4.68      (6.6%)        4.61     (14.7%)   -1.4% ( -21% -   21%)
                      AndHighMed       49.48      (1.6%)       48.95      (5.0%)   -1.1% (  -7% -    5%)
                     LowSpanNear       23.70      (4.6%)       23.55     (10.4%)   -0.7% ( -14% -   15%)
                HighSloppyPhrase        5.90      (4.4%)        5.87     (11.2%)   -0.5% ( -15% -   15%)
                    OrNotHighLow       36.90     (12.3%)       37.07     (12.9%)    0.5% ( -22% -   29%)
                      OrHighHigh        4.16     (15.2%)        4.19     (16.7%)    0.8% ( -27% -   38%)
                   OrHighNotHigh       11.86     (13.8%)       11.98     (18.4%)    0.9% ( -27% -   38%)
                     MedSpanNear        4.32      (5.3%)        4.39     (10.7%)    1.5% ( -13% -   18%)
                    OrHighNotMed       26.10     (14.7%)       26.60     (12.8%)    1.9% ( -22% -   34%)
                    OrHighNotLow       19.61     (15.8%)       20.08     (13.9%)    2.4% ( -23% -   38%)
                    OrNotHighMed       13.84     (15.9%)       14.19     (16.7%)    2.6% ( -25% -   41%)
                       OrHighMed       27.09     (18.5%)       27.87     (19.4%)    2.9% ( -29% -   50%)
                       OrHighLow       36.24     (15.4%)       37.42     (15.3%)    3.2% ( -23% -   40%)
                   OrNotHighHigh        9.70     (16.6%)       10.11     (15.5%)    4.2% ( -23% -   43%)
        

        Net/net the terms-dict heavy operations (PKLookup, respell, fuzzy,
        maybe IntNRQ) take some hit, since there is added cost to decode
        ordinals from the FST; I think the other changes are likely noise.

        Also, the net terms index (size of FSTs that are loaded into RAM,
        *.tip/*.tipo) grew from 31M to 46M (~48% larger)...

        Show
        Michael McCandless added a comment - I ran a quick perf test of Lucene41 vs OrdsLucene41, on wikimediumall: Report after iter 19: Task QPS base StdDev QPS comp StdDev Pct diff PKLookup 153.33 (8.7%) 131.17 (8.5%) -14.4% ( -29% - 3%) Respell 35.40 (5.4%) 31.41 (7.9%) -11.3% ( -23% - 2%) AndHighLow 241.05 (3.3%) 224.00 (14.7%) -7.1% ( -24% - 11%) Fuzzy2 69.73 (6.3%) 65.30 (5.5%) -6.3% ( -17% - 5%) Fuzzy1 44.32 (9.4%) 41.90 (11.8%) -5.5% ( -24% - 17%) LowTerm 313.68 (2.4%) 296.93 (10.8%) -5.3% ( -18% - 8%) Wildcard 39.40 (5.7%) 37.35 (9.7%) -5.2% ( -19% - 10%) IntNRQ 3.57 (9.3%) 3.41 (14.5%) -4.6% ( -26% - 21%) MedSloppyPhrase 4.98 (3.3%) 4.76 (12.7%) -4.4% ( -19% - 12%) MedPhrase 6.18 (3.8%) 5.95 (13.1%) -3.7% ( -19% - 13%) HighTerm 27.78 (5.8%) 26.75 (10.1%) -3.7% ( -18% - 12%) AndHighHigh 13.51 (2.0%) 13.02 (9.9%) -3.6% ( -15% - 8%) LowSloppyPhrase 134.71 (3.3%) 130.50 (12.1%) -3.1% ( -17% - 12%) Prefix3 8.88 (9.7%) 8.65 (15.6%) -2.7% ( -25% - 25%) LowPhrase 49.67 (3.1%) 48.38 (11.4%) -2.6% ( -16% - 12%) MedTerm 117.97 (4.5%) 115.01 (6.9%) -2.5% ( -13% - 9%) HighPhrase 7.87 (6.0%) 7.73 (13.3%) -1.8% ( -19% - 18%) HighSpanNear 4.68 (6.6%) 4.61 (14.7%) -1.4% ( -21% - 21%) AndHighMed 49.48 (1.6%) 48.95 (5.0%) -1.1% ( -7% - 5%) LowSpanNear 23.70 (4.6%) 23.55 (10.4%) -0.7% ( -14% - 15%) HighSloppyPhrase 5.90 (4.4%) 5.87 (11.2%) -0.5% ( -15% - 15%) OrNotHighLow 36.90 (12.3%) 37.07 (12.9%) 0.5% ( -22% - 29%) OrHighHigh 4.16 (15.2%) 4.19 (16.7%) 0.8% ( -27% - 38%) OrHighNotHigh 11.86 (13.8%) 11.98 (18.4%) 0.9% ( -27% - 38%) MedSpanNear 4.32 (5.3%) 4.39 (10.7%) 1.5% ( -13% - 18%) OrHighNotMed 26.10 (14.7%) 26.60 (12.8%) 1.9% ( -22% - 34%) OrHighNotLow 19.61 (15.8%) 20.08 (13.9%) 2.4% ( -23% - 38%) OrNotHighMed 13.84 (15.9%) 14.19 (16.7%) 2.6% ( -25% - 41%) OrHighMed 27.09 (18.5%) 27.87 (19.4%) 2.9% ( -29% - 50%) OrHighLow 36.24 (15.4%) 37.42 (15.3%) 3.2% ( -23% - 40%) OrNotHighHigh 9.70 (16.6%) 10.11 (15.5%) 4.2% ( -23% - 43%) Net/net the terms-dict heavy operations (PKLookup, respell, fuzzy, maybe IntNRQ) take some hit, since there is added cost to decode ordinals from the FST; I think the other changes are likely noise. Also, the net terms index (size of FSTs that are loaded into RAM, *.tip/*.tipo) grew from 31M to 46M (~48% larger)...
        Hide
        Michael McCandless added a comment -

        New patch, fixes last nocommit, fixes ant precommit ... I think it's ready.

        Show
        Michael McCandless added a comment - New patch, fixes last nocommit, fixes ant precommit ... I think it's ready.
        Hide
        Michael McCandless added a comment -

        The gist of the change here is that the terms index FST, via a new
        custom Outputs impl FSTOrdsOutputs, now also stores the start and end
        ord range for each block. The end ord is also necessary because the
        terms don't neatly fall into just the leaf blocks: "straggler" terms
        can easily fall inside inner blocks, and in this case we need the end
        ord of the lower blocks to realize the term is a "straggler".

        The on-disk blocks themselves are nearly the same; the only difference
        is when a block writes a pointer to a sub-block, it now also writes
        (vlong) how many terms are in that sub-block. This way when we are
        seeking by ord and skip that sub-block we know how many ords were just
        skipped.

        I made a custom getByOutput to handle the ranges, falling back to the
        last range that included the target ord while recursing.

        Otherwise the terms dict is basically the same as the normal block
        tree, including optimized intersect (w/o ord() implemented: not sure
        we need it), except all seek/next operations also compute the term
        ord. Floor blocks also store the term ord each one starts on.

        Show
        Michael McCandless added a comment - The gist of the change here is that the terms index FST, via a new custom Outputs impl FSTOrdsOutputs, now also stores the start and end ord range for each block. The end ord is also necessary because the terms don't neatly fall into just the leaf blocks: "straggler" terms can easily fall inside inner blocks, and in this case we need the end ord of the lower blocks to realize the term is a "straggler". The on-disk blocks themselves are nearly the same; the only difference is when a block writes a pointer to a sub-block, it now also writes (vlong) how many terms are in that sub-block. This way when we are seeking by ord and skip that sub-block we know how many ords were just skipped. I made a custom getByOutput to handle the ranges, falling back to the last range that included the target ord while recursing. Otherwise the terms dict is basically the same as the normal block tree, including optimized intersect (w/o ord() implemented: not sure we need it), except all seek/next operations also compute the term ord. Floor blocks also store the term ord each one starts on.
        Hide
        Robert Muir added a comment -

        Why is FSTOrdsOutput just treating the rest of the output as a byte array? Wont this be ineffective here (e.g. we have pointer to block in terms dict, what else is in there?) Do we really need a generic solution or should this just have its own output geared at what it does?

        Show
        Robert Muir added a comment - Why is FSTOrdsOutput just treating the rest of the output as a byte array? Wont this be ineffective here (e.g. we have pointer to block in terms dict, what else is in there?) Do we really need a generic solution or should this just have its own output geared at what it does?
        Hide
        Michael McCandless added a comment -

        Do we really need a generic solution or should this just have its own output geared at what it does?

        Well, really I just preserved that from the current block tree impl. That byte[] has the pointer to the block, plus a couple flags (isLeafBlock/isFloorBlock) and then any "floor block" index data (file pointer offsets to get to the floor blocks). I agree at Outputs impl that broke these things out would be nicer ... but I think this should probably be done separately.

        Show
        Michael McCandless added a comment - Do we really need a generic solution or should this just have its own output geared at what it does? Well, really I just preserved that from the current block tree impl. That byte[] has the pointer to the block, plus a couple flags (isLeafBlock/isFloorBlock) and then any "floor block" index data (file pointer offsets to get to the floor blocks). I agree at Outputs impl that broke these things out would be nicer ... but I think this should probably be done separately.
        Hide
        ASF subversion and git services added a comment -

        Commit 1612080 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1612080 ]

        LUCENE-5819: add terms dict and postings format that implement term ordinals

        Show
        ASF subversion and git services added a comment - Commit 1612080 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1612080 ] LUCENE-5819 : add terms dict and postings format that implement term ordinals
        Hide
        ASF subversion and git services added a comment -

        Commit 1612213 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1612213 ]

        LUCENE-5819: add terms dict and postings format that implement term ordinals

        Show
        ASF subversion and git services added a comment - Commit 1612213 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1612213 ] LUCENE-5819 : add terms dict and postings format that implement term ordinals
        Hide
        ASF subversion and git services added a comment -

        Commit 1612217 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1612217 ]

        LUCENE-5819: fix ord bug; add test case; remove dead code

        Show
        ASF subversion and git services added a comment - Commit 1612217 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1612217 ] LUCENE-5819 : fix ord bug; add test case; remove dead code

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development