  1. Lucene - Core
  2. LUCENE-5308

explore per-dimension fixed-width ordinal encoding

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/facet
    • Labels: None
    • Lucene Fields: New

    Description

      I've been testing performance of Solr vs Lucene facets, and one area
      where Solr's "fcs" method shines (low RAM, high faceting perf) is in
      low-cardinality dimensions.

      I suspect the gains are because with the field-cache entries the ords
      are encoded in "column-stride" form, and are private to that dim (vs
      facet module's shared ord space).
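
      To make the "column-stride, private ord space" idea concrete, here
      is a minimal sketch of counting one low-cardinality, single-valued
      dimension. It is not Solr's or Lucene's actual code: the class
      ColumnStrideCounts, its count method, and the one-byte-per-doc
      layout are assumptions made for illustration only.

```java
/** Hypothetical sketch: facet counting over a column-stride ord array for one dimension. */
final class ColumnStrideCounts {

  /**
   * @param ords        one byte per document: the document's ordinal within this
   *                    dimension's private ord space (assumes at most 256 labels)
   * @param cardinality number of unique labels in this dimension
   * @param hits        doc IDs matching the query
   * @return counts indexed by the dimension-private ordinal
   */
  static int[] count(byte[] ords, int cardinality, int[] hits) {
    int[] counts = new int[cardinality];
    for (int doc : hits) {
      // Mask to read the byte as an unsigned value in 0..255.
      counts[ords[doc] & 0xFF]++;
    }
    return counts;
  }
}
```

      Because the ords are private to the dimension, the counts array only
      needs one slot per label of that dimension, and each lookup is a
      single array read with no variable-length decoding.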

      So I thought about whether we could do something like this in the
      facet module ...

      I.e., if we know certain documents will have a specific set of
      single-valued dimensions, we can pick an encoding format for the
      per-doc byte[] "globally" for all such documents, and use private ord
      space per-dimension to improve compression.

      The basic idea is to pre-assign up-front (before the segment is
      written) which bytes belong to which dim. E.g., date takes bytes 0-1
      (<= 65536 unique labels), imageCount takes byte 2 (<= 256 unique
      labels), username takes bytes 3-5 (<= 16.8 M unique labels), etc.
      This only works for single-valued dims, and only works if all docs
      (or at least an identifiable subset?) have all dims.
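
      The byte pre-assignment can be sketched roughly as follows. This is
      not the attached patch: the class FixedWidthOrdLayout and its
      methods are hypothetical names, and big-endian byte order is an
      arbitrary choice. Each dim gets the smallest whole number of bytes
      that can hold its label count, so a dim with up to 2^24 (about
      16.8 M) labels needs three bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: fixed-width per-dimension slots inside a per-doc byte[]. */
final class FixedWidthOrdLayout {
  // dim -> {offset, width}, in the order the dims were registered
  private final Map<String, int[]> slots = new LinkedHashMap<>();
  private int bytesPerDoc = 0;

  /** Reserves the next bytes for a single-valued dim with at most maxLabels unique labels. */
  void addDim(String dim, int maxLabels) {
    if (maxLabels < 1) {
      throw new IllegalArgumentException("maxLabels must be >= 1");
    }
    int width = 1;
    // Grow the slot until 256^width distinct ords fit maxLabels (at most 4 bytes).
    while (width < 4 && (1L << (8 * width)) < maxLabels) {
      width++;
    }
    slots.put(dim, new int[] {bytesPerDoc, width});
    bytesPerDoc += width;
  }

  int bytesPerDoc() { return bytesPerDoc; }
  int offset(String dim) { return slots.get(dim)[0]; }
  int width(String dim) { return slots.get(dim)[1]; }

  /** Writes the dim-private ord into the doc's bytes, most significant byte first. */
  void encode(byte[] doc, String dim, int ord) {
    int[] slot = slots.get(dim);
    for (int i = 0; i < slot[1]; i++) {
      doc[slot[0] + i] = (byte) (ord >>> (8 * (slot[1] - 1 - i)));
    }
  }

  /** Reads the dim-private ord back from the doc's bytes. */
  int decode(byte[] doc, String dim) {
    int[] slot = slots.get(dim);
    int ord = 0;
    for (int i = 0; i < slot[1]; i++) {
      ord = (ord << 8) | (doc[slot[0] + i] & 0xFF);
    }
    return ord;
  }
}
```

      Registering date (65536 labels), imageCount (256) and username
      (2^24) in that order yields a 6-byte record per document, and every
      document must supply a value for every registered dim, which is
      where the single-valued and all-docs-have-all-dims restrictions
      come from.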

      To test this idea, I made a hacked-up prototype patch; it has tons of
      limitations so we clearly can't commit it, but I was able to test full
      wikipedia en with 7 facet dims (date, username, refCount, imageCount,
      sectionCount, subSectionCount, subSubSectionCount).

      Trunk (base) requires 181 MB of net doc values to hold the facet ords,
      while the patch requires 183 MB.

      Perf:

      Report after iter 19:
                          Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                       Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
               MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
                  OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
              HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
                   LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
                     LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
                    AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
               LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
                   MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
                    HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
                        Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
                  OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
                        Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
                     MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
                    AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
                  HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
                   AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
                 OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
                      Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
                 OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
                    OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
                     OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
                  OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
                       Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
                     OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
                  OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
                       MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
                        IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
                      HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
                       LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)
      

      I think the gains are sizable, and the increase in index size quite
      minor (in another test with fewer dims I saw the index size get a bit
      smaller) ... at least for this specific test.

      However, finding a clean solution here will be tricky...

      Attachments

        1. LUCENE-5308.patch
          11 kB
          Michael McCandless

          People

            Assignee: Unassigned
            Reporter: Michael McCandless (mikemccand)