[LUCENE-5308] explore per-dimension fixed-width ordinal encoding - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/facet
Labels: None

Lucene Fields: New

Description

I've been testing performance of Solr vs Lucene facets, and one area
where Solr's "fcs" method shines (low RAM, high faceting perf) is in
low-cardinality dimensions.

I suspect the gains are because with the field-cache entries the ords
are encoded in "column-stride" form, and are private to that dim (vs
facet module's shared ord space).

So I thought about whether we could do something like this in the
facet module ...

I.e., if we know certain documents will have a specific set of
single-valued dimensions, we can pick an encoding format for the
per-doc byte[] "globally" for all such documents, and use private ord
space per-dimension to improve compression.

The basic idea is to assign up front (before the segment is written)
which bytes belong to which dim. E.g., date takes bytes 0-1 (<= 65536
unique labels), imageCount takes byte 2 (<= 256 unique labels),
username takes bytes 3-5 (<= 16.8 M unique labels), etc. This only
works for single-valued dims, and only works if all docs (or at least
an identifiable subset?) have all dims.
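A minimal sketch of what such a layout could look like, in plain Java
with no Lucene dependencies. This is not the code in the attached
patch; the class name FixedWidthOrds and its methods are hypothetical,
and the sketch assumes every doc carries exactly one non-negative int
ordinal per dim, stored big-endian in a width derived from that dim's
maximum label count.

```java
// Hypothetical illustration only; not the prototype patch and not a Lucene API.
class FixedWidthOrds {
  public final int[] offsets;   // first byte owned by each dim
  public final int[] widths;    // number of bytes owned by each dim
  public final int bytesPerDoc; // total size of the per-doc byte[]

  // maxLabelsPerDim[i] is the assumed upper bound on unique labels in dim i.
  public FixedWidthOrds(long[] maxLabelsPerDim) {
    int n = maxLabelsPerDim.length;
    offsets = new int[n];
    widths = new int[n];
    int upto = 0;
    for (int i = 0; i < n; i++) {
      offsets[i] = upto;
      widths[i] = bytesNeeded(maxLabelsPerDim[i]);
      upto += widths[i];
    }
    bytesPerDoc = upto;
  }

  // Smallest byte count (1..4) able to hold ordinals 0 .. maxLabels-1.
  public static int bytesNeeded(long maxLabels) {
    int bytes = 1;
    while (bytes < 4 && (1L << (8 * bytes)) < maxLabels) {
      bytes++;
    }
    return bytes;
  }

  // Packs one private (per-dim) ordinal per dimension into a single byte[].
  public byte[] encode(int[] ords) {
    byte[] buf = new byte[bytesPerDoc];
    for (int dim = 0; dim < ords.length; dim++) {
      int ord = ords[dim];
      // Write the low byte last so the value is stored big-endian.
      for (int b = widths[dim] - 1; b >= 0; b--) {
        buf[offsets[dim] + b] = (byte) ord;
        ord >>>= 8;
      }
    }
    return buf;
  }

  // Reads back the ordinal of one dim without touching the other dims' bytes.
  public int decode(byte[] buf, int dim) {
    int ord = 0;
    for (int b = 0; b < widths[dim]; b++) {
      ord = (ord << 8) | (buf[offsets[dim] + b] & 0xFF);
    }
    return ord;
  }

  public static void main(String[] args) {
    // date, imageCount, username label bounds taken from the example above.
    FixedWidthOrds layout = new FixedWidthOrds(new long[] {65536, 256, 16777216});
    byte[] doc = layout.encode(new int[] {40000, 7, 1234567});
    System.out.println("bytes per doc: " + layout.bytesPerDoc);
    for (int dim = 0; dim < 3; dim++) {
      System.out.println("dim " + dim + " ord: " + layout.decode(doc, dim));
    }
  }
}
```

Because each dim's offset and width are fixed for all docs, a counting
loop can read a dim's ord with a handful of byte loads and no
variable-length decoding, which is presumably where a fixed-width
scheme would save time.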

To test this idea, I made a hacked-up prototype patch; it has tons of
limitations so we clearly can't commit it, but I was able to test full
wikipedia en with 7 facet dims (date, username, refCount, imageCount,
sectionCount, subSectionCount, subSubSectionCount).

Trunk (base) requires 181 MB of net doc values to hold the facet ords,
while the patch requires 183 MB.

Perf:

Report after iter 19:
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
         MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
            OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
        HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
             LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
               LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
              AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
         LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
             MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
              HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
                  Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
            OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
                  Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
               MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
              AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
            HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
             AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
           OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
                Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
           OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
              OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
               OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
            OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
                 Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
               OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
            OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
                 MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
                  IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
                HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
                 LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)

I think the gains are sizable, and the increase in index size quite
minor (in another test with fewer dims I saw the index size get a bit
smaller) ... at least for this specific test.

However, finding a clean solution here will be tricky...

Attachments

LUCENE-5308.patch
26/Oct/13 18:27
11 kB
Michael McCandless
