Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I've been testing performance of Solr vs Lucene facets, and one area
where Solr's "fcs" method shines (low RAM, high faceting perf) is in
low-cardinality dimensions.
I suspect the gains come from how the field-cache entries store the
ords: they are encoded in "column-stride" form and are private to each
dimension (vs the facet module's shared ord space).
So I thought about whether we could do something like this in the
facet module ...
I.e., if we know certain documents will have a specific set of
single-valued dimensions, we can pick an encoding format for the
per-doc byte[] "globally" for all such documents, and use private ord
space per-dimension to improve compression.
The basic idea is to pre-assign up-front (before the segment is
written) which bytes belong to which dim. E.g., date takes bytes 0-1
(at most 65,536 unique labels), imageCount takes byte 2 (at most 256
unique labels), username takes bytes 3-6 (at most 16.8 M unique
labels), etc. This only works for single-valued dims, and only works
if all docs (or at least an identifiable subset?) have all dims.
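The prototype patch isn't shown here, so as a rough sketch only (class
and method names are mine, not the patch's): each dim gets a private
ord space, is assigned the minimum number of whole bytes needed for
its label count, and every doc's ords are packed into one fixed-width
byte[] at those pre-assigned offsets:

```java
// Hypothetical sketch: pack one ord per single-valued dim into a
// fixed-width per-doc byte[], with a private ord space per dim.
public class PackedDimEncoder {
  private final int[] widths;   // bytes reserved for each dim
  private final int[] offsets;  // starting byte of each dim
  private final int totalBytes;

  /** labelCounts[i] = number of unique labels in dim i. */
  public PackedDimEncoder(int[] labelCounts) {
    widths = new int[labelCounts.length];
    offsets = new int[labelCounts.length];
    int off = 0;
    for (int i = 0; i < labelCounts.length; i++) {
      // Smallest whole-byte width that can hold ords 0..labelCounts[i]-1:
      int w = 1;
      long capacity = 256;
      while (capacity < labelCounts[i]) {
        capacity <<= 8;
        w++;
      }
      widths[i] = w;
      offsets[i] = off;
      off += w;
    }
    totalBytes = off;
  }

  /** Encode one ord per dim into the doc's fixed-width byte[]. */
  public byte[] encode(int[] ords) {
    byte[] out = new byte[totalBytes];
    for (int dim = 0; dim < ords.length; dim++) {
      for (int b = 0; b < widths[dim]; b++) {
        out[offsets[dim] + b] = (byte) (ords[dim] >>> (8 * b));
      }
    }
    return out;
  }

  /** Decode one dim's ord without touching the other dims' bytes. */
  public int decode(byte[] doc, int dim) {
    int v = 0;
    for (int b = 0; b < widths[dim]; b++) {
      v |= (doc[offsets[dim] + b] & 0xFF) << (8 * b);
    }
    return v;
  }
}
```

Since every doc uses the same layout, counting during faceting can
read a dim's ord at a fixed offset with no per-doc decode of the other
dims, which is the column-stride-like access pattern described above.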
To test this idea, I made a hacked-up prototype patch; it has tons of
limitations so we clearly can't commit it, but I was able to test full
wikipedia en with seven facet dims (date, username, refCount,
imageCount, sectionCount, subSectionCount, subSubSectionCount).
Trunk (base) requires 181 MB of net doc values to hold the facet ords,
while the patch requires 183 MB.
Perf:
Report after iter 19:

                Task    QPS base  StdDev    QPS comp  StdDev      Pct diff
             Respell       54.30  (3.1%)      54.02   (2.7%)  -0.5% ( -6% -  5%)
     MedSloppyPhrase        3.58  (5.6%)       3.60   (6.0%)   0.6% (-10% - 12%)
        OrNotHighLow       63.58  (6.8%)      64.03   (6.9%)   0.7% (-12% - 15%)
    HighSloppyPhrase        3.80  (7.4%)       3.84   (7.1%)   1.1% (-12% - 16%)
         LowSpanNear        8.93  (3.5%)       9.09   (4.6%)   1.8% ( -6% - 10%)
           LowPhrase       12.15  (6.4%)      12.43   (7.2%)   2.3% (-10% - 17%)
          AndHighLow      402.54  (1.4%)     425.23   (2.3%)   5.6% (  1% -  9%)
     LowSloppyPhrase       39.53  (1.6%)      42.01   (1.9%)   6.3% (  2% -  9%)
         MedSpanNear       26.54  (2.8%)      28.39   (3.6%)   7.0% (  0% - 13%)
          HighPhrase        4.01  (8.1%)       4.30   (9.7%)   7.4% ( -9% - 27%)
              Fuzzy2       44.01  (2.3%)      47.43   (1.8%)   7.8% (  3% - 12%)
        OrNotHighMed       32.64  (4.7%)      35.22   (5.5%)   7.9% ( -2% - 19%)
              Fuzzy1       62.24  (2.1%)      67.35   (1.9%)   8.2% (  4% - 12%)
           MedPhrase      129.06  (4.9%)     141.14   (6.2%)   9.4% ( -1% - 21%)
          AndHighMed       27.71  (0.7%)      30.32   (1.1%)   9.4% (  7% - 11%)
        HighSpanNear        5.15  (3.5%)       5.63   (4.2%)   9.5% (  1% - 17%)
         AndHighHigh       24.98  (0.7%)      27.89   (1.1%)  11.7% (  9% - 13%)
       OrNotHighHigh       15.13  (2.0%)      17.90   (2.6%)  18.3% ( 13% - 23%)
            Wildcard        9.06  (1.4%)      10.85   (2.6%)  19.8% ( 15% - 24%)
       OrHighNotHigh        8.84  (1.8%)      10.64   (2.6%)  20.3% ( 15% - 25%)
          OrHighHigh        3.73  (1.6%)       4.51   (2.4%)  20.9% ( 16% - 25%)
           OrHighLow        5.22  (1.5%)       6.34   (2.5%)  21.4% ( 17% - 25%)
        OrHighNotLow        8.94  (1.6%)      10.95   (2.5%)  22.5% ( 18% - 26%)
             Prefix3       27.61  (1.2%)      33.90   (2.3%)  22.8% ( 19% - 26%)
           OrHighMed       11.72  (1.6%)      14.56   (2.3%)  24.3% ( 20% - 28%)
        OrHighNotMed       14.74  (1.5%)      18.34   (2.2%)  24.5% ( 20% - 28%)
             MedTerm       26.37  (1.2%)      32.85   (2.7%)  24.6% ( 20% - 28%)
              IntNRQ        2.61  (1.2%)       3.25   (3.0%)  24.7% ( 20% - 29%)
            HighTerm       19.69  (1.3%)      25.33   (3.0%)  28.7% ( 23% - 33%)
             LowTerm      131.50  (1.3%)     170.49   (3.0%)  29.7% ( 25% - 34%)
I think the gains are sizable, and the increase in index size quite
minor (in another test with fewer dims I even saw the index size get a
bit smaller) ... at least for this specific test.
However, finding a clean solution here will be tricky...