Affects Version/s: None
Fix Version/s: None
I've been testing performance of Solr vs Lucene facets, and one area
where Solr's "fcs" method shines (low RAM, high faceting perf) is in
I suspect the gains come from the field-cache entries encoding the
ords in "column-stride" form, with ords that are private to that dim
(vs the facet module's shared ord space).
So I thought about whether we could do something like this in the
facet module ...
I.e., if we know certain documents will have a specific set of
single-valued dimensions, we can pick an encoding format for the
per-doc byte "globally" for all such documents, and use private ord
space per-dimension to improve compression.
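
To make the compression argument concrete, here is a minimal sketch (not taken from the patch) of how many whole bytes a dimension needs when its ords are private, i.e. run from 0 to uniqueLabels - 1 for that dim alone. The class name OrdWidth and the method bytesNeeded are hypothetical and exist only for illustration.

```java
// Hypothetical helper, for illustration only; not part of Lucene or the patch.
final class OrdWidth {

  /** Smallest number of whole bytes that can hold ords 0..uniqueLabels-1. */
  static int bytesNeeded(long uniqueLabels) {
    if (uniqueLabels < 1) {
      throw new IllegalArgumentException("need at least one label");
    }
    long maxOrd = uniqueLabels - 1;
    // Number of significant bits in maxOrd (0 when maxOrd == 0).
    int bits = 64 - Long.numberOfLeadingZeros(maxOrd);
    // Round up to whole bytes; even a single-label dim still takes one byte.
    return Math.max(1, (bits + 7) / 8);
  }
}
```

With a shared ord space, every ord has to be wide enough for the total label count across all dims; with private ords, a small dim such as imageCount can stay at one byte no matter how large username gets.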
The basic idea is to pre-assign up-front (before the segment is
written) which bytes belong to which dim. E.g., date takes bytes 0-1
(<= 65536 unique labels), imageCount takes byte 2 (<= 256
unique labels), username takes bytes 3-5 (<= 16.8 M unique labels),
etc. This only works for single-valued dims, and only works if all
docs (or at least an identifiable subset?) have all dims.
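
A rough sketch of the fixed layout described above, assuming each doc gets one byte[] and each dim owns a pre-assigned, contiguous byte range within it. This is not the prototype patch; FixedByteLayout, writeOrd and readOrd are hypothetical names, and the big-endian byte order is an arbitrary choice for the example.

```java
// Hypothetical sketch, for illustration only; not the code from the patch.
final class FixedByteLayout {
  private final int[] offsets; // first byte of each dim within the per-doc byte[]
  private final int[] widths;  // number of bytes owned by each dim
  private final int totalBytes;

  /** widths[i] is the number of bytes (1..4) pre-assigned to dim i. */
  FixedByteLayout(int... widths) {
    this.widths = widths.clone();
    this.offsets = new int[widths.length];
    int upto = 0;
    for (int i = 0; i < widths.length; i++) {
      if (widths[i] < 1 || widths[i] > 4) {
        throw new IllegalArgumentException("width must be 1..4 bytes");
      }
      offsets[i] = upto;
      upto += widths[i];
    }
    this.totalBytes = upto;
  }

  int bytesPerDoc() {
    return totalBytes;
  }

  /** Stores the dim-private ord into that dim's byte range (big-endian). */
  void writeOrd(byte[] doc, int dim, int ord) {
    int w = widths[dim];
    long limit = 1L << (8 * w);
    if (ord < 0 || ord >= limit) {
      throw new IllegalArgumentException("ord " + ord + " does not fit in " + w + " byte(s)");
    }
    for (int i = w - 1; i >= 0; i--) {
      doc[offsets[dim] + i] = (byte) ord;
      ord >>>= 8;
    }
  }

  /** Reads back the ord for one dim without touching the other dims' bytes. */
  int readOrd(byte[] doc, int dim) {
    int ord = 0;
    for (int i = 0; i < widths[dim]; i++) {
      ord = (ord << 8) | (doc[offsets[dim] + i] & 0xFF);
    }
    return ord;
  }
}
```

For example, new FixedByteLayout(2, 1, 3) would give date two bytes, imageCount one byte and username three bytes, so every doc takes exactly 6 bytes and a counting loop can jump straight to the offset of the dim it cares about. The flip side is what the text already notes: every doc must carry every dim, and each dim must be single-valued.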
To test this idea, I made a hacked up prototype patch; it has tons of
limitations so we clearly can't commit it, but I was able to test full
English Wikipedia with 7 facet dims (date, username, refCount, imageCount,
sectionCount, subSectionCount, subSubSectionCount).
Trunk (base) requires 181 MB of net doc values to hold the facet ords,
while the patch requires 183 MB.
I think the gains are sizable, and the increase in index size quite
minor (in another test with fewer dims I saw the index size get a bit
smaller) ... at least for this specific test.
However, finding a clean solution here will be tricky...