Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
OrdinalMap is very useful when computing top terms on a multi-index segment. However I've seen it being occasionally slow to build, which was either making facets (when the ordinals map is computed lazily) or reopen (when computed eagerly) slow. So out of curiosity, I tried to profile ordinal map building on a simple index: 10M random strings of length between 0 and 20 stored as a SORTED doc values field. The index has 19 segments. The bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef comparisons (UTF8SortedAsUnicodeComparator).
MultiTermsEnum stores sub enums in two different places:
- top: a simple array containing all enums on the current term
- queue: a queue for enums that are not exhausted yet but beyond the current term.
A non-exhausted enum is in exactly one of these data-structures. When moving to the next term, MultiTermsEnum advances all enums in top, then adds them to queue and finally, pops all enum that are on the same term back into top.
We could save reorderings of the priority queue by not removing entries from the priority queue and then calling updateTop to advance enums which are on the current term. This is already what we do for disjunctions of doc IDs in DISIPriorityQueue.
On the index described above and current trunk, building an OrdinalMap has to call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s. With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694 times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster).