Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
New, Patch Available
Description
Lucene80DocValuesProducer.TermsDict is used to lookup for either a term or a term ordinal.
Currently it does not have the optimization the FSTEnum has: to be able to continue a sequential scan from where the last lookup was in the IndexInput. For sparse lookups (when searching only a few terms or ordinal) it is not an issue. But for multiple lookups in a row this optimization could save re-scanning all the terms from the block start (since they are delat encoded).
This patch proposes the optimization.
To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:
TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term reads.
TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 82% term reads.
In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In some case, when only looking for some sparse terms, the optimization does not bring improvement, but does not penalize neither. It seems to be worth to always have it.
Attachments
Issue Links
- is duplicated by
-
LUCENE-9025 Add more efficient lookupTerm() overload to SortedSetDocValues
- Resolved
- links to