[LUCENE-8836] Optimize DocValues TermsDict to continue scanning from the last position when possible - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.2
Component/s: None
Labels:
- docValues
- optimization

Lucene Fields: New, Patch Available

Description

Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.

Currently it lacks an optimization that FSTEnum has: the ability to continue a sequential scan from the position of the last lookup in the IndexInput. For sparse lookups (when searching for only a few terms or ordinals) this is not an issue. But for multiple lookups in a row, this optimization could avoid re-scanning all the terms from the block start (which is otherwise required because the terms are delta encoded).

This patch proposes the optimization.
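The idea can be illustrated with a small, self-contained sketch. This is not the code from the patch or Lucene's actual TermsDict: `ToyTermsDict`, `lookupOrd`, `BLOCK_SIZE` and the two counters are hypothetical names, and the in-memory arrays stand in for the IndexInput. It assumes terms are stored in blocks where each term records only the prefix length shared with the previous term plus its suffix, so decoding ordinal N normally requires decoding every earlier term in the block.

```java
import java.util.List;

/** Toy prefix-compressed terms dictionary (illustrative only, not Lucene's TermsDict). */
final class ToyTermsDict {
    static final int BLOCK_SIZE = 16; // hypothetical block size

    private final int[] prefixLengths; // length of the prefix shared with the previous term
    private final String[] suffixes;   // remaining characters of each term

    private int currentOrd = -1;       // ordinal of the last decoded term
    private String currentTerm = "";   // last decoded term
    long seeks = 0;                    // jumps back to a block start
    long termReads = 0;                // terms decoded

    ToyTermsDict(List<String> sortedTerms) {
        int n = sortedTerms.size();
        prefixLengths = new int[n];
        suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            String term = sortedTerms.get(i);
            // The first term of each block is stored in full.
            int shared = (i % BLOCK_SIZE == 0) ? 0 : commonPrefix(sortedTerms.get(i - 1), term);
            prefixLengths[i] = shared;
            suffixes[i] = term.substring(shared);
        }
    }

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Decodes the next term from the previous one. */
    private void readNext() {
        currentOrd++;
        currentTerm = currentTerm.substring(0, prefixLengths[currentOrd]) + suffixes[currentOrd];
        termReads++;
    }

    /**
     * Returns the term for the given ordinal. When continueFromLastPosition is true,
     * the scan resumes from the last decoded term if that term lies in the same block
     * and at or before the target; otherwise it restarts from the block start.
     */
    String lookupOrd(int ord, boolean continueFromLastPosition) {
        int blockStart = ord - (ord % BLOCK_SIZE);
        boolean canContinue =
            continueFromLastPosition && currentOrd >= blockStart && currentOrd <= ord;
        if (!canContinue) {
            seeks++;
            currentOrd = blockStart - 1;
            currentTerm = "";
        }
        while (currentOrd < ord) {
            readNext();
        }
        return currentTerm;
    }
}
```

In this sketch, looking up consecutive ordinals with `continueFromLastPosition` set decodes each term once, whereas the restart-from-block-start path decodes every preceding term of the block again on each lookup. A lookup that goes backwards, or into another block, falls back to a seek in both modes.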

To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:

TestLucene70DocValuesFormat - the optimization saves 24% of seeks and 15% of term reads.
TestDocValuesQueries - the optimization adds 0.7% more seeks and 0.003% more term reads.
TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% of seeks and 82% of term reads.

In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when only looking up a few sparse terms, the optimization brings no improvement, but it does not add a noticeable penalty either. It seems worthwhile to always have it enabled.

Issue Links

is duplicated by

LUCENE-9025 Add more efficient lookupTerm() overload to SortedSetDocValues

Resolved

links to

GitHub Pull Request #701

GitHub Pull Request #827

People

Assignee: Unassigned

Reporter: Bruno Roustant

Votes: 2

Watchers: 12

Dates

Created: 06/Jun/19 12:54

Updated: 22/Sep/22 17:41

Resolved: 25/Apr/22 08:21

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m