[LUCENE-7462] Faster search APIs for doc values - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 7.0
Fix Version/s: 7.0
Component/s: None
Labels: None
Lucene Fields: New

Description

While the iterator API helps deal with sparse doc values more efficiently, it also makes search-time operations more costly. For instance, the old random-access API made it possible to compute facets on a given segment without any conditionals, by just incrementing the counter at index ordinal+1, whereas the new API requires advancing the iterator if necessary and then checking whether it is positioned exactly on the right document.
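To make the cost difference concrete, here is a minimal sketch of the two counting loops. It does not use Lucene classes: `RandomAccessOrds`, `OrdIterator`, and `iteratorOver` are hypothetical stand-ins that only mimic the shape of the old random-access API (ordinal per document, -1 when missing) and of the iterator API.

```java
import java.util.Arrays;

final class FacetCountSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for the old random-access API: -1 means "no value". */
    interface RandomAccessOrds {
        int getOrd(int docID);
    }

    /** Hypothetical stand-in for the iterator API: only docs with a value are visited. */
    interface OrdIterator {
        int docID();
        int advance(int target); // target must be greater than the current doc
        int ordValue();
    }

    /** Builds an iterator over the (sorted) docs that have a value. */
    static OrdIterator iteratorOver(final int[] docs, final int[] ords) {
        return new OrdIterator() {
            private int idx = -1;
            public int docID() {
                if (idx < 0) return -1;
                return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            }
            public int advance(int target) {
                do { idx++; } while (idx < docs.length && docs[idx] < target);
                return docID();
            }
            public int ordValue() { return ords[idx]; }
        };
    }

    /** Old style: no conditionals, slot 0 collects documents without a value. */
    static int[] countRandomAccess(RandomAccessOrds values, int[] hits, int valueCount) {
        int[] counts = new int[valueCount + 1];
        for (int doc : hits) {
            counts[values.getOrd(doc) + 1]++;
        }
        return counts;
    }

    /** Iterator style: advance if needed, then check the iterator landed on the hit. */
    static int[] countIterator(OrdIterator values, int[] hits, int valueCount) {
        int[] counts = new int[valueCount + 1];
        for (int doc : hits) {
            if (values.docID() < doc) {
                values.advance(doc);
            }
            if (values.docID() == doc) {
                counts[values.ordValue() + 1]++;
            } else {
                counts[0]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] dense = {1, -1, 0, -1, 1};   // ordinal per doc, -1 = missing
        int[] hits = {0, 1, 2, 4};         // matching docs, in increasing order
        System.out.println(Arrays.toString(countRandomAccess(doc -> dense[doc], hits, 2)));
        System.out.println(Arrays.toString(
            countIterator(iteratorOver(new int[] {0, 2, 4}, new int[] {1, 0, 1}), hits, 2)));
    }
}
```

Both loops produce the same counts; the second one pays for two branches per hit, which is the overhead described above.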

Since it is very common for fields to exist across most documents, I suspect codecs will keep an internal structure that is similar to the current codec in the dense case, by having a dense representation of the data and just making the iterator skip over the minority of documents that do not have a value.

I suggest that we add APIs that make things cheaper at search time. For instance in the case of SORTED doc values, it could look like LegacySortedDocValues with the additional restriction that documents can only be consumed in order. Codecs that can implement this API efficiently would hide it behind a SortedDocValues adapter, and then at search time facets and comparators (which liked the LegacySortedDocValues API better) would either unwrap or hide the SortedDocValues they got behind a more random-access API (which would only happen in the truly sparse case if the codec optimizes the dense case).
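One possible shape for this idea is sketched below, purely as an illustration. None of these types exist in Lucene: `InOrderOrds` is a hypothetical in-order random-access view, `DenseAdapter` is a hypothetical iterator adapter a codec could return in the dense case, and `randomAccessView` is a hypothetical search-time helper that unwraps the adapter when it can and otherwise hides the iterator behind the same view.

```java
final class UnwrapSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical in-order random-access view: docs must be requested in
     *  increasing order, and -1 means the doc has no value. */
    interface InOrderOrds {
        int getOrd(int docID);
    }

    /** Hypothetical minimal iterator API. */
    interface OrdIterator {
        int docID();
        int advance(int target);
        int ordValue();
    }

    /** Adapter exposing a dense, random-access structure through the iterator API. */
    static final class DenseAdapter implements OrdIterator {
        final InOrderOrds in;
        final int maxDoc;
        private int doc = -1;

        DenseAdapter(InOrderOrds in, int maxDoc) {
            this.in = in;
            this.maxDoc = maxDoc;
        }

        public int docID() { return doc; }

        public int advance(int target) {
            // Skip over the minority of docs that have no value.
            for (doc = target; doc < maxDoc; doc++) {
                if (in.getOrd(doc) >= 0) return doc;
            }
            return doc = NO_MORE_DOCS;
        }

        public int ordValue() { return in.getOrd(doc); }
    }

    /** Search-time helper: unwrap in the dense case, wrap in the sparse case. */
    static InOrderOrds randomAccessView(OrdIterator values) {
        if (values instanceof DenseAdapter) {
            return ((DenseAdapter) values).in; // cheap path, no per-doc positioning
        }
        // Fallback for truly sparse impls: position the iterator on demand.
        return docID -> {
            if (values.docID() < docID) {
                values.advance(docID);
            }
            return values.docID() == docID ? values.ordValue() : -1;
        };
    }
}
```

With such a helper, facets and comparators would only see `InOrderOrds`, and the branching cost would be paid only when the codec could not provide a dense structure.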

One challenge is that we already use the same idea for hiding single-valued impls behind multi-valued impls, so we would need to enforce the order in which the wrapping needs to happen. At first sight, it seems that it would be best to do the single-value-behind-multi-value-API wrapping above the random-access-behind-iterator-API wrapping. The complexity of wrapping/unwrapping in the right order could be contained in the DocValues helper class.
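The ordering constraint can be illustrated with a small sketch. All names here are hypothetical and the iterator methods are omitted: the point is only that the multi-valued wrapper sits outermost, so a helper must peel it off first before looking for the random-access structure underneath.

```java
final class WrapOrderSketch {
    /** Hypothetical random-access view, the innermost layer. */
    interface InOrderOrds {
        int getOrd(int docID);
    }

    /** Hypothetical single-valued iterator API (methods omitted for brevity). */
    interface SingleValued {}

    /** Hypothetical multi-valued iterator API (methods omitted for brevity). */
    interface MultiValued {}

    /** Inner wrapper: random-access structure hidden behind the iterator API. */
    static final class IteratorOverRandomAccess implements SingleValued {
        final InOrderOrds in;
        IteratorOverRandomAccess(InOrderOrds in) { this.in = in; }
    }

    /** Outer wrapper: single-valued impl hidden behind the multi-valued API. */
    static final class MultiOverSingle implements MultiValued {
        final SingleValued in;
        MultiOverSingle(SingleValued in) { this.in = in; }
    }

    /** Wrapping always happens in this order: random-access, then single, then multi. */
    static MultiValued wrap(InOrderOrds ords) {
        return new MultiOverSingle(new IteratorOverRandomAccess(ords));
    }

    /** Unwrapping mirrors it: outer layer first. Returns null if either layer is absent. */
    static InOrderOrds unwrap(MultiValued values) {
        if (values instanceof MultiOverSingle) {
            SingleValued single = ((MultiOverSingle) values).in;
            if (single instanceof IteratorOverRandomAccess) {
                return ((IteratorOverRandomAccess) single).in;
            }
        }
        return null;
    }
}
```

Keeping `wrap` and `unwrap` side by side in one helper is what would let callers stay unaware of the layering.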

I think this change would also simplify search-time consumption of doc values, which currently takes several lines of code to position the iterator every time it needs to do something interesting with doc values.

Attachments

- LUCENE-7462.patch, 108 kB, 20/Oct/16 13:14, Adrien Grand
- LUCENE-7462-advanceExact.patch, 11 kB, 19/Oct/16 09:31, Adrien Grand

Issue Links

is related to

- SOLR-9599 DocValues performance regression with new iterator API (Open)
- LUCENE-7407 Explore switching doc values to an iterator API (Resolved)

People

Assignee: Unassigned
Reporter: Adrien Grand
Votes: 0
Watchers: 6

Dates

Created: 23/Sep/16 12:37
Updated: 28/Aug/22 15:03
Resolved: 24/Oct/16 08:57