[LUCENE-7407] Explore switching doc values to an iterator API - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.0
Component/s: None
Labels:
- docValues

Lucene Fields:

New

Description

I think it could be compelling if we restricted doc values to use an
iterator API at read time, instead of the more general random access
API we have today:

- It would make doc values disk usage more of a "you pay for what
  you actually use" model, like postings, which is a compelling
  reduction for sparse usage.

- I think codecs could compress better and maybe speed up decoding
  of doc values, even in the non-sparse case, since the read-time
  API is a more restrictive "forward only" one instead of random access.

- We could remove getDocsWithField entirely, since that's
  implicit in the iteration, and the awkward "return 0 if the
  document didn't have this field" would go away.

- We can remove the annoying thread locals we must make today in
  CodecReader, and close the trap of "I accidentally shared a
  single XXXDocValues instance across threads", since an iterator is
  inherently "use once".

- We could maybe leverage the numerous optimizations we've done for
  postings over time, since the two problems ("iterate over doc ids
  and store something interesting for each") are very similar.
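
To make the contrast concrete, here is a minimal sketch of the two read-time styles. It is not Lucene's actual API: `SparseNumericIterator`, `DocValuesSketch`, and their methods are hypothetical names chosen only to echo postings-style iteration, and the in-memory arrays stand in for whatever a codec would really store.

```java
import java.util.BitSet;

/** Hypothetical forward-only iterator that visits only documents having a value. */
final class SparseNumericIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;  // sorted ascending; only documents that have a value
    private final long[] values; // values[i] belongs to docIds[i]
    private int index = -1;      // -1 means "not yet positioned"

    SparseNumericIterator(int[] docIds, long[] values) {
        if (docIds.length != values.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = docIds;
        this.values = values;
    }

    /** Current doc id, -1 before the first nextDoc(), NO_MORE_DOCS once exhausted. */
    int docID() {
        if (index < 0) {
            return -1;
        }
        return index < docIds.length ? docIds[index] : NO_MORE_DOCS;
    }

    /** Moves forward to the next document that has a value. */
    int nextDoc() {
        if (index < docIds.length) {
            index++;
        }
        return docID();
    }

    /** Value for the current document; only legal while positioned on a document. */
    long longValue() {
        if (index < 0 || index >= docIds.length) {
            throw new IllegalStateException("not positioned on a document");
        }
        return values[index];
    }
}

final class DocValuesSketch {
    /** Random-access style: one slot per document plus a separate "has value" bitset. */
    static long sumRandomAccess(long[] dense, BitSet docsWithField) {
        long sum = 0;
        for (int doc = 0; doc < dense.length; doc++) {
            // Without the bitset, a missing value and a real 0 are indistinguishable.
            if (docsWithField.get(doc)) {
                sum += dense[doc];
            }
        }
        return sum;
    }

    /** Iterator style: "has value" is implicit, and only such documents are visited. */
    static long sumIterator(SparseNumericIterator it) {
        long sum = 0;
        for (int doc = it.nextDoc(); doc != SparseNumericIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            sum += it.longValue();
        }
        return sum;
    }
}
```

In this sketch the iterator carries its own position, so each consumer would create a fresh instance rather than share one, which is the "use once" property mentioned above.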

This idea has come up many times in the past, e.g. ~~LUCENE-7253~~ is a recent
example, and very early iterations of doc values started with exactly
this.

However, it's a truly enormous change, likely 7.0 only. Or maybe we
could also port the new iterator APIs to 6.x, side by side with
the existing random-access APIs, which would be deprecated.

Attachments

- LUCENE-7407.patch (1.20 MB), 17/Sep/16 11:01, Michael McCandless

Issue Links

breaks

- SOLR-9599 DocValues performance regression with new iterator API (Open)
- SOLR-10596 unique and hll functions don't work after first bucket (Resolved)
- SOLR-11664 range facets with sub aggregations on string fields give incorrect results (Closed)

is related to

- SOLR-9837 Performance regression of numeric field uninversion time (Resolved)
- SOLR-9582 TestSortingResponseWriter.testSortingOutput() failure: docs were sent out-of-order (Resolved)
- LUCENE-7835 ToChildBlockJoinSortField to sort children by a parent field (Patch Available)
- SOLR-13024 ValueSourceAugmenter - avoid creating new FunctionValues per doc (Open)
- LUCENE-7253 Make sparse doc values and segments merging more efficient (Resolved)
- LUCENE-7474 Improve doc values writers (Resolved)
- LUCENE-10534 MinFloatFunction / MaxFloatFunction calls exists twice (Closed)
- LUCENE-10542 FieldSource exists implementations can avoid value retrieval (Closed)

relates to

- SOLR-9628 Trie fields have unset lastDocId (Resolved)
- LUCENE-7871 false positive match BlockJoinSelector[SortedDV] when child value is absent (Closed)
- LUCENE-7835 ToChildBlockJoinSortField to sort children by a parent field (Patch Available)
- LUCENE-7457 Default doc values format should optimize for iterator access (Resolved)
- LUCENE-5542 Explore making DVConsumer sparse-aware (Resolved)
- LUCENE-7459 LegacyNumericDocValuesWrapper should only check bits when the value is != 0 (Resolved)
- LUCENE-7461 Refactor doc values queries to better use the new doc values APIs (Resolved)
- LUCENE-7462 Faster search APIs for doc values (Resolved)
- LUCENE-7463 Create a Lucene70DocValuesFormat (Resolved)
- LUCENE-7475 Sparse norms (Resolved)
- LUCENE-7489 Improve sparsity support of Lucene70DocValuesFormat (Resolved)
- LUCENE-7460 Should SortedNumericDocValues expose a per-document random-access API? (Resolved)

Activity

People

Assignee: Michael McCandless

Reporter: Michael McCandless

Votes: 1

Watchers: 12

Dates

Created: 05/Aug/16 17:23

Updated: 28/Aug/22 15:01

Resolved: 21/Sep/16 13:42