[LUCENE-10033] Encode doc values in smaller blocks of values, like postings - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Lucene Fields: New

Description

This is a follow-up to the discussion on this thread: https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.

Our current approach for doc values uses large blocks of 16k values, where each value can be decoded independently using DirectWriter/DirectReader. This is inefficient in some cases: a single outlier increases the number of bits per value for the entire block, and we can't easily use run-length compression. It also encourages a different sub-class for every compression technique, which puts pressure on the JVM.
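To make the outlier cost concrete, here is a minimal sketch that compares the payload size of one 16k block against 128-value blocks when a single value is much larger than the rest. It assumes non-negative values and plain fixed-width bit packing, and it ignores per-block metadata and any rounding of bit widths that a real writer may apply. The class and method names (`OutlierCost`, `bitsPerValue`, `totalBits`, `sampleWithOutlier`) are hypothetical and are not Lucene APIs.

```java
public class OutlierCost {
    /** Bits needed to store every value in [from, to) with one fixed width (values assumed non-negative). */
    public static int bitsPerValue(long[] values, int from, int to) {
        long max = 0;
        for (int i = from; i < to; i++) {
            max = Math.max(max, values[i]);
        }
        // Use at least 1 bit for simplicity, even when all values are 0.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    /** Total payload bits when each block of blockSize values gets its own width. */
    public static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int from = 0; from < values.length; from += blockSize) {
            int to = Math.min(values.length, from + blockSize);
            total += (long) bitsPerValue(values, from, to) * (to - from);
        }
        return total;
    }

    /** 16384 small values (4 bits each) plus one outlier that needs 41 bits. */
    public static long[] sampleWithOutlier() {
        long[] values = new long[16384];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16;
        }
        values[5000] = 1L << 40;
        return values;
    }

    public static void main(String[] args) {
        long[] values = sampleWithOutlier();
        System.out.println("one 16384-value block: " + totalBits(values, 16384) + " bits"); // 671744
        System.out.println("128-value blocks:      " + totalBits(values, 128) + " bits");   // 70272
    }
}
```

With one large block, the outlier forces all 16384 values to 41 bits. With 128-value blocks, only the block that contains the outlier pays for it, and the other 127 blocks stay at 4 bits per value.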

We'd like to move to an approach more similar to postings: smaller blocks (e.g. 128 values) whose values are all decompressed at once (using SIMD instructions), with skip data within blocks to efficiently skip to arbitrary doc IDs (or maybe still jump tables, as doc values use today and as discussed here for postings: https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
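The idea can be sketched roughly as follows: values are packed in 128-value blocks, each with its own bit width, a jump table records where each block starts, and a lookup decodes the whole block into a buffer before indexing into it. This is an illustrative sketch only, not the code from the linked pull request. It assumes the dense case (every doc has a value), keeps everything in memory, and uses a plain scalar decode loop rather than explicit SIMD. The class `BlockedValues` and all of its members are hypothetical names.

```java
public class BlockedValues {
    static final int BLOCK_SIZE = 128;

    private final long[] data;       // packed bits of all blocks, back to back
    private final int[] blockStart;  // jump table: index in data of each block's first word
    private final int[] blockBits;   // bits per value chosen for each block
    private final int size;

    private final long[] buffer = new long[BLOCK_SIZE]; // most recently decoded block
    private int bufferedBlock = -1;

    public BlockedValues(long[] values) {
        size = values.length;
        int numBlocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        blockStart = new int[numBlocks + 1];
        blockBits = new int[numBlocks];
        // First pass: choose a width per block and fill the jump table.
        for (int b = 0; b < numBlocks; b++) {
            int from = b * BLOCK_SIZE;
            int to = Math.min(size, from + BLOCK_SIZE);
            long or = 0;
            for (int i = from; i < to; i++) {
                or |= values[i];
            }
            blockBits[b] = Math.max(1, 64 - Long.numberOfLeadingZeros(or));
            int words = ((to - from) * blockBits[b] + 63) / 64;
            blockStart[b + 1] = blockStart[b] + words;
        }
        data = new long[blockStart[numBlocks]];
        // Second pass: write each value at its bit offset within its block.
        for (int i = 0; i < size; i++) {
            int b = i / BLOCK_SIZE;
            int bits = blockBits[b];
            int bitPos = (i % BLOCK_SIZE) * bits;
            int word = blockStart[b] + (bitPos >>> 6);
            int shift = bitPos & 63;
            data[word] |= values[i] << shift;
            if (shift + bits > 64) { // value straddles two words
                data[word + 1] |= values[i] >>> (64 - shift);
            }
        }
    }

    /** Decodes every value of block b into the buffer in one pass. */
    private void decodeBlock(int b) {
        int bits = blockBits[b];
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        int count = Math.min(BLOCK_SIZE, size - b * BLOCK_SIZE);
        for (int i = 0; i < count; i++) {
            int bitPos = i * bits;
            int word = blockStart[b] + (bitPos >>> 6);
            int shift = bitPos & 63;
            long v = data[word] >>> shift;
            if (shift + bits > 64) {
                v |= data[word + 1] << (64 - shift);
            }
            buffer[i] = v & mask;
        }
        bufferedBlock = b;
    }

    /** Returns the value for docId, decoding its block only if it is not already buffered. */
    public long get(int docId) {
        int b = docId / BLOCK_SIZE;
        if (b != bufferedBlock) {
            decodeBlock(b);
        }
        return buffer[docId % BLOCK_SIZE];
    }

    public static void main(String[] args) {
        long[] values = new long[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i * 31L) % 97;
        }
        values[500] = 1L << 40; // outlier only widens block 3
        BlockedValues bv = new BlockedValues(values);
        System.out.println(bv.get(500) == values[500]);
        System.out.println(bv.get(7) == values[7]);
    }
}
```

The trade-off this sketch makes visible: a random lookup now decodes up to 128 values instead of one, while sequential access amortizes that cost because consecutive doc IDs hit the buffered block.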

Attachments

benchmark: 464 kB, 26/Aug/21 06:53, weizijun
benchmark-10m: 538 kB, 27/Aug/21 08:12, weizijun

Issue Links

is depended upon by: LUCENE-7806 Explore delta of delta encoding (Open)

links to: GitHub Pull Request #227

People

Assignee: Unassigned

Reporter: Adrien Grand

Votes: 0

Watchers: 4

Dates

Created: 22/Jul/21 17:31

Updated: 28/Aug/22 16:23

Resolved: 13/Sep/21 16:50

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 10m