[LUCENE-8585] Create jump-tables for DocValues at index-time - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 8.0
Fix Version/s: 8.0
Component/s: core/codecs
Labels:
- performance

Lucene Fields:

New, Patch Available

Description

As noted in ~~LUCENE-7589~~, lookup of DocValues should use jump-tables to avoid long iterative walks. This is implemented in ~~LUCENE-8374~~ at search-time (first request for DocValues from a field in a segment), with the benefit of working without changes to existing Lucene 7 indexes and the downside of introducing a startup time penalty and a memory overhead.

As discussed in ~~LUCENE-8374~~, the codec should be updated to create these jump-tables at index time. This eliminates the segment-open time & memory penalties, with the potential downside of increasing index-time for DocValues.

The three elements of ~~LUCENE-8374~~ should be transferable to index-time without much alteration of the core structures:

IndexedDISI block offset and index skips: A long (64 bits) for every 65536 documents, containing the offset of the block in 33 bits and the index (number of set bits) up to the block in 31 bits.
It can be build sequentially and should be stored as a simple sequence of consecutive longs for caching of lookups.
As it is fairly small, relative to document count, it might be better to simply memory cache it?
IndexedDISI DENSE (> 4095, < 65536 set bits) blocks: A short (16 bits) for every 8 longs (512 bits) for a total of 256 bytes/DENSE_block. Each short represents the number of set bits up to right before the corresponding sub-block of 512 docIDs.
The shorts can be computed sequentially or when the DENSE block is flushed (probably the easiest). They should be stored as a simple sequence of consecutive shorts for caching of lookups, one logically independent sequence for each DENSE block. The logical position would be one sequence at the start of every DENSE block.
Whether it is best to read all the 16 shorts up front when a DENSE block is accessed or whether it is best to only read any individual short when needed is not clear at this point.
Variable Bits Per Value: A long (64 bits) for every 16384 numeric values. Each long holds the offset to the corresponding block of values.
The offsets can be computed sequentially and should be stored as a simple sequence of consecutive longs for caching of lookups.
The vBPV-offsets has the largest space overhead og the 3 jump-tables and a lot of the 64 bits in each long are not used for most indexes. They could be represented as a simple PackedInts sequence or MonotonicLongValues, with the downsides of a potential lookup-time overhead and the need for doing the compression after all offsets has been determined.

I have no experience with the codec-parts responsible for creating index-structures. I'm quite willing to take a stab at this, although I probably won't do much about it before January 2019. Should anyone else wish to adopt this JIRA-issue or co-work on it, I'll be happy to share.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

LUCENE-8585.patch
08/Dec/18 13:05
172 kB
Toke Eskildsen
LUCENE-8585.patch
07/Dec/18 13:25
150 kB
Toke Eskildsen
make_patch_lucene8585.sh
07/Dec/18 07:21
0.6 kB
Toke Eskildsen

Issue Links

is related to

LUCENE-8374 Reduce reads for sparse DocValues

Resolved

links to

GitHub Pull Request #525

Activity

People

Assignee:: Unassigned

Reporter:: Toke Eskildsen

Votes:: 1 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 03/Dec/18 15:45

Updated:: 02/Sep/24 01:06

Resolved:: 01/Mar/19 10:15

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

10.5h