[LUCENE-8374] Reduce reads for sparse DocValues - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 7.5, 8.0
Fix Version/s: None
Component/s: core/codecs
Labels:
- performance

Lucene Fields:

New, Patch Available

Description

The Lucene70DocValuesProducer has the internal classes SparseNumericDocValues and BaseSortedSetDocValues (sparse code path), which again uses IndexedDISI to handle the docID -> value-ordinal lookup. The value-ordinal is the index of the docID assuming an abstract tightly packed monotonically increasing list of docIDs: If the docIDs with corresponding values are [0, 4, 1432], their value-ordinals will be [0, 1, 2].

Outer blocks

The lookup structure of IndexedDISI consists of blocks of 2^16 values (65536), where each block can be either ALL, DENSE (2^12 to 2^16 values) or SPARSE (< 2^12 values ~= 6%). Consequently blocks vary quite a lot in size and ordinal resolving strategy.

When a sparse Numeric DocValue is needed, the code first locates the block containing the wanted docID flag. It does so by iterating blocks one-by-one until it reaches the needed one, where each iteration requires a lookup in the underlying IndexSlice. For a common memory mapped index, this translates to either a cached request or a read operation. If a segment has 6M documents, worst-case is 91 lookups. In our web archive, our segments has ~300M values: A worst-case of 4577 lookups!

One obvious solution is to use a lookup-table for blocks: A long[]-array with an entry for each block. For 6M documents, that is < 1KB and would allow for direct jumping (a single lookup) in all instances. Unfortunately this lookup-table cannot be generated upfront when the writing of values is purely streaming. It can be appended to the end of the stream before it is closed, but without knowing the position of the lookup-table the reader cannot seek to it.

One strategy for creating such a lookup-table would be to generate it during reads and cache it for next lookup. This does not fit directly into how IndexedDISI currently works (it is created anew for each invocation), but could probably be added with a little work. An advantage to this is that this does not change the underlying format and thus could be used with existing indexes.

The lookup structure inside each block

If ALL of the 2^16 values are defined, the structure is empty and the ordinal is simply the requested docID with some modulo and multiply math. Nothing to improve there.

If the block is DENSE (2^12 to 2^16 values are defined), a bitmap is used and the number of set bits up to the wanted index (the docID modulo the block origo) are counted. That bitmap is a long[1024], meaning that worst case is to lookup and count all set bits for 1024 longs!

One known solution to this is to use a [rank structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I [implemented it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java] for a related project and with that (), the rank-overhead for a DENSE block would be long[32] and would ensure a maximum of 9 lookups. It is not trivial to build the rank-structure and caching it (assuming all blocks are dense) for 6M documents would require 22 KB (3.17% overhead). It would be far better to generate the rank-structure at index time and store it immediately before the bitset (this is possible with streaming as each block is fully resolved before flushing), but of course that would require a change to the codec.

If SPARSE (< 2^12 values ~= 6%) are defined, the docIDs are simply in the form of a list. As a comment in the code suggests, a binary search through these would be faster, although that would mean seeking backwards. If that is not acceptable, I don't have any immediate idea for avoiding the full iteration.

I propose implementing query-time caching of both block-jumps and inner-block lookups for DENSE (using rank) as first improvement and an index-time DENSE-rank structure for future improvement. As query-time caching is likely to be too costly for rapidly-changing indexes, it should probably be an opt-in in solrconfig.xml.

Some real-world observations

This analysis was triggered by massive (10x) slowdown problems with both simple querying and large exports from our webarchive index after upgrading from Solr 4.10 to 7.3.1. The query-matching itself takes ½-2 seconds, but returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues fields), up from ½-2 seconds in total from Solr 4.10 (more of a mix of stored vs. DocValues, so might not be directly comparable).

Measuring with VisualVM points to NIOFSIndexInput.readInternal as the hotspot. We ran some tests with simple queries on a single 307,171,504 document segment with different single-value DocValued fields in the fl and got

Field	Type	Docs with value	Docs w/ val %	Speed in docs/sec
url	String	307,171,504	100%	12,500
content_type_ext	String	224,375,378	73%	360
author	String	1,506,365	0.5%	1,100
crawl_date	DatePoint	307,171,498	~100%	90
content_text_length	IntPoint	285,800,212	93%	410
content_length	IntPoint	307,016,816	99.9%	100
crawl_year	IntPoint	307,171,498	~100%	14,500
last_modified	DatePoint	6,835,065	2.2%	570
source_file_offset	LongPoint	307,171,504	100%	28,000

Note how both url and source_file_offset are very fast and also has a value for all documents. Contrary to this, content_type_ext is very slow and crawl_date is extremely slow and as they both have nearly all documents, I presume they are using IndexedDISI#DENSE. last_modified is also quite slow and presumably uses IndexedDISI#SPARSE.

The only mystery is crawl_year which is also present in nearly all documents, but is very fast. I have no explanation for that one yet.

I hope to take a stab at this around August 2018, but no promises.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

LUCENE-8374_part_2.patch
26/Nov/18 14:03
38 kB
Toke Eskildsen
LUCENE-8374_part_1.patch
26/Nov/18 14:03
44 kB
Toke Eskildsen
LUCENE-8374_part_3.patch
26/Nov/18 14:03
21 kB
Toke Eskildsen
LUCENE-8374_part_4.patch
26/Nov/18 14:03
2 kB
Toke Eskildsen
LUCENE-8374.patch
09/Nov/18 20:50
91 kB
Toke Eskildsen
LUCENE-8374.patch
07/Nov/18 13:32
95 kB
Toke Eskildsen
entire_index_logs.txt
27/Oct/18 01:39
586 kB
Tim Underwood
single_vehicle_logs.txt
27/Oct/18 01:39
472 kB
Tim Underwood
start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
24/Oct/18 14:32
801 kB
Tim Underwood
start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
24/Oct/18 14:32
820 kB
Tim Underwood
image-2018-10-24-07-30-56-962.png
24/Oct/18 14:30
818 kB
Tim Underwood
image-2018-10-24-07-30-06-663.png
24/Oct/18 14:30
813 kB
Tim Underwood
LUCENE-8374_branch_7_3.patch.20181005
22/Oct/18 07:37
113 kB
Toke Eskildsen
LUCENE-8374_branch_7_5.patch
22/Oct/18 07:34
106 kB
Toke Eskildsen
LUCENE-8374.patch
10/Sep/18 13:06
110 kB
Toke Eskildsen
LUCENE-8374.patch
27/Aug/18 11:37
100 kB
Toke Eskildsen
LUCENE-8374.patch
16/Aug/18 20:00
102 kB
Toke Eskildsen
LUCENE-8374_branch_7_3.patch
14/Aug/18 09:45
75 kB
Toke Eskildsen
LUCENE-8374_branch_7_4.patch
14/Aug/18 09:43
75 kB
Toke Eskildsen
LUCENE-8374.patch
13/Aug/18 13:53
79 kB
Toke Eskildsen
LUCENE-8374.patch
06/Jul/18 09:05
30 kB
Toke Eskildsen

Issue Links

relates to

SOLR-9599 DocValues performance regression with new iterator API

Open

LUCENE-7589 Prevent outliers from raising the number of bits of everyone with numeric doc values

Resolved

LUCENE-8585 Create jump-tables for DocValues at index-time

Closed

Reduce reads for sparse DocValues

Details

Description

Outer blocks

The lookup structure inside each block

Some real-world observations

Attachments

Attachments

Issue Links

Activity

People

Dates

Agile

Slack

Issue deployment