[LUCENE-9932] Performance improvement for BKD index building - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 8.8.2
Fix Version/s: 8.9
Component/s: core/index
Labels:
None

Lucene Fields:

New

Description

In BKD index building, the input bytes must be sorted before calling BKD writer related API. The sorting method leverages MSB Radix Sort algorithm, and the comparing method takes both the bytes itself and the DocId, but in real cases, DocIds are usually monotonically increasing. This could yield one possible performance enhancer. I found this enhancement when I dig into one performance issue in our system. Then I research on the possible solution.

DocId is usually increased by one when building index in a thread-safe way, by assuming such condition, the comparing method can eliminate the unnecessary comparing input - DocId, only leave the bytes itself to compare. In order to do so, MSB radix sorting and its fallback sorting method must be stable, so that when elements are the same, the sorting method maintains its original order when added, which makes DocId still monotonically increasing. To make MSB Radix Sort stable, it needs a trivial update; to make fallback sort table, use merge sort instead of quick sort. Meanwhile, there should introduce a switch which is able to turn the stable option on or off.

To validate how much performance could be gained. I make a benchmark taking down only the time elapsed in MutablePointsReaderUtils.sort stage.

Test environment:
MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3

Java version:
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

Testcase:
bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
dim = 1
doc num = 2,000,000
warm up 5 time, run 10 times to calculate average time used.

Result:

bytesPerDim\scenario	disable sort doc id (PR branch)	enable sort doc id (master branch)
1	30989.594 us	1151149.9 us
2	313469.47 us	1115595.1 us
3	844617.8 us	1465465.1 us
4	1350946.8 us	1465465.1 us
8	1344814.6 us	1458115.5 us
16	1344516.6 us	1459849.6 us
32	1386847.8 us	1583097.5 us

Result shows that, by disabling sort DocId, sorting runs 1.73x to 37x faster when there are many duplicate bytes (bytesPerDim = 1 or 2 or 3). When data cardinality is high (bytesPerDim >= 4, test cases will generate random bytes which are more scatter, not likely to be duplicate), the performance does not go backward, still a little better.

In conclusion, in the end to end process for building BKD index, which relies on BKDWriter for some data types, performance could be better by ignoring DocId if they are already monotonically increasing.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

benchmark_data.png
17/Apr/21 15:22
89 kB
neoremind
refined-code-benchmark.png
21/Apr/21 10:02
304 kB
neoremind
flame-graph.png
21/Apr/21 10:13
372 kB
neoremind
refined-code-benchmark2.png
21/Apr/21 15:12
348 kB
neoremind

Issue Links

links to

GitHub Pull Request #91

Activity

People

Assignee:: Unassigned

Reporter:: neoremind

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 17/Apr/21 15:21

Updated:: 28/Aug/22 16:20

Resolved:: 14/May/21 08:11

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

12.5h