[LUCENE-10216] Add concurrency to addIndexes(CodecReader…) API - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: main, 9.4
Component/s: core/index
Labels:
None

Lucene Fields:

New

Description

I work at Amazon Product Search, and we use Lucene to power search for the e-commerce platform. I’m working on a project that involves applying metadata+ETL transforms and indexing documents on n different indexing boxes, combining them into a single index on a separate reducer box, and making it available for queries on m different search boxes (replicas). Segments are asynchronously copied from indexers to reducers to searchers as they become available for the next layer to consume.

I am using the addIndexes API to combine multiple indexes into one on the reducer boxes. Since we also have taxonomy data, we need to remap facet field ordinals, which means I need to use the addIndexes(CodecReader…) version of this API. The API leverages SegmentMerger.merge() to create segments with new ordinal values while also merging all provided segments in the process.

This is however a blocking call that runs in a single thread. Until we have written segments with new ordinal values, we cannot copy them to searcher boxes, which increases the time to make documents available for search.

I was playing around with the API by creating multiple concurrent merges, each with only a single reader, creating a concurrently running 1:1 conversion from old segments to new ones (with new ordinal values). We follow this up with non-blocking background merges. This lets us copy the segments to searchers and replicas as soon as they are available, and later replace them with merged segments as background jobs complete. On the Amazon dataset I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. Each call was given about 5 readers to add on average.

This might be useful add to Lucene. We could create another addIndexes() API with a boolean flag for concurrency, that internally submits multiple merge jobs (each with a single reader) to the ConcurrentMergeScheduler, and waits for them to complete before returning.

While this is doable from outside Lucene by using your thread pool, starting multiple addIndexes() calls and waiting for them to complete, I felt it needs some understanding of what addIndexes does, why you need to wait on the merge and why it makes sense to pass a single reader in the addIndexes API.

Out of box support in Lucene could simplify this for folks a similar use case.

Attachments

Issue Links

links to

GitHub Pull Request #633

GitHub Pull Request #1051

Sub-Tasks

Modify the findMerges policy for addIndexes(CodecReader[]) to create 1 segment per reader

Open

Unassigned

Activity

People

Assignee:: Unassigned

Reporter:: Vigya Sharma

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 02/Nov/21 02:13

Updated:: 28/Aug/22 16:29

Resolved:: 01/Aug/22 14:44

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

11h 40m

Include sub-tasks