Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-10216

Add concurrency to addIndexes(CodecReader…) API



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • main, 9.4
    • core/index
    • None
    • New


      I work at Amazon Product Search, and we use Lucene to power search for the e-commerce platform. I’m working on a project that involves applying metadata+ETL transforms and indexing documents on n different indexing boxes, combining them into a single index on a separate reducer box, and making it available for queries on m different search boxes (replicas). Segments are asynchronously copied from indexers to reducers to searchers as they become available for the next layer to consume.

      I am using the addIndexes API to combine multiple indexes into one on the reducer boxes. Since we also have taxonomy data, we need to remap facet field ordinals, which means I need to use the addIndexes(CodecReader…) version of this API. The API leverages SegmentMerger.merge() to create segments with new ordinal values while also merging all provided segments in the process.

      This is however a blocking call that runs in a single thread. Until we have written segments with new ordinal values, we cannot copy them to searcher boxes, which increases the time to make documents available for search.

      I was playing around with the API by creating multiple concurrent merges, each with only a single reader, creating a concurrently running 1:1 conversion from old segments to new ones (with new ordinal values). We follow this up with non-blocking background merges. This lets us copy the segments to searchers and replicas as soon as they are available, and later replace them with merged segments as background jobs complete. On the Amazon dataset I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. Each call was given about 5 readers to add on average.

      This might be useful add to Lucene. We could create another addIndexes() API with a boolean flag for concurrency, that internally submits multiple merge jobs (each with a single reader) to the ConcurrentMergeScheduler, and waits for them to complete before returning.

      While this is doable from outside Lucene by using your thread pool, starting multiple addIndexes() calls and waiting for them to complete, I felt it needs some understanding of what addIndexes does, why you need to wait on the merge and why it makes sense to pass a single reader in the addIndexes API.

      Out of box support in Lucene could simplify this for folks a similar use case.




            Unassigned Unassigned
            vigyas Vigya Sharma
            0 Vote for this issue
            5 Start watching this issue



              Time Tracking

                Original Estimate - Not Specified
                Not Specified
                Remaining Estimate - 0h
                Time Spent - 11h 40m
                11h 40m