  1. Lucene - Core
  2. LUCENE-8580

Make segment merging parallel in SegmentMerger

Details

    • Task
    • Status: Reopened
    • Minor
    • Resolution: Unresolved

    Description

      A placeholder issue stemming from the discussion on the mailing list [1]. Not a high priority.

      At the moment, merging N segments into one processes each data structure in a segment (postings, norms, points, etc.) sequentially, one after another. If the input segments are large, the CPU (and I/O) sit mostly unused and the process takes a long time.

      Merging of these data structures is mostly independent of each other, so it'd be interesting to see if we can speed things up by allowing them to run concurrently. I investigated this on a 40GB index with 22 segments, force-merging this into 1 segment (of similar size). Quick and dirty patch attached.
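
      To make the idea concrete, here is a minimal sketch of running independent merge steps concurrently on the common fork/join pool (the pool whose worker threads appear in the "After" log below). It is not the attached patch and not Lucene's SegmentMerger API: the class name, the mergeAll method, and the placeholder step bodies are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from Lucene. */
public class ParallelMergeSketch {

    /** Starts every named step on ForkJoinPool.commonPool() and waits for all of them. */
    public static Map<String, Integer> mergeAll(Map<String, Supplier<Integer>> steps) {
        Map<String, CompletableFuture<Integer>> futures = new LinkedHashMap<>();
        // supplyAsync without an executor argument uses the common pool
        steps.forEach((name, step) -> futures.put(name, CompletableFuture.supplyAsync(step)));

        Map<String, Integer> results = new LinkedHashMap<>();
        // join() blocks until the step finishes; a failure in a step
        // surfaces here wrapped in a CompletionException
        futures.forEach((name, future) -> results.put(name, future.join()));
        return results;
    }

    public static void main(String[] args) {
        // Placeholder steps standing in for the per-structure merges;
        // each returns a made-up "documents merged" count.
        Map<String, Supplier<Integer>> steps = new LinkedHashMap<>();
        steps.put("stored fields", () -> 3);
        steps.put("norms", () -> 3);
        steps.put("postings", () -> 3);
        System.out.println(mergeAll(steps));
    }
}
```

      With this shape the total wall time is bounded below by the slowest step, which is why a single dominant component limits the gain.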

      I see some improvement, although it's not by much (total merge time drops from 1112238 msec to 950249 msec, roughly 15%); the largest component, postings, dominates everything else.

      Results from an 8-core CPU.
      Before:

      SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields [41922110 docs]
      SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 docs]
      SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings [41922110 docs]
      SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 docs]
      SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
      SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 docs]
      
      IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 docs
      

      After:

      SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec to merge norms
      SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge doc values
      SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge points
      SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge store matchedCount=22 vs 22
      SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 msec to merge stored fields [41922110 docs]
      SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 msec to merge postings
      
      IW 0 [2018-11-30T10:32:23.513Z; main]: merge time  950249 msec for 41922110 docs
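
      As a rough sanity check on these numbers, the sketch below compares the observed speedup with the best case for component-level parallelism, where the wall time equals the slowest component. It assumes each component would take the same time when run concurrently as it did sequentially, which the "After" log shows is not the case (postings and stored fields both got slower). The class and method names are illustrative only.

```java
/** Back-of-the-envelope estimate using the figures from the logs above. */
public class MergeSpeedupEstimate {

    // Per-component times (msec) from the "Before" log:
    // stored fields, norms, postings, doc values, points, field infos
    public static final long[] BEFORE_MSEC = {347237, 6562, 755507, 0, 0, 7};

    public static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static long max(long[] values) {
        long best = 0;
        for (long v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        long sequential = sum(BEFORE_MSEC);   // all components back to back
        long lowerBound = max(BEFORE_MSEC);   // slowest component alone
        double idealSpeedup = (double) sequential / lowerBound;
        // Total merge times reported by IndexWriter before and after
        double observedSpeedup = 1112238.0 / 950249.0;
        System.out.printf("ideal %.2fx, observed %.2fx%n", idealSpeedup, observedSpeedup);
    }
}
```

      Under that assumption the ceiling is roughly 1.47x, against an observed speedup of about 1.17x, so even a perfect version of this approach leaves most of the merge time in postings.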
      

      Ideally, one would push fork/join parallelism down into the individual subroutines so that, for example, postings merging itself runs concurrently (pulling blocks of terms concurrently from the input, calculating statistics, etc., and then pushing the results to the codec in an ordered fashion).
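
      The "process concurrently, push in order" pattern described above can be sketched as follows. This is only an illustration of the pattern: the blocks, the work function, and the sink are hypothetical stand-ins for term blocks, statistics computation, and the codec writer, and nothing here reflects Lucene's actual postings API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch of concurrent processing with ordered output. */
public class OrderedPipelineSketch {

    /**
     * Processes every block on the common fork/join pool, then hands the
     * results to the sink on the calling thread, in input order.
     */
    public static <I, O> void processInOrder(List<I> blocks, Function<I, O> work, Consumer<O> sink) {
        List<CompletableFuture<O>> pending = new ArrayList<>();
        for (I block : blocks) {
            pending.add(CompletableFuture.supplyAsync(() -> work.apply(block)));
        }
        // Joining in submission order preserves the input order even if
        // later blocks finish first.
        for (CompletableFuture<O> future : pending) {
            sink.accept(future.join());
        }
    }

    public static void main(String[] args) {
        // Stand-in "term blocks"; the work counts the terms in each block.
        List<String> blocks = Arrays.asList("apple banana", "cherry", "date elderberry fig");
        List<Integer> written = new ArrayList<>();
        processInOrder(blocks, block -> block.split(" ").length, written::add);
        System.out.println(written); // prints [2, 1, 3]
    }
}
```

      This version submits every block up front, so all results may be held in memory at once; a real implementation would presumably cap the number of blocks in flight.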

      [1] https://markmail.org/thread/dtejwq42qagykeac

      Attachments

        1. LUCENE-8580.patch
          8 kB
          Dawid Weiss

          People

            dweiss Dawid Weiss
            Votes: 0
            Watchers: 6