  1. Lucene - Core
  2. LUCENE-10599

Improve LogMergePolicy's handling of maxMergeSize

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.3
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      LogMergePolicy excludes a segment from merging when its size is greater than or equal to maxMergeSize. Since a segment whose size is maxMergeSize-1 is still considered for merging, segments will effectively reach a size somewhere between maxMergeSize and mergeFactor*maxMergeSize before they stop being considered for merging.

      At least this is what I thought. When LogMergePolicy ignores a segment that is too large for merging, it also ignores the other segments in the same window of mergeFactor segments if they are on the same tier. So segments might actually reach a size anywhere between maxMergeSize / mergeFactor^0.75 and maxMergeSize * mergeFactor before they stop being considered for merging.

      Assuming a merge factor of 10 and a max merge size of 1,000, this means that segments will reach their maximum size somewhere between 178 and 10,000. This range is too large and makes maxMergeSize too hard to reason about.
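      The arithmetic behind these two bounds can be checked with a short sketch. This is not Lucene code: the class and method names below are hypothetical, and the formulas are taken directly from the description above.

```java
public final class MergeSizeBounds {

    // Lower bound quoted in the description: maxMergeSize / mergeFactor^0.75.
    static double lowerBound(long maxMergeSize, int mergeFactor) {
        return maxMergeSize / Math.pow(mergeFactor, 0.75);
    }

    // Upper bound quoted in the description: maxMergeSize * mergeFactor.
    static long upperBound(long maxMergeSize, int mergeFactor) {
        return maxMergeSize * mergeFactor;
    }

    public static void main(String[] args) {
        long maxMergeSize = 1_000;
        int mergeFactor = 10;
        System.out.println("lower bound ~ " + Math.round(lowerBound(maxMergeSize, mergeFactor)));
        System.out.println("upper bound = " + upperBound(maxMergeSize, mergeFactor));
    }
}
```

      With a merge factor of 10 and a max merge size of 1,000, the lower bound rounds to 178 and the upper bound is 10,000, matching the figures above.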

      Specifically, if you have 10 segments of 999 docs each, then LogDocMergePolicy will happily merge them into a single 9,990-doc segment. However, if you have one 1,000-doc segment and 9 segments of 180 docs each, then the 180-doc segments will not get merged with any other segment, even if you keep adding segments to the index.

      I propose changing this behavior so that when a segment that is too large is encountered, we skip only that segment rather than the entire window of mergeFactor segments.
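      The difference between the two behaviors can be illustrated with a simplified model. This is a sketch only, not LogMergePolicy's actual implementation: the class and method names are hypothetical, segments are reduced to their sizes, and tiers (levels) are ignored by assuming every segment in the window is on the same tier.

```java
import java.util.ArrayList;
import java.util.List;

public final class WindowFilterSketch {

    // Behavior described in the issue: a single segment with
    // size >= maxMergeSize disqualifies the whole window.
    static List<Long> eligibleCurrent(List<Long> window, long maxMergeSize) {
        for (long size : window) {
            if (size >= maxMergeSize) {
                return new ArrayList<>(); // nothing in this window is merged
            }
        }
        return new ArrayList<>(window);
    }

    // Proposed behavior: skip only the segments that are too large.
    static List<Long> eligibleProposed(List<Long> window, long maxMergeSize) {
        List<Long> eligible = new ArrayList<>();
        for (long size : window) {
            if (size < maxMergeSize) {
                eligible.add(size);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // One 1,000-doc segment followed by nine 180-doc segments.
        List<Long> window = new ArrayList<>();
        window.add(1_000L);
        for (int i = 0; i < 9; i++) {
            window.add(180L);
        }
        System.out.println("current:  " + eligibleCurrent(window, 1_000L));
        System.out.println("proposed: " + eligibleProposed(window, 1_000L));
    }
}
```

      In this model, the window with one 1,000-doc segment and nine 180-doc segments yields no merge candidates under the current behavior, while the proposed behavior keeps the nine 180-doc segments eligible.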

            People

              Assignee: Unassigned
              Reporter: Adrien Grand (jpountz)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m