Lucene - Core / LUCENE-854

Create merge policy that doesn't periodically inadvertently optimize

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current merge policy, at every maxBufferedDocs *
      power-of-mergeFactor docs added, will do a fully cascaded merge, which
      is the same as an optimize.

      I think this is not good because at that "optimization point", the
      particular addDocument call is [surprisingly] very expensive. While
      the cost, amortized over all addDocument calls, is low, it is paid
      "up front" and in a very "bunched up" manner.

      I think of this as "pay it forward": you are paying the full cost of
      an optimize right now on the expectation / hope that you will be
      adding a great many more docs. But, if you don't add that many more
      docs, then, the amortized cost for your index is in fact far higher
      than it should have been. Better to "pay as you go" instead.

      So we could make a small change to the policy by only merging the
      first mergeFactor segments once we hit 2X the merge factor. With
      mergeFactor=10, when we have created the 20th level 0 (just flushed)
      segment, we merge the first 10 into a level 1 segment. Then on
      creating another 10 level 0 segments, we merge the second set of 10
      level 0 segments into a level 1 segment, etc.
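
      As a concrete sketch of that trigger (my illustration, not the
      actual patch; Segment, PayAsYouGoPolicy and findMerge are
      hypothetical stand-ins):

        import java.util.List;

        /** Placeholder for one flushed segment at some level. */
        class Segment {}

        class PayAsYouGoPolicy {
          private final int mergeFactor = 10;

          /**
           * levelSegments holds the segments at one level, oldest first.
           * Returns the sublist to merge now, or null to keep accumulating.
           */
          List<Segment> findMerge(List<Segment> levelSegments) {
            // Wait until 2X mergeFactor segments pile up at this level,
            // then merge only the oldest mergeFactor of them; no single
            // addDocument call ever forces a full cascade.
            if (levelSegments.size() >= 2 * mergeFactor) {
              return levelSegments.subList(0, mergeFactor);
            }
            return null; // pay as you go: nothing to merge yet
          }
        }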

      With this new merge policy, an index that's a bit bigger than a
      current "optimization point" would then have a lower amortized cost
      per document. Plus the merge cost is less "bunched up" and less "pay
      it forward": instead you pay for what you are actually using.

      We can start by creating this merge policy (probably combined with
      the "by size not by doc count" segment level computation from
      LUCENE-845) and then later decide whether we should make it the
      default merge policy.

      Attachments

      1. LUCENE-854.patch
        110 kB
        Michael McCandless

        Activity

        Robert Muir added a comment -

        Bulk closing for 3.2
        Uwe Schindler added a comment -

        Nice videos, have already seen them on twitter yesterday g
        Michael McCandless added a comment -

        I put up a blog post showing a movie of how TieredMP differs from LogByteSizeMP: http://chbits.blogspot.com/2011/02/visualizing-lucenes-segment-merges.html
        Michael McCandless added a comment -

        I created a new merge policy, to take advantage of non-contiguous merging (LUCENE-1076) and fix certain limitations of LogMergePolicy.

        The new policy does not support contiguous merging; it always merges by byte size, pro-rated by the percentage of deletes.

        The policy's core logic is similar to LogMP, in that it tries to merge roughly equal sized segments at once, maxMergeAtOnce (renamed from mergeFactor) at a time, resulting in the usual exponential staircase pattern when you feed it roughly equal sized segments.

        You configure the approx max merged segment size (unlike LogMP where you configure the max to-be-merged size, which was always a source of confusion!). Unlike LogMP, when segments are getting close to being too large, the new policy will merge fewer segs, eg down to merging pairwise, to reach approx the max allowed size. This is important since it makes that setting more "accurate"; I now default it to 5 GB (vs LogMP's 2 GB).

        There is a separate maxMergeAtOnceExplicit that controls "explicit" merging (ie, app calls optimize or expungeDeletes, and maybe in the future also addIndexes); I defaulted it to 30. There is no max segment size for optimize.

        The big difference vs LogMP is that the new policy does not "over-merge", meaning it does not "pay it forward"/forcefully cascade the way LogMP does today. This fixes the "inadvertent optimize" that LogMP does.

        For an index of any given size, the new policy computes a budget of how many segments that index is allowed to have (ie, it enumerates the steps in the staircase, based on mergeAtOnce, the [floored] min segment size, and the total bytes in the index); then, if the index is over budget, it picks the least-cost merge. This results in a smoother progression of segment counts over time.
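
        That budget enumeration might look roughly like this (my
        paraphrase of the description above, not the patch code):

          class SegmentBudget {
            // Enumerate the staircase bottom-up: segsPerTier segments are
            // allowed per tier, and each tier's segments are mergeAtOnce
            // times larger than the previous tier's.
            static long allowedSegmentCount(long totalIndexBytes,
                                            long floorSegmentBytes,
                                            int mergeAtOnce, int segsPerTier) {
              long levelSize = floorSegmentBytes; // [floored] min segment size
              long bytesLeft = totalIndexBytes;
              long allowed = 0;
              while (true) {
                double segsAtLevel = (double) bytesLeft / levelSize;
                if (segsAtLevel < segsPerTier) {
                  allowed += (long) Math.ceil(segsAtLevel); // last, partial tier
                  break;
                }
                allowed += segsPerTier;       // one full tier at this size
                bytesLeft -= (long) segsPerTier * levelSize;
                levelSize *= mergeAtOnce;     // next tier's segments are bigger
              }
              return allowed; // more segments than this => pick a merge
            }
          }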

        There is a new configuration, segmentsPerTier, that lets you control how many segments per level you can "tolerate". This is a nice knob for trading off merge cost vs search cost. It defaults to 10, which matches the staircase pattern that LogMP produces, but you can now control the "width" of the stairs separately from how many segments are merged at once for non-explicit merges.

        It has useCompoundFile and noCFSRatio just like LogMP.

        It has a new setting "expungeDeletesPctAllowed", default 10%, which allows expungeDeletes to skip merging a segment if it has < 10% deletions.
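
        Pulling those settings together, configuring the new policy might
        look like the following (setter names are assumed from the
        settings listed above, as of the 3.2-era API; treat this as a
        sketch, not the committed interface):

          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.index.TieredMergePolicy;

          class TieredMPSetup {
            static void apply(IndexWriterConfig iwc) {
              TieredMergePolicy tmp = new TieredMergePolicy();
              tmp.setMaxMergeAtOnce(10);             // renamed from mergeFactor
              tmp.setMaxMergeAtOnceExplicit(30);     // optimize / expungeDeletes
              tmp.setMaxMergedSegmentMB(5 * 1024);   // approx max merged size, ~5 GB
              tmp.setSegmentsPerTier(10.0);          // tolerated segments per level
              tmp.setExpungeDeletesPctAllowed(10.0); // skip segs with < 10% deletes
              tmp.setUseCompoundFile(true);          // same knobs as LogMP
              tmp.setNoCFSRatio(0.1);                // illustrative value
              iwc.setMergePolicy(tmp);
            }
          }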

        I think we should keep LogMergePolicy available for apps that want contiguous merging, merging by doc count, no pro-rating by deletions, or a max segment size enforced during optimize. But, with this, I'd remove the non-contiguous support for LogMergePolicy that was added under LUCENE-1076, and make this new MP the default one.


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            1
          • Watchers:
            4
