LUCENE-982: Create new method optimize(int maxNumSegments) in IndexWriter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Spinning this out from the discussion in LUCENE-847.

      I think having a way to "slightly optimize" your index would be useful
      for many applications.

      The current optimize() call is very expensive for large indices
      because it always optimizes fully down to 1 segment. If we add a new
      method that is instead allowed to stop once the index has <=
      maxNumSegments segments, applications could, e.g., optimize down to
      <= 10 segments after doing a bunch of updates. This should be a nice
      compromise: you gain much of the search speedup while not paying the
      full (and typically very high) cost of optimizing down to a single
      segment.

      Since LUCENE-847 is now formalizing an API for decoupling merge policy
      from IndexWriter, if we want to add this new optimize method, we need
      to take it into account in LUCENE-847.
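
      For illustration, using the proposed method might look like the
      sketch below. This is only a sketch against the 2.3-era IndexWriter
      API; the index path and the documents being added are placeholder
      assumptions, not part of the proposal itself.

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexWriter;

          public class PartialOptimizeExample {
              public static void main(String[] args) throws Exception {
                  // "/path/to/index" is a placeholder; true = create a new index there.
                  IndexWriter writer =
                      new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

                  // Do a bunch of adds/updates, producing many small segments.
                  for (int i = 0; i < 100000; i++) {
                      Document doc = new Document();
                      doc.add(new Field("id", Integer.toString(i),
                                        Field.Store.YES, Field.Index.UN_TOKENIZED));
                      writer.addDocument(doc);
                  }

                  // Partially optimize: stop merging once the index has <= 10
                  // segments, rather than paying the full cost of merging all
                  // the way down to a single segment.
                  writer.optimize(10);
                  writer.close();
              }
          }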

      Attachments:
      1. LUCENE-982.patch (14 kB) by Michael McCandless

        Activity

        Yonik Seeley added a comment -

        +1
        Sounds like a great idea.

        Mike Klaas added a comment -

        One heuristic that has been quite useful for us is to skip optimizing segments that occupy some fixed fraction of the index. The remainder of the segments are optimized as usual (the heuristic can be applied recursively). 70% is a decent number.
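
        As a rough illustration (this sketch is not from any patch; the
        sizes array, the names, and the 0.7 cutoff are assumptions), the
        heuristic might look like:

            // Given segment sizes in bytes, largest first, return the index of
            // the first segment that should still be merged; everything before
            // it (the biggest ~skipFraction of the index) is left alone. The
            // tail [i..end) is optimized as usual, and the same rule can be
            // applied recursively to that tail.
            static int firstMergeableSegment(long[] sizes, double skipFraction) {
                long total = 0;
                for (long size : sizes) total += size;
                long skipped = 0;
                int i = 0;
                while (i < sizes.length && skipped + sizes[i] <= skipFraction * total) {
                    skipped += sizes[i];
                    i++;
                }
                return i;
            }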

        Michael McCandless added a comment -

        Attached patch to implement new optimize(int maxNumSegments).

        I fixed LogMergePolicy to respect the maxNumSegments arg. When
        optimizing, we first merge every group of mergeFactor segments,
        working back from the tail; these merges can run concurrently.

        Then, for the final partial (< mergeFactor segments) merge, we pick
        contiguous segments that are the smallest net size, as long as the
        resulting merged segment is not > 2X larger than the segment to its
        left (to prevent creating a lopsided index over time).
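
        Roughly, that final partial-merge selection could be sketched as
        below. This is a simplification (the names are mine, and the real
        LogMergePolicy also deals with segment levels, deletions, and
        merges already in flight):

            // Among all contiguous windows of numToMerge segments, pick the one
            // with the smallest net size, rejecting any window whose merged size
            // would be more than 2X the segment immediately to its left (which
            // would create a lopsided index over time). Returns the start index,
            // or -1 if no window satisfies the constraint.
            static int pickMergeStart(long[] sizes, int numToMerge) {
                int best = -1;
                long bestSize = Long.MAX_VALUE;
                for (int start = 0; start + numToMerge <= sizes.length; start++) {
                    long sum = 0;
                    for (int i = start; i < start + numToMerge; i++) sum += sizes[i];
                    if (start > 0 && sum > 2 * sizes[start - 1]) continue;
                    if (sum < bestSize) {
                        bestSize = sum;
                        best = start;
                    }
                }
                return best;
            }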

        Michael Busch added a comment -

        Mike,

        I looked at your patch and I'm wondering if it wouldn't make more
        sense to limit the overall size of the segments (MB and/or num docs)
        involved in a merge rather than the number of segments?

        Michael McCandless added a comment -

        > I looked at your patch and I'm wondering if it wouldn't make more
        > sense to limit the overall size of the segments (MB and/or num docs)
        > involved in a merge rather than the number of segments?

        Thanks for reviewing!

        I think that's a good idea!

        But I'm torn on which is "better" as a first step.

        If we limit by size, then the benefit is, even as your index grows
        very large, the cost of optimizing is constant once you hit the max
        segment size. You keep your optimize cost down.

        But, then, your searches will get slower and slower as your index
        grows since these large segments never get merged (actually you'd have
        to set maxMergeDocs as well so normal merging wouldn't merge them).
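
        Concretely, that would be something like the snippet below, using
        the existing IndexWriter setter; the 1,000,000 cap is an arbitrary
        figure for illustration only:

            // Keep normal merging away from the giant segments: segments with
            // more than maxMergeDocs documents are excluded from further merges.
            writer.setMaxMergeDocs(1000000);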

        But by limiting segment count, I think you keep your search costs
        lower, at the expense of higher and higher optimize costs as your
        index gets larger.

        I think people optimize because they want to pay a high cost, once,
        now, in order to have fast[er] searches. So by limiting segment count
        during optimizing, we still leave the increasing cost (as your index
        grows) on the optimize() call.

        I think we should eventually do both?

        The good news is with the ability to customize MergePolicy, anyone can
        customize what it means to "optimize" an index just by implementing
        their own MergePolicy.
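
        For example, the "do both" idea could drive a custom policy's
        selection logic roughly as below. This is a plain-Java sketch with
        made-up names; it deliberately avoids the exact MergePolicy method
        signatures, which LUCENE-847 was still settling:

            // Limit by size AND by count: merge further only while too many
            // segments remain and at least two of them are still under the
            // size cap (segments at/over the cap are left alone for good).
            static boolean wantsMoreMerging(long[] segmentSizes,
                                            int maxNumSegments,
                                            long maxSegmentBytes) {
                int mergeable = 0;
                for (long size : segmentSizes) {
                    if (size < maxSegmentBytes) mergeable++;
                }
                return segmentSizes.length > maxNumSegments && mergeable >= 2;
            }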

        Michael Busch added a comment -

        > I think people optimize because they want to pay a high cost, once,
        > now, in order to have fast[er] searches. So by limiting segment count
        > during optimizing, we still leave the increasing cost (as your index
        > grows) on the optimize() call.

        Yeah, good point. I understand the use case for maxNumSegments.

        > I think we should eventually do both?

        +1

        Michael McCandless added a comment -

        OK I plan to commit this in a day or two.

        Michael McCandless added a comment -

        I just committed this.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 0
