Lucene - Core
  1. Lucene - Core
  2. LUCENE-2701

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimze. I think it'd be good if we take that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB. Obviously both may not be satisfied, and therefore we will guarantee that if there is any segment above the threshold, the threshold constraint takes precedence and therefore you may end up w/ <maxNumSegments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration.

      As part of this change, I plan to change some methods to protected (from private) and members as well. I realized that if one wishes to implement his own LMP extension, he needs to either put it under o.a.l.index or copy some code over to his impl.

      I'll attach a patch shortly.

      1. LUCENE-2701.patch
        16 kB
        Shai Erera
      2. LUCENE-2701.patch
        24 kB
        Shai Erera
      3. LUCENE-2701.patch
        25 kB
        Shai Erera
      4. LUCENE-2701-2.patch
        7 kB
        Shai Erera

        Activity

        Hide
        Shai Erera added a comment -

        Patch adds maxMergeMB handling to optimize as well. If there are no segments exceeding the threshold, then only maxNumSegments constraint is taken into account. Basically I've created two private methods findMergesForOptimizeMaxMergeSize and findMergesForOptimizeMaxNumSegments (the original logic). findMergesForOptimize calls the relevant one.

        I've also changed some members to protected and methods as well, for really easy extension of LMP. As a result, I removed two methods from BalancedSegmentsMP that were copied over from LMP.

        I took the opportunity to change OneMerge.segments and userCompoundfile to public - they are final so no risk of changing from the outside. But otherwise, if you would like to write a MP which queries the OneMerge objects, you can't. I added totalSize() to return the total size in bytes of that merge.

        Test + CHANGES entry as well.

        Show
        Shai Erera added a comment - Patch adds maxMergeMB handling to optimize as well. If there are no segments exceeding the threshold, then only maxNumSegments constraint is taken into account. Basically I've created two private methods findMergesForOptimizeMaxMergeSize and findMergesForOptimizeMaxNumSegments (the original logic). findMergesForOptimize calls the relevant one. I've also changed some members to protected and methods as well, for really easy extension of LMP. As a result, I removed two methods from BalancedSegmentsMP that were copied over from LMP. I took the opportunity to change OneMerge.segments and userCompoundfile to public - they are final so no risk of changing from the outside. But otherwise, if you would like to write a MP which queries the OneMerge objects, you can't. I added totalSize() to return the total size in bytes of that merge. Test + CHANGES entry as well.
        Hide
        Shai Erera added a comment -

        Added support for maxMergeDocs as well. Also, I created a test class for size bounded optimize and added several test cases.

        I think it's ready to commit, but I'll wait a few days for some reviews.

        Show
        Shai Erera added a comment - Added support for maxMergeDocs as well. Also, I created a test class for size bounded optimize and added several test cases. I think it's ready to commit, but I'll wait a few days for some reviews.
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        Maybe rename OneMerge.totalSize -> totalSizeInBytes? Hmm does anyone
        actually call this new method?

        Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB
        constraint) is able to merge fewer than mergeFactor segments at a
        time?

        This code is a bit confusing:

               if (last - start - 1 > 1) {
                 // there is more than 1 segment to the right of this one.
                 spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile));
               } else if (start != last - 1 && !isOptimized(infos.info(start + 1))) {
                  spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile));
               }
        

        Both if clauses are doing the same thing right? (Ie merging the chunk
        of segs to the right). Maybe put a comment explaining the 2nd one? (I
        think it's for the case where there's 1 segment to our right but it's
        not optimized, eg the CFS differs?). Or maybe consolidate into a single
        if?

        Show
        Michael McCandless added a comment - Patch looks good! Maybe rename OneMerge.totalSize -> totalSizeInBytes? Hmm does anyone actually call this new method? Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB constraint) is able to merge fewer than mergeFactor segments at a time? This code is a bit confusing: if (last - start - 1 > 1) { // there is more than 1 segment to the right of this one. spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile)); } else if (start != last - 1 && !isOptimized(infos.info(start + 1))) { spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile)); } Both if clauses are doing the same thing right? (Ie merging the chunk of segs to the right). Maybe put a comment explaining the 2nd one? (I think it's for the case where there's 1 segment to our right but it's not optimized, eg the CFS differs?). Or maybe consolidate into a single if?
        Hide
        Shai Erera added a comment -

        You're right about the code - the 'else if' is in case there is one not optimized segment to the right. Added a comment and combined them into one OR-ed if. Also added a test case.

        OneMerge.totalSizeInBytes – no one calls it now, but I would like to write a MP which will, and remove merges that exceed a specified total size. It's just a service method, instead of you needing to write it on your own. I renamed it to totalBytesSize. And on the way added totalNumDocs, doing the same for the number of docs.

        Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB constraint) is able to merge fewer than mergeFactor segments at a time?

        Wasn't it able to do so even before? E.g. if maxNumSegments < numSegments < mergeFactor?

        Show
        Shai Erera added a comment - You're right about the code - the 'else if' is in case there is one not optimized segment to the right. Added a comment and combined them into one OR-ed if. Also added a test case. OneMerge.totalSizeInBytes – no one calls it now, but I would like to write a MP which will, and remove merges that exceed a specified total size. It's just a service method, instead of you needing to write it on your own. I renamed it to totalBytesSize. And on the way added totalNumDocs, doing the same for the number of docs. Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB constraint) is able to merge fewer than mergeFactor segments at a time? Wasn't it able to do so even before? E.g. if maxNumSegments < numSegments < mergeFactor?
        Hide
        Shai Erera added a comment -

        Committed revision 1025544 (3x).
        Committed revision 1025577 (trunk).

        Show
        Shai Erera added a comment - Committed revision 1025544 (3x). Committed revision 1025577 (trunk).
        Hide
        Simon Willnauer added a comment -

        This change together with LUCENE-2773 caused a change of the IW#optimize() and friends semantics.
        IW#optimize() says:

         /**
           * Requests an "optimize" operation on an index, priming the index
           * for the fastest available search. Traditionally this has meant
           * merging all segments into a single segment as is done in the
           * default merge policy, but individual merge policies may implement
           * optimize in different ways.
           *
        
        

        Which is not entirely true anymore since default now uses

          /** Default maximum segment size.  A segment of this size
           *  or larger will never be merged.  @see setMaxMergeMB */
          public static final double DEFAULT_MAX_MERGE_MB = 2048;
        

        this is not what I would expect from optimize() even if it would be documented that way. A plain optimize call should by default result in a single segment IMO. Yet, we could make this set by a flag in LogMergePolicy maybe something like LogMergePolicy#useMasMergeSizeForOptimize = false; as a default?

        Show
        Simon Willnauer added a comment - This change together with LUCENE-2773 caused a change of the IW#optimize() and friends semantics. IW#optimize() says: /** * Requests an "optimize" operation on an index, priming the index * for the fastest available search. Traditionally this has meant * merging all segments into a single segment as is done in the * default merge policy, but individual merge policies may implement * optimize in different ways. * Which is not entirely true anymore since default now uses /** Default maximum segment size. A segment of this size * or larger will never be merged. @see setMaxMergeMB */ public static final double DEFAULT_MAX_MERGE_MB = 2048; this is not what I would expect from optimize() even if it would be documented that way. A plain optimize call should by default result in a single segment IMO. Yet, we could make this set by a flag in LogMergePolicy maybe something like LogMergePolicy#useMasMergeSizeForOptimize = false; as a default?
        Hide
        Jason Rutherglen added a comment -

        I agree that there should not be a defaults for the max merge segment size for optimize, though it's good to have the option.

        Show
        Jason Rutherglen added a comment - I agree that there should not be a defaults for the max merge segment size for optimize, though it's good to have the option.
        Hide
        Shai Erera added a comment -

        I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VAL. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it.

        I expect that if I use IW with a LMP that sets maxMergeMB, then even if I call optimize() this setting will take effect.

        BTW, I don't remember introducin this defaul as part of this issue. This issue only changed LMP to take the already existed setting into account. So maybe reverting this default should be handled within the issue I was changed in?

        Show
        Shai Erera added a comment - I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VAL. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it. I expect that if I use IW with a LMP that sets maxMergeMB, then even if I call optimize() this setting will take effect. BTW, I don't remember introducin this defaul as part of this issue. This issue only changed LMP to take the already existed setting into account. So maybe reverting this default should be handled within the issue I was changed in?
        Hide
        Simon Willnauer added a comment -

        BTW, I don't remember introducin this defaul as part of this issue. This issue only changed LMP to take the already existed setting into account. So maybe reverting this default should be handled within the issue I was changed in?

        True this was done in here: LUCENE-2773 - but this seemed to be more related?!

        I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VAL. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it.

        I think mike did that on purpose to prevent large segs from merging during indexing.... so what is wrong with disable that limit during optimize?

        Show
        Simon Willnauer added a comment - BTW, I don't remember introducin this defaul as part of this issue. This issue only changed LMP to take the already existed setting into account. So maybe reverting this default should be handled within the issue I was changed in? True this was done in here: LUCENE-2773 - but this seemed to be more related?! I don't think we need a useDefaultMaxMergeMb. Instead, we can default the member to Long.MAX_VAL. That way, if no one sets it, all segments will be considered for merge, and if one wants, he can set it. I think mike did that on purpose to prevent large segs from merging during indexing.... so what is wrong with disable that limit during optimize?
        Hide
        Michael McCandless added a comment -

        I think mike did that on purpose to prevent large segs from merging during indexing.

        Right – large merges are really quite nasty – mess up searches, NRT turnaround, IW.close() suddenly unexpectedly takes like an hour, etc.

        But, really the best fix, which I'd love to do at some point, is to fix our merge policy so that insanely large merges are done w/ fewer segments (eg only 2 segments at once). I think BalancedMP does this.

        Show
        Michael McCandless added a comment - I think mike did that on purpose to prevent large segs from merging during indexing. Right – large merges are really quite nasty – mess up searches, NRT turnaround, IW.close() suddenly unexpectedly takes like an hour, etc. But, really the best fix, which I'd love to do at some point, is to fix our merge policy so that insanely large merges are done w/ fewer segments (eg only 2 segments at once). I think BalancedMP does this.
        Hide
        Shai Erera added a comment -

        Patch fixes some jdocs + adds a separate setting for maxMergeSizeForOptimize which is applied only during optimize().

        Show
        Shai Erera added a comment - Patch fixes some jdocs + adds a separate setting for maxMergeSizeForOptimize which is applied only during optimize().
        Hide
        Simon Willnauer added a comment -

        Patch fixes some jdocs + adds a separate setting for maxMergeSizeForOptimize which is applied only during optimize().

        patch looks good to me! +1 to commit

        thanks shai

        Show
        Simon Willnauer added a comment - Patch fixes some jdocs + adds a separate setting for maxMergeSizeForOptimize which is applied only during optimize(). patch looks good to me! +1 to commit thanks shai
        Hide
        Shai Erera added a comment -

        Thanks Simon !

        Committed revision 1059750 (3x).
        Committed revision 1059751 (trunk).

        Show
        Shai Erera added a comment - Thanks Simon ! Committed revision 1059750 (3x). Committed revision 1059751 (trunk).
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development