Lucene - Core
  1. Lucene - Core
  2. LUCENE-1634

LogMergePolicy should use the number of deleted docs when deciding which segments to merge

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.

      I created an index with 1 million docs, then went over all docs and updated a few thousand at a time. I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.

      This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

      1. LUCENE-1634.patch
        2 kB
        Yasuhiro Matsuda
      2. LUCENE-1634.patch
        4 kB
        Yasuhiro Matsuda

        Activity

        Hide
        Yasuhiro Matsuda added a comment -

        I posted a patch.

        Show
        Yasuhiro Matsuda added a comment - I posted a patch.
        Hide
        Michael McCandless added a comment -

        Actually, optimize() always merges all segments down to 1,
        irrespective of deletes. I think you're referring to "normal" merges?

        I think this approach is reasonable and a good step forward. Can you
        make this behaviour get/settable? I think we should default to the
        old behaviour until 3.0, and then switch it to default to the new one
        at 3.0 (to preserve back compat).

        Thinking more about this... I think we can further improve how
        Log*MergePolicy takes deletes into account. Ie, why not explicitly
        measure the deletions and bias merge selection to favor merging away
        segments that have the most deletions?

        This might require relaxing the merge policy so that it's allowed to
        pick fewer than mergeFactor segments to merge at once; perhaps it's
        given a min/max mergeFactor.

        Likely such a change should be a new merge policy...

        Show
        Michael McCandless added a comment - Actually, optimize() always merges all segments down to 1, irrespective of deletes. I think you're referring to "normal" merges? I think this approach is reasonable and a good step forward. Can you make this behaviour get/settable? I think we should default to the old behaviour until 3.0, and then switch it to default to the new one at 3.0 (to preserve back compat). Thinking more about this... I think we can further improve how Log*MergePolicy takes deletes into account. Ie, why not explicitly measure the deletions and bias merge selection to favor merging away segments that have the most deletions? This might require relaxing the merge policy so that it's allowed to pick fewer than mergeFactor segments to merge at once; perhaps it's given a min/max mergeFactor. Likely such a change should be a new merge policy...
        Hide
        John Wang added a comment -

        This is actually referring to the optimize(int) call, which selectively merges segments to make sure the total segments in the index is less or equal to the specified number.

        Show
        John Wang added a comment - This is actually referring to the optimize(int) call, which selectively merges segments to make sure the total segments in the index is less or equal to the specified number.
        Hide
        John Wang added a comment -

        Comment on implementing a custom merge policy:
        As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy. And one cannot subclass LogMergePolicy without injecting the class into the org.apache.lucene.index package, (because the api signature: size(org.apache.lucene.index.SegmentInfo info), SegmentInfo is not an exposed API.

        Show
        John Wang added a comment - Comment on implementing a custom merge policy: As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy. And one cannot subclass LogMergePolicy without injecting the class into the org.apache.lucene.index package, (because the api signature: size(org.apache.lucene.index.SegmentInfo info), SegmentInfo is not an exposed API.
        Hide
        Michael McCandless added a comment -

        This is actually referring to the optimize(int) call

        Ahh, woops – sorry, I missed that the first time around. But I don't think the patch addresses how optimize(int) selects its merges?

        Show
        Michael McCandless added a comment - This is actually referring to the optimize(int) call Ahh, woops – sorry, I missed that the first time around. But I don't think the patch addresses how optimize(int) selects its merges?
        Hide
        Michael McCandless added a comment -

        As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy.

        Hmm – that's not intended. What goes wrong if you use a custom MergePolicy that's not a LogMergePolicy?

        (There are utility methods on IndexWriter, eg set/getUseCompoundFile, etc., that will only work if your MergePolicy is a LogMergePolicy, but that shouldn't prevent you from using a non-LogMergePolicy as long as you don't use these utility methods).

        Show
        Michael McCandless added a comment - As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy. Hmm – that's not intended. What goes wrong if you use a custom MergePolicy that's not a LogMergePolicy? (There are utility methods on IndexWriter, eg set/getUseCompoundFile, etc., that will only work if your MergePolicy is a LogMergePolicy, but that shouldn't prevent you from using a non-LogMergePolicy as long as you don't use these utility methods).
        Hide
        John Wang added a comment -

        The current lucene implementation, optimize(int) selects segments to merge based on the file size of the segment file: say the index has 10 segments, and optmize(6) is called, Lucene finds 4 smallest segments by number of bytes in the segment files.

        This selection criteria is flawed because you can have a very large segment in terms of bytes but very small in terms of numDocs (if many deleted docs). Having these segment files around impacts performance considerably.

        This is what this patch is trying to fix this in a non-intrusive manner by extending the LogMergePolicy and by normalizing the calculation of the segment size including the delete count.

        Show
        John Wang added a comment - The current lucene implementation, optimize(int) selects segments to merge based on the file size of the segment file: say the index has 10 segments, and optmize(6) is called, Lucene finds 4 smallest segments by number of bytes in the segment files. This selection criteria is flawed because you can have a very large segment in terms of bytes but very small in terms of numDocs (if many deleted docs). Having these segment files around impacts performance considerably. This is what this patch is trying to fix this in a non-intrusive manner by extending the LogMergePolicy and by normalizing the calculation of the segment size including the delete count.
        Hide
        Michael McCandless added a comment -

        say the index has 10 segments, and optmize(6) is called, Lucene finds 4 smallest segments by number of bytes in the segment files.

        Oh yeah, you're right: for the last (partial) merge, it uses the size() of the segments to pick the merge that's minimal total size.

        So let's proceed with this patch, once you've added setter/getter.

        I'd still love to see a merge policy that does a better job explicitly "targeting" segments with lots of deletions.

        Show
        Michael McCandless added a comment - say the index has 10 segments, and optmize(6) is called, Lucene finds 4 smallest segments by number of bytes in the segment files. Oh yeah, you're right: for the last (partial) merge, it uses the size() of the segments to pick the merge that's minimal total size. So let's proceed with this patch, once you've added setter/getter. I'd still love to see a merge policy that does a better job explicitly "targeting" segments with lots of deletions.
        Hide
        John Wang added a comment -

        RE: implementing custom MergePolicy
        Let me describe in detail on problems of implementing a custom MergePolicy:

        1) In IndexWriter code, such methods on MergePolicy is called, e.g. findMergesForOptimize. I believe that is the contract for implementing your own MergePolicy. However, it is "hidden" by the javadoc in terms of documentation, and furthermore, it is hidden because these methods are package protected. So to implement your own MergePolicy, you have to resort back to sneaking the class into the package.

        2) Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy. (Just to clarify, useCompoundFile is a package-level protected method on the base MergePolicy class, so my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy.

        This brings up another issue about the practice of having to "sneak" classes into a package. We are looking at making our Lucene code, OSGI compliant, and this becomes an issue because we cannot have multiple "bundles" exporting the same package. Which means, I would have to repackage lucene to include my classes that I have snuck into some lucene packages. I would like to use a standard distribution of a lucene jar (as suggested/echoed by some luceners).

        Show
        John Wang added a comment - RE: implementing custom MergePolicy Let me describe in detail on problems of implementing a custom MergePolicy: 1) In IndexWriter code, such methods on MergePolicy is called, e.g. findMergesForOptimize. I believe that is the contract for implementing your own MergePolicy. However, it is "hidden" by the javadoc in terms of documentation, and furthermore, it is hidden because these methods are package protected. So to implement your own MergePolicy, you have to resort back to sneaking the class into the package. 2) Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy. (Just to clarify, useCompoundFile is a package-level protected method on the base MergePolicy class, so my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy. This brings up another issue about the practice of having to "sneak" classes into a package. We are looking at making our Lucene code, OSGI compliant, and this becomes an issue because we cannot have multiple "bundles" exporting the same package. Which means, I would have to repackage lucene to include my classes that I have snuck into some lucene packages. I would like to use a standard distribution of a lucene jar (as suggested/echoed by some luceners).
        Hide
        John Wang added a comment -

        >> So let's proceed with this patch, once you've added setter/getter.
        Can you please elaborate on this? Add setter/getter on what? The number of target segment is already an input parameter? do you mean some sort of normalization factor on the how much to "punish" segments with high deleted docs?

        Show
        John Wang added a comment - >> So let's proceed with this patch, once you've added setter/getter. Can you please elaborate on this? Add setter/getter on what? The number of target segment is already an input parameter? do you mean some sort of normalization factor on the how much to "punish" segments with high deleted docs?
        Hide
        Michael McCandless added a comment -

        I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy.

        Show
        Michael McCandless added a comment - I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy.
        Hide
        Michael McCandless added a comment -

        So to implement your own MergePolicy, you have to resort back to sneaking the class into the package.

        Right, this is currently necessary for a custom MergePolicy/Scheduler. It's been discussed before:

        http://www.nabble.com/MergePolicy-public-but-SegmentInfos-package-protected--tt22687527.html

        I suppose since merge selection needs so little info about a segment, we could make a public thine wrapper/veneer that exposes a limited number of things. Or maybe we go whole hog and simply make SegmentInfos/SegmentInfo public.

        it is hidden because these methods are package protected

        If we could javadoc certain package protected classes, that'd give you the javadocs at least. Or we should simply make these methods public?

        Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy.

        But the notion of a mergeFactor is very much a LogMergePolicy specific thing. Other merge policies might not limit themselves to always merging mergeFactor segments at once. These are convenience methods on IndexWriter (that simply forward the request to the MergePolicy).

        my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy

        I think that'd make sense. (I can't remember exactly why, but way back when, I think there was some reason for not doing so...)

        Show
        Michael McCandless added a comment - So to implement your own MergePolicy, you have to resort back to sneaking the class into the package. Right, this is currently necessary for a custom MergePolicy/Scheduler. It's been discussed before: http://www.nabble.com/MergePolicy-public-but-SegmentInfos-package-protected--tt22687527.html I suppose since merge selection needs so little info about a segment, we could make a public thine wrapper/veneer that exposes a limited number of things. Or maybe we go whole hog and simply make SegmentInfos/SegmentInfo public. it is hidden because these methods are package protected If we could javadoc certain package protected classes, that'd give you the javadocs at least. Or we should simply make these methods public? Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy. But the notion of a mergeFactor is very much a LogMergePolicy specific thing. Other merge policies might not limit themselves to always merging mergeFactor segments at once. These are convenience methods on IndexWriter (that simply forward the request to the MergePolicy). my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy I think that'd make sense. (I can't remember exactly why, but way back when, I think there was some reason for not doing so...)
        Hide
        John Wang added a comment -

        >>I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy.
        Makes sense. What do you suggest the default behavior to be?
        Also, do you think setter/getter is the right approach, since this is very much hidden to the API, e.g. one would have to do this:

        LogMergePolicy policy = (LogMergePolicy)idxWriter.getMergePolicy();
        policy.setTurnOnSegmentCalcWithDeletes(true);

        DO you think instead, we can just add static setter/getter on LogMergePolicy class?

        Show
        John Wang added a comment - >>I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy. Makes sense. What do you suggest the default behavior to be? Also, do you think setter/getter is the right approach, since this is very much hidden to the API, e.g. one would have to do this: LogMergePolicy policy = (LogMergePolicy)idxWriter.getMergePolicy(); policy.setTurnOnSegmentCalcWithDeletes(true); DO you think instead, we can just add static setter/getter on LogMergePolicy class?
        Hide
        Michael McCandless added a comment -

        What do you suggest the default behavior to be?

        I think default to no change, and in 3.0 we flip it?

        DO you think instead, we can just add static setter/getter on LogMergePolicy class?

        I'd lean towards non-static setter/getters.

        Show
        Michael McCandless added a comment - What do you suggest the default behavior to be? I think default to no change, and in 3.0 we flip it? DO you think instead, we can just add static setter/getter on LogMergePolicy class? I'd lean towards non-static setter/getters.
        Hide
        Yasuhiro Matsuda added a comment -

        I added setCalibrateSizeByDeletes/getCalibrateSizeByDeletes. Sorry that the method names are a bit too wordy.

        Show
        Yasuhiro Matsuda added a comment - I added setCalibrateSizeByDeletes/getCalibrateSizeByDeletes. Sorry that the method names are a bit too wordy.
        Hide
        Michael McCandless added a comment -

        Patch looks good, thanks Yasuhiro. I plan to commit in a day or two.

        Show
        Michael McCandless added a comment - Patch looks good, thanks Yasuhiro. I plan to commit in a day or two.
        Hide
        Michael McCandless added a comment -

        Thanks Yasuhiro!

        Show
        Michael McCandless added a comment - Thanks Yasuhiro!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Yasuhiro Matsuda
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development