Lucene - Core
LUCENE-528

Optimization for IndexWriter.addIndexes()

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      One big performance problem with IndexWriter.addIndexes() is that it has to optimize the index both before and after adding the segments. When you have a very large index, to which you are adding batches of small updates, these calls to optimize make using addIndexes() impossible. It makes parallel updates very frustrating.

      Here is an optimized function that helps out by calling mergeSegments only on the newly added documents. It will try to avoid calling mergeSegments until the end, unless you're adding a lot of documents at once.

      I also have an extensive unit test that verifies that this function works correctly if people are interested. I gave it a different name because it has very different performance characteristics which can make querying take longer.
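
      For context, here is a minimal usage sketch of the batch-update pattern this patch enables (the method name comes from the attached patch; the paths, analyzer choice, and 2.x-era constructors are illustrative assumptions, not part of this issue):

      // Fold several small delta indexes into a large index without the
      // full optimize() that addIndexes() performs before and after.
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/big-index", false),
                                           new StandardAnalyzer(), false);
      Directory[] deltas = {
          FSDirectory.getDirectory("/path/to/delta1", false),
          FSDirectory.getDirectory("/path/to/delta2", false)
      };
      writer.addIndexesNoOptimize(deltas); // merges only what the merge policy requires
      writer.close();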

      Attachments

      1. AddIndexes.patch
        9 kB
        Steven Tamm
      2. AddIndexesNoOptimize.patch
        20 kB
        Ning Li

        Activity

        Ning Li added a comment -

        In an email thread titled "LUCENE-528 and 565", I described a weakness of the proposed solution:

        "I'm totally for a version of addIndexes() where optimize() is not always called. However, with the one proposed in the patch, we could end up with an index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc. while Lucene desires the reverse. Or we could have a sandwich index where: segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of these will occur if you use addIndexesNoOpt() carefully, there should be a more robust merge policy."

        Here is an alternative solution which merges segments so that the docCount of segment i is at least twice as big as the docCount of segment i+1. If we are willing to make it a bit more complicated, we can take merge factor into consideration.

        public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException {
          for (int i = 0; i < dirs.length; i++) {
            SegmentInfos sis = new SegmentInfos(); // read infos from dir
            sis.read(dirs[i]);
            for (int j = 0; j < sis.size(); j++) {
              segmentInfos.addElement(sis.info(j)); // add each info
            }
          }

          int start = 0;
          int docCountFromStart = docCount();

          while (start < segmentInfos.size()) {
            int end;
            int docCountToMerge = 0;

            if (docCountFromStart <= minMergeDocs) {
              // if the total docCount of the remaining segments
              // is <= minMergeDocs, merge all of them
              end = segmentInfos.size() - 1;
              docCountToMerge = docCountFromStart;
            } else {
              // otherwise, merge some segments so that the docCount
              // of these segments is at least half of the remaining
              for (end = start; end < segmentInfos.size(); end++) {
                docCountToMerge += segmentInfos.info(end).docCount;
                if (docCountToMerge >= docCountFromStart / 2) {
                  break;
                }
              }
            }

            mergeSegments(start, end + 1);
            start++;
            docCountFromStart -= docCountToMerge;
          }
        }
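
        As a worked example (doc counts hypothetical): starting from the "sandwich" case [4000, 100, 100, 4000] with minMergeDocs = 1000, docCountFromStart is 8200, so the first pass accumulates 4000 + 100 = 4100 >= 8200/2 and merges segments 0-1; the second pass, on the updated list [4100, 100, 4000] with 4100 docs remaining, accumulates 100 + 4000 = 4100 >= 4100/2 and merges those two. The result is [4100, 4100], where every segment is at least as large as all segments to its right.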

        Ning Li added a comment -

        We want a robust algorithm for the version of addIndexes() which
        does not call optimize().

        The robustness can be expressed as the two invariants guaranteed
        by the merge policy for adding documents (if mergeFactor M does not
        change and segment doc count is not reaching maxMergeDocs), with
        B for maxBufferedDocs and f(n) defined as ceil(log_M(ceil(n/B))):
        1: If i (left) and i+1 (right) are two consecutive segments of doc
        counts x and y, then f(x) >= f(y).
        2: The number of committed segments on the same level (same f(n)) <= M.
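
        For concreteness, here is a small self-contained sketch of the level function f (my own illustration, not code from the patch):

        // f(n) = ceil(log_M(ceil(n/B))): the merge level of a segment with
        // n docs, given mergeFactor M and maxBufferedDocs B; n <= B is level 0.
        static int level(int n, int M, int B) {
          int units = (n + B - 1) / B;      // ceil(n/B)
          int f = 0;
          for (long bound = 1; bound < units; bound *= M) {
            f++;                            // smallest f with M^f >= ceil(n/B)
          }
          return f;
        }

        With M = 10 and B = 10: level(10) = 0, level(11) = level(100) = 1, and level(101) = level(1000) = 2, matching the buckets the merge policy maintains.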

        References are at http://www.gossamer-threads.com/lists/lucene/java-dev/35147,
        LUCENE-565 and LUCENE-672.

        AddIndexes() can be viewed as adding a sequence of segments S to
        a sequence of segments T. Segments in T follow the invariants but
        segments in S may not since they could come from multiple indexes.
        Here is the merge algorithm for addIndexes():

        1. Flush ram segments.

        2. Consider a combined sequence with segments from T followed
        by segments from S (same as current addIndexes()).

        3. Assume the highest level for segments in S is h. Call maybeMergeSegments(),
        but instead of starting w/ lowerBound = -1 and upperBound = maxBufferedDocs,
        start w/ lowerBound = -1 and upperBound = upperBound of level h.
        After this, the invariants are guaranteed except for the last < M segments
        whose levels <= h.

        4. If the invariants hold for the last < M segments whose levels <= h, done.
        Otherwise, simply merge those segments. If the merge results in
        a segment of level <= h, done. Otherwise, it's of level h+1 and call
        maybeMergeSegments() starting w/ upperBound = upperBound of level h+1.
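
        To make the check in step 4 concrete, here is a self-contained sketch (again my own illustration, not code from the patch) that tests both invariants over a left-to-right list of segment doc counts, reusing level() from the sketch above:

        // True iff levels never increase from left to right (invariant 1) and
        // no level is shared by more than M consecutive segments (invariant 2).
        static boolean invariantsHold(int[] docCounts, int M, int B) {
          int sameLevelRun = 1;
          for (int i = 1; i < docCounts.length; i++) {
            int left = level(docCounts[i - 1], M, B);
            int right = level(docCounts[i], M, B);
            if (left < right) return false;          // invariant 1 violated
            sameLevelRun = (left == right) ? sameLevelRun + 1 : 1;
            if (sameLevelRun > M) return false;      // invariant 2 violated
          }
          return true;
        }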

        Suggestions?

        Yonik Seeley added a comment -

        I think you need to ensure that no segments from the source index "S" remain after the call, right?

        Ning Li added a comment -

        > I think you need to ensure that no segments from the source index "S" remain after the call, right?

        Correct. And thanks!

        So in step 4, in the case where the invariants hold for the last < M segments whose levels <= h,
        if some of those < M segments are from S (not merged in step 3), properly copy them over.

        Algorithm looks good?

        This makes me notice a bug in the current addIndexes(Directory[]): segment
        infos in S are added to T's "segmentInfos" upfront, then segments in S are
        merged into T several at a time, and every merge is committed with T's
        "segmentInfos". So if a reader is opened on T while addIndexes() is going
        on, it could see an inconsistent index.
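
        For example (hypothetical timeline): right after S's infos are added, a merge can commit T's segments file while it still lists segments whose files live only in S's directories, so a reader opened on T at that moment would fail to find those segment files.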

        Yonik Seeley added a comment -

        > Algorithm looks good?

        Yep, looks good.

        A possible optimization far down the road for segments that need copying could be to "steal" the segments if rename/mv is supported (if they are both of the same Directory type, on the same partition for FSDirectories, etc), and if the caller indicated it was OK. Rather than try to detect when it's safe, there could be an attachSegments(Directory[]) call.

        Ning Li added a comment -

        I'll submit a patch next week.

        Ning Li added a comment -

        This patch implements addIndexesNoOptimize() following the algorithm described earlier.

        • The patch is based on the latest version from trunk.
        • AddIndexesNoOptimize() is implemented. The algorithm description is included as a comment and the code is commented.
        • The patch includes a test called TestAddIndexesNoOptimize which covers all the code in addIndexesNoOptimize().
        • maybeMergeSegments() was conservative: it checked for more merges only while "upperBound * mergeFactor <= maxMergeDocs". It now checks while "upperBound < maxMergeDocs" (see the sketch after this list).
        • Minor changes in TestIndexWriterMergePolicy to better verify merge invariants.
        • The patch passes all unit tests.
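
        For the maybeMergeSegments() change in the list above, a tiny runnable illustration (my own, variable names hypothetical) of how many merge levels each condition is willing to examine:

        // The old condition stops one level early whenever upperBound is still
        // below maxMergeDocs but upperBound * mergeFactor would exceed it.
        static int levelsChecked(boolean oldCondition, int maxBufferedDocs,
                                 int mergeFactor, int maxMergeDocs) {
          int upperBound = maxBufferedDocs;
          int levels = 0;
          while (oldCondition ? upperBound * mergeFactor <= maxMergeDocs
                              : upperBound < maxMergeDocs) {
            levels++;
            upperBound *= mergeFactor;
          }
          return levels;
        }

        With maxBufferedDocs = 10, mergeFactor = 10, and maxMergeDocs = 50000, the old condition examines 3 levels while the new one examines 4.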

        One more comment on the implementation:

        • When we copy un-merged segments from S in step 4, ideally we would
          simply copy the segment files. However, Directory does not support
          copy yet, and the source and target may each use the compound file
          format or not. So we use mergeSegments() to copy each segment, which
          may change the doc count because deleted docs are garbage collected.
          That case is handled properly.
        Yonik Seeley added a comment -

        Thanks Ning, I just committed this.
        Wish I had had it some time ago:
        http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg10758.html


  People

  • Assignee: Yonik Seeley
  • Reporter: Steven Tamm
  • Votes: 2
  • Watchers: 1
