Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 3.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If an index has multiple segments, this tool allows splitting those segments into separate directories.

      1. mp-splitter-inline.patch
        13 kB
        Uwe Schindler
      2. mp-splitter5.patch
        15 kB
        Koji Sekiguchi
      3. mp-splitter4.patch
        15 kB
        Andrzej Bialecki
      4. mp-splitter3.patch
        14 kB
        Andrzej Bialecki
      5. mp-splitter2.patch
        13 kB
        Andrzej Bialecki
      6. mp-splitter.patch
        13 kB
        Andrzej Bialecki
      7. LUCENE-1959.patch
        8 kB
        Jason Rutherglen
      8. LUCENE-1959.patch
        10 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Thanks Andrzej!

        Show
        Michael McCandless added a comment - Thanks Andrzej!
        Hide
        Andrzej Bialecki added a comment -

        Committed revision 824798.

        Show
        Andrzej Bialecki added a comment - Committed revision 824798.
        Hide
        Andrzej Bialecki added a comment -

        Indeed, thanks for the fix - I'll commit this.

        Show
        Andrzej Bialecki added a comment - Indeed, thanks for the fix - I'll commit this.
        Hide
        Koji Sekiguchi added a comment - - edited

        I added small fix. If we have 13 docs (docid=0,1,2,...,12) and numParts=3, 12th doc is missing with -seq mode. I changed this:

        // above range
        for (int j = hi; j < maxDoc; j++) {
           input.deleteDocument(j);
        }
        

        to:

        // above range
        if( i < numParts - 1 ){
          for (int j = hi; j < maxDoc; j++) {
             input.deleteDocument(j);
          }
        }
        
        Show
        Koji Sekiguchi added a comment - - edited I added small fix. If we have 13 docs (docid=0,1,2,...,12) and numParts=3, 12th doc is missing with -seq mode. I changed this: // above range for ( int j = hi; j < maxDoc; j++) { input.deleteDocument(j); } to: // above range if ( i < numParts - 1 ){ for ( int j = hi; j < maxDoc; j++) { input.deleteDocument(j); } }
        Hide
        Andrzej Bialecki added a comment -

        I moved the files in this patch to contrib/misc and updated the contrib/CHANGES.txt. If there are no objections I'll commit it soon.

        Show
        Andrzej Bialecki added a comment - I moved the files in this patch to contrib/misc and updated the contrib/CHANGES.txt. If there are no objections I'll commit it soon.
        Hide
        Michael McCandless added a comment -

        Good progress! Andrzej, how about you go ahead & commit yourself?

        Show
        Michael McCandless added a comment - Good progress! Andrzej, how about you go ahead & commit yourself?
        Hide
        Andrzej Bialecki added a comment -

        As suggested by Uwe, don't allocate the old deletions bitset if there are no deletions.

        Show
        Andrzej Bialecki added a comment - As suggested by Uwe, don't allocate the old deletions bitset if there are no deletions.
        Hide
        Andrzej Bialecki added a comment -

        The test passed in Eclipse only - "ant test" ran from cmdline didn't pass without this fix, so I suspect my Eclipse is to blame for hiding the problem. Re: lazy allocation of bitset - good point, I'll make this change.

        Show
        Andrzej Bialecki added a comment - The test passed in Eclipse only - "ant test" ran from cmdline didn't pass without this fix, so I suspect my Eclipse is to blame for hiding the problem. Re: lazy allocation of bitset - good point, I'll make this change.
        Hide
        Uwe Schindler added a comment -

        Ah ok, I didn't look into the test failure yesterday (was too late in the evening), I only wanted to make a quick design and if it would generally work.
        But you are right, the numDocs() return value is incorrect, leading to a failure in this test. But as the test pass in your test environment, the assertion in the SegmentMerger seems not important for functionality. So in general my code and your first code would work correct. I do not know how costly the initial building of the BitSet used for the input reader's deleted docs is, but one possibility would be to only build/use the additional bitset, if hasDeletions() on the original index returns true.

        Thanks for clarifying.

        Show
        Uwe Schindler added a comment - Ah ok, I didn't look into the test failure yesterday (was too late in the evening), I only wanted to make a quick design and if it would generally work. But you are right, the numDocs() return value is incorrect, leading to a failure in this test. But as the test pass in your test environment, the assertion in the SegmentMerger seems not important for functionality. So in general my code and your first code would work correct. I do not know how costly the initial building of the BitSet used for the input reader's deleted docs is, but one possibility would be to only build/use the additional bitset, if hasDeletions() on the original index returns true. Thanks for clarifying.
        Hide
        Andrzej Bialecki added a comment -

        Now, the mystery is why this test passed when executed in Eclipse - that assert should've tripped then as well. I remember now why I used bitsets - we actually need to know the number of deleted docs to return proper value in IR.numDocs(), and this value is not easy to calculate without actually doing this intersection. Your version looked more elegant, but it still tripped that assert (for good reason). I fixed my version so that it passes the tests when executed through ant (and it still passes in Eclipse, huh .. ).

        Show
        Andrzej Bialecki added a comment - Now, the mystery is why this test passed when executed in Eclipse - that assert should've tripped then as well. I remember now why I used bitsets - we actually need to know the number of deleted docs to return proper value in IR.numDocs(), and this value is not easy to calculate without actually doing this intersection. Your version looked more elegant, but it still tripped that assert (for good reason). I fixed my version so that it passes the tests when executed through ant (and it still passes in Eclipse, huh .. ).
        Hide
        Jason Rutherglen added a comment -

        I'm using the IndexSplitter to divide a 100GB index into roughly equals parts and deploying into production. Then will clean up the patch.

        I'm not sure why we'd want to use FSDir to copy files as the input parameters are filesystem paths?

        Show
        Jason Rutherglen added a comment - I'm using the IndexSplitter to divide a 100GB index into roughly equals parts and deploying into production. Then will clean up the patch. I'm not sure why we'd want to use FSDir to copy files as the input parameters are filesystem paths?
        Hide
        Uwe Schindler added a comment -

        Here is my inline version without OpenBitSet. The test results are the same (does not pass at same position), but shows, what I meant.

        Patch contains contrib/misc path prefix.

        Show
        Uwe Schindler added a comment - Here is my inline version without OpenBitSet. The test results are the same (does not pass at same position), but shows, what I meant. Patch contains contrib/misc path prefix.
        Hide
        Uwe Schindler added a comment -

        Test fails here (I applied the patch to contrib/misc, but that should be no difference):

            [junit]
            [junit] Testsuite: org.apache.lucene.index.TestMultiPassIndexSplitter
            [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 1,11 sec
            [junit]
            [junit] ------------- Standard Error -----------------
            [junit] Writing part 1 ...
            [junit] Writing part 1 ...
            [junit] ------------- ---------------- ---------------
            [junit] Testcase: testSplitRR(org.apache.lucene.index.TestMultiPassIndexSpli
        tter):  Caused an ERROR
            [junit] null
            [junit] java.lang.AssertionError
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM
        erger.java:600)
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge
        r.java:571)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:152)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:128)
            [junit]     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja
        va:3367)
            [junit]     at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas
        sIndexSplitter.java:92)
            [junit]     at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitR
        R(TestMultiPassIndexSplitter.java:60)
            [junit]
            [junit]
            [junit] Testcase: testSplitSeq(org.apache.lucene.index.TestMultiPassIndexSpl
        itter): Caused an ERROR
            [junit] null
            [junit] java.lang.AssertionError
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM
        erger.java:600)
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge
        r.java:571)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:152)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:128)
            [junit]     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja
        va:3367)
            [junit]     at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas
        sIndexSplitter.java:92)
            [junit]     at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitS
        eq(TestMultiPassIndexSplitter.java:102)
            [junit]
            [junit]
            [junit] Test org.apache.lucene.index.TestMultiPassIndexSplitter FAILED
            [junit] Testsuite: org.apache.lucene.index.TestTermVectorAccessor
            [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1,079 sec
        
        Show
        Uwe Schindler added a comment - Test fails here (I applied the patch to contrib/misc, but that should be no difference): [junit] [junit] Testsuite: org.apache.lucene.index.TestMultiPassIndexSplitter [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 1,11 sec [junit] [junit] ------------- Standard Error ----------------- [junit] Writing part 1 ... [junit] Writing part 1 ... [junit] ------------- ---------------- --------------- [junit] Testcase: testSplitRR(org.apache.lucene.index.TestMultiPassIndexSpli tter): Caused an ERROR [junit] null [junit] java.lang.AssertionError [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM erger.java:600) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge r.java:571) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:152) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:128) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja va:3367) [junit] at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas sIndexSplitter.java:92) [junit] at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitR R(TestMultiPassIndexSplitter.java:60) [junit] [junit] [junit] Testcase: testSplitSeq(org.apache.lucene.index.TestMultiPassIndexSpl itter): Caused an ERROR [junit] null [junit] java.lang.AssertionError [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM erger.java:600) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge r.java:571) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:152) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:128) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja va:3367) [junit] at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas sIndexSplitter.java:92) [junit] at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitS eq(TestMultiPassIndexSplitter.java:102) [junit] [junit] [junit] Test org.apache.lucene.index.TestMultiPassIndexSplitter FAILED [junit] Testsuite: org.apache.lucene.index.TestTermVectorAccessor [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1,079 sec
        Hide
        Uwe Schindler added a comment -

        Small optimization for this one: You even do not need a bitset or explicite delete/undelete operations, it can be done inline. Just put the logic into isDeleted() [e.g. modulo or range comparison] and let the TermPositions also check isDeleted().

        Show
        Uwe Schindler added a comment - Small optimization for this one: You even do not need a bitset or explicite delete/undelete operations, it can be done inline. Just put the logic into isDeleted() [e.g. modulo or range comparison] and let the TermPositions also check isDeleted().
        Hide
        Michael McCandless added a comment -

        Excellent!

        Show
        Michael McCandless added a comment - Excellent!
        Hide
        Uwe Schindler added a comment -

        Really cool!

        Show
        Uwe Schindler added a comment - Really cool!
        Hide
        Mark Miller added a comment -

        Nice! Lets add it to the mix - I'm guessing Jason's is quite a bit faster for splitting segs and this one nicer in that it can split indivd segs. Do we keep two tools or merge them into one with options?

        Show
        Mark Miller added a comment - Nice! Lets add it to the mix - I'm guessing Jason's is quite a bit faster for splitting segs and this one nicer in that it can split indivd segs. Do we keep two tools or merge them into one with options?
        Hide
        Andrzej Bialecki added a comment -

        Here's my submission to the index splitting race This version implements the multi-pass method that uses loops of delete/addIndexes/undelete.

        Show
        Andrzej Bialecki added a comment - Here's my submission to the index splitting race This version implements the multi-pass method that uses loops of delete/addIndexes/undelete.
        Hide
        Michael McCandless added a comment -

        Thanks Jason!

        Show
        Michael McCandless added a comment - Thanks Jason!
        Hide
        Michael McCandless added a comment -

        small opt - you might switch it to reuse the buffer between files.

        OK I just committed that!

        Show
        Michael McCandless added a comment - small opt - you might switch it to reuse the buffer between files. OK I just committed that!
        Hide
        Mark Miller added a comment -

        small opt - you might switch it to reuse the buffer between files.

        Show
        Mark Miller added a comment - small opt - you might switch it to reuse the buffer between files.
        Hide
        Michael McCandless added a comment -

        New patch attached: move to contrib/misc, renamed TestFileSplitter -> TestIndexSplitter, added javadocs noting the limitations, added CHANGES entry. I'll commit shortly.

        Show
        Michael McCandless added a comment - New patch attached: move to contrib/misc, renamed TestFileSplitter -> TestIndexSplitter, added javadocs noting the limitations, added CHANGES entry. I'll commit shortly.
        Hide
        Michael McCandless added a comment -

        No reason not to start somewhere though.

        +1

        Progress not perfection!

        Show
        Michael McCandless added a comment - No reason not to start somewhere though. +1 Progress not perfection!
        Hide
        Mark Miller added a comment -

        So I guess I'm -0 on this index splitting method, because I think we can do it better.

        Improvements welcome No reason not to start somewhere though.

        Show
        Mark Miller added a comment - So I guess I'm -0 on this index splitting method, because I think we can do it better. Improvements welcome No reason not to start somewhere though.
        Hide
        Mark Miller added a comment -

        To copy the files it should use the directory abstraction lay and not use java.io directly.

        I'd use Channels instead - generally much faster.

        Show
        Mark Miller added a comment - To copy the files it should use the directory abstraction lay and not use java.io directly. I'd use Channels instead - generally much faster.
        Hide
        Michael McCandless added a comment -

        I would put it into contrib

        +1, I'll do that.

        To copy the files it should use the directory abstraction lay and not use java.io directly.

        I agree, that'd be nice, but I don't think really necessary before committing... it can be another future improvement. But, we should not the limitations of the tool; I'll add javadocs.

        Jason do you want to address any of these issues now (before committing to contrib)?

        Show
        Michael McCandless added a comment - I would put it into contrib +1, I'll do that. To copy the files it should use the directory abstraction lay and not use java.io directly. I agree, that'd be nice, but I don't think really necessary before committing... it can be another future improvement. But, we should not the limitations of the tool; I'll add javadocs. Jason do you want to address any of these issues now (before committing to contrib)?
        Hide
        Uwe Schindler added a comment - - edited

        I would put it into contrib (misc next to IndexNormModifier which is also command line), as it is a utility tool. I see no real reason to have it in core. We have then all flexibility to change and optimize it, as Andrzey suggested.

        One thing against this tool in its current form: To copy the files it should use the directory abstraction lay and not use java.io directly. So open IndexInput/IndexOutput to copy the files.

        Show
        Uwe Schindler added a comment - - edited I would put it into contrib (misc next to IndexNormModifier which is also command line), as it is a utility tool. I see no real reason to have it in core. We have then all flexibility to change and optimize it, as Andrzey suggested. One thing against this tool in its current form: To copy the files it should use the directory abstraction lay and not use java.io directly. So open IndexInput/IndexOutput to copy the files.
        Hide
        Andrzej Bialecki added a comment -

        I'm of a split mind about this splitter in the sense that I'm not sure how useful it is - if your input is an optimized index then it has just 1 segment, so this tool won't be able to split it, right?

        AFAIK a similar functionality can be implemented also using two other methods that would work on indexes with any number of segments: one method is trivial, based on a "delete/IndexWriter.addIndexes/undeletAll" loop that requires multiple passes over input data, the other would use the same method as SegmentMerger uses, i.e. working with FieldsWriter, FormatPostings*Consumer, TermVectorsWriter, etc. for a single-pass splitting.

        So I guess I'm -0 on this index splitting method, because I think we can do it better.

        Show
        Andrzej Bialecki added a comment - I'm of a split mind about this splitter in the sense that I'm not sure how useful it is - if your input is an optimized index then it has just 1 segment, so this tool won't be able to split it, right? AFAIK a similar functionality can be implemented also using two other methods that would work on indexes with any number of segments: one method is trivial, based on a "delete/IndexWriter.addIndexes/undeletAll" loop that requires multiple passes over input data, the other would use the same method as SegmentMerger uses, i.e. working with FieldsWriter, FormatPostings*Consumer, TermVectorsWriter, etc. for a single-pass splitting. So I guess I'm -0 on this index splitting method, because I think we can do it better.
        Hide
        Michael McCandless added a comment -

        Looks great, thanks Jason! I just tweaked the javadoc to this:

        /**

        • Command-line tool that enables listing segments in an
        • index, copying specific segments to another index, and
        • deleting segments from an index.
          *
        • <p><b>NOTE</b>: The tool is experimental and might change
        • in incompatible ways in the next release. You can easily
        • accidentally remove segments from your index so be
        • careful!
          */

        My inclination would be to commit this today (ie for 3.0), since it's such an isolated change, but we have said that 3.0 would only turnaround removal of deprecated APIs, cutover to Java 1.5 features, and bug fixes, so if anyone objects to my committing this for 3.0, please speak up soon!

        Show
        Michael McCandless added a comment - Looks great, thanks Jason! I just tweaked the javadoc to this: /** Command-line tool that enables listing segments in an index, copying specific segments to another index, and deleting segments from an index. * <p><b>NOTE</b>: The tool is experimental and might change in incompatible ways in the next release. You can easily accidentally remove segments from your index so be careful! */ My inclination would be to commit this today (ie for 3.0), since it's such an isolated change, but we have said that 3.0 would only turnaround removal of deprecated APIs, cutover to Java 1.5 features, and bug fixes, so if anyone objects to my committing this for 3.0, please speak up soon!
        Hide
        Jason Rutherglen added a comment -

        First cut of the index splitter which allows listing segments, copying segments to a new directory, and removing segments from a directory.

        Show
        Jason Rutherglen added a comment - First cut of the index splitter which allows listing segments, copying segments to a new directory, and removing segments from a directory.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Jason Rutherglen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development