Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 3.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If an index has multiple segments, this tool allows splitting those segments into separate directories.

      1. mp-splitter5.patch
        15 kB
        Koji Sekiguchi
      2. mp-splitter4.patch
        15 kB
        Andrzej Bialecki
      3. mp-splitter3.patch
        14 kB
        Andrzej Bialecki
      4. mp-splitter2.patch
        13 kB
        Andrzej Bialecki
      5. mp-splitter-inline.patch
        13 kB
        Uwe Schindler
      6. mp-splitter.patch
        13 kB
        Andrzej Bialecki
      7. LUCENE-1959.patch
        10 kB
        Michael McCandless
      8. LUCENE-1959.patch
        8 kB
        Jason Rutherglen

        Activity

        Hide
        jasonrutherglen Jason Rutherglen added a comment -

        First cut of the index splitter which allows listing segments, copying segments to a new directory, and removing segments from a directory.

        Show
        jasonrutherglen Jason Rutherglen added a comment - First cut of the index splitter which allows listing segments, copying segments to a new directory, and removing segments from a directory.
        Hide
        mikemccand Michael McCandless added a comment -

        Looks great, thanks Jason! I just tweaked the javadoc to this:

        /**

        • Command-line tool that enables listing segments in an
        • index, copying specific segments to another index, and
        • deleting segments from an index.
          *
        • <p><b>NOTE</b>: The tool is experimental and might change
        • in incompatible ways in the next release. You can easily
        • accidentally remove segments from your index so be
        • careful!
          */

        My inclination would be to commit this today (ie for 3.0), since it's such an isolated change, but we have said that 3.0 would only turnaround removal of deprecated APIs, cutover to Java 1.5 features, and bug fixes, so if anyone objects to my committing this for 3.0, please speak up soon!

        Show
        mikemccand Michael McCandless added a comment - Looks great, thanks Jason! I just tweaked the javadoc to this: /** Command-line tool that enables listing segments in an index, copying specific segments to another index, and deleting segments from an index. * <p><b>NOTE</b>: The tool is experimental and might change in incompatible ways in the next release. You can easily accidentally remove segments from your index so be careful! */ My inclination would be to commit this today (ie for 3.0), since it's such an isolated change, but we have said that 3.0 would only turnaround removal of deprecated APIs, cutover to Java 1.5 features, and bug fixes, so if anyone objects to my committing this for 3.0, please speak up soon!
        Hide
        ab Andrzej Bialecki added a comment -

        I'm of a split mind about this splitter in the sense that I'm not sure how useful it is - if your input is an optimized index then it has just 1 segment, so this tool won't be able to split it, right?

        AFAIK a similar functionality can be implemented also using two other methods that would work on indexes with any number of segments: one method is trivial, based on a "delete/IndexWriter.addIndexes/undeletAll" loop that requires multiple passes over input data, the other would use the same method as SegmentMerger uses, i.e. working with FieldsWriter, FormatPostings*Consumer, TermVectorsWriter, etc. for a single-pass splitting.

        So I guess I'm -0 on this index splitting method, because I think we can do it better.

        Show
        ab Andrzej Bialecki added a comment - I'm of a split mind about this splitter in the sense that I'm not sure how useful it is - if your input is an optimized index then it has just 1 segment, so this tool won't be able to split it, right? AFAIK a similar functionality can be implemented also using two other methods that would work on indexes with any number of segments: one method is trivial, based on a "delete/IndexWriter.addIndexes/undeletAll" loop that requires multiple passes over input data, the other would use the same method as SegmentMerger uses, i.e. working with FieldsWriter, FormatPostings*Consumer, TermVectorsWriter, etc. for a single-pass splitting. So I guess I'm -0 on this index splitting method, because I think we can do it better.
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        I would put it into contrib (misc next to IndexNormModifier which is also command line), as it is a utility tool. I see no real reason to have it in core. We have then all flexibility to change and optimize it, as Andrzey suggested.

        One thing against this tool in its current form: To copy the files it should use the directory abstraction lay and not use java.io directly. So open IndexInput/IndexOutput to copy the files.

        Show
        thetaphi Uwe Schindler added a comment - - edited I would put it into contrib (misc next to IndexNormModifier which is also command line), as it is a utility tool. I see no real reason to have it in core. We have then all flexibility to change and optimize it, as Andrzey suggested. One thing against this tool in its current form: To copy the files it should use the directory abstraction lay and not use java.io directly. So open IndexInput/IndexOutput to copy the files.
        Hide
        mikemccand Michael McCandless added a comment -

        I would put it into contrib

        +1, I'll do that.

        To copy the files it should use the directory abstraction lay and not use java.io directly.

        I agree, that'd be nice, but I don't think really necessary before committing... it can be another future improvement. But, we should not the limitations of the tool; I'll add javadocs.

        Jason do you want to address any of these issues now (before committing to contrib)?

        Show
        mikemccand Michael McCandless added a comment - I would put it into contrib +1, I'll do that. To copy the files it should use the directory abstraction lay and not use java.io directly. I agree, that'd be nice, but I don't think really necessary before committing... it can be another future improvement. But, we should not the limitations of the tool; I'll add javadocs. Jason do you want to address any of these issues now (before committing to contrib)?
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        To copy the files it should use the directory abstraction lay and not use java.io directly.

        I'd use Channels instead - generally much faster.

        Show
        markrmiller@gmail.com Mark Miller added a comment - To copy the files it should use the directory abstraction lay and not use java.io directly. I'd use Channels instead - generally much faster.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        So I guess I'm -0 on this index splitting method, because I think we can do it better.

        Improvements welcome No reason not to start somewhere though.

        Show
        markrmiller@gmail.com Mark Miller added a comment - So I guess I'm -0 on this index splitting method, because I think we can do it better. Improvements welcome No reason not to start somewhere though.
        Hide
        mikemccand Michael McCandless added a comment -

        No reason not to start somewhere though.

        +1

        Progress not perfection!

        Show
        mikemccand Michael McCandless added a comment - No reason not to start somewhere though. +1 Progress not perfection!
        Hide
        mikemccand Michael McCandless added a comment -

        New patch attached: move to contrib/misc, renamed TestFileSplitter -> TestIndexSplitter, added javadocs noting the limitations, added CHANGES entry. I'll commit shortly.

        Show
        mikemccand Michael McCandless added a comment - New patch attached: move to contrib/misc, renamed TestFileSplitter -> TestIndexSplitter, added javadocs noting the limitations, added CHANGES entry. I'll commit shortly.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        small opt - you might switch it to reuse the buffer between files.

        Show
        markrmiller@gmail.com Mark Miller added a comment - small opt - you might switch it to reuse the buffer between files.
        Hide
        mikemccand Michael McCandless added a comment -

        small opt - you might switch it to reuse the buffer between files.

        OK I just committed that!

        Show
        mikemccand Michael McCandless added a comment - small opt - you might switch it to reuse the buffer between files. OK I just committed that!
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Jason!

        Show
        mikemccand Michael McCandless added a comment - Thanks Jason!
        Hide
        ab Andrzej Bialecki added a comment -

        Here's my submission to the index splitting race This version implements the multi-pass method that uses loops of delete/addIndexes/undelete.

        Show
        ab Andrzej Bialecki added a comment - Here's my submission to the index splitting race This version implements the multi-pass method that uses loops of delete/addIndexes/undelete.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Nice! Lets add it to the mix - I'm guessing Jason's is quite a bit faster for splitting segs and this one nicer in that it can split indivd segs. Do we keep two tools or merge them into one with options?

        Show
        markrmiller@gmail.com Mark Miller added a comment - Nice! Lets add it to the mix - I'm guessing Jason's is quite a bit faster for splitting segs and this one nicer in that it can split indivd segs. Do we keep two tools or merge them into one with options?
        Hide
        thetaphi Uwe Schindler added a comment -

        Really cool!

        Show
        thetaphi Uwe Schindler added a comment - Really cool!
        Hide
        mikemccand Michael McCandless added a comment -

        Excellent!

        Show
        mikemccand Michael McCandless added a comment - Excellent!
        Hide
        thetaphi Uwe Schindler added a comment -

        Small optimization for this one: You even do not need a bitset or explicite delete/undelete operations, it can be done inline. Just put the logic into isDeleted() [e.g. modulo or range comparison] and let the TermPositions also check isDeleted().

        Show
        thetaphi Uwe Schindler added a comment - Small optimization for this one: You even do not need a bitset or explicite delete/undelete operations, it can be done inline. Just put the logic into isDeleted() [e.g. modulo or range comparison] and let the TermPositions also check isDeleted().
        Hide
        thetaphi Uwe Schindler added a comment -

        Test fails here (I applied the patch to contrib/misc, but that should be no difference):

            [junit]
            [junit] Testsuite: org.apache.lucene.index.TestMultiPassIndexSplitter
            [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 1,11 sec
            [junit]
            [junit] ------------- Standard Error -----------------
            [junit] Writing part 1 ...
            [junit] Writing part 1 ...
            [junit] ------------- ---------------- ---------------
            [junit] Testcase: testSplitRR(org.apache.lucene.index.TestMultiPassIndexSpli
        tter):  Caused an ERROR
            [junit] null
            [junit] java.lang.AssertionError
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM
        erger.java:600)
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge
        r.java:571)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:152)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:128)
            [junit]     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja
        va:3367)
            [junit]     at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas
        sIndexSplitter.java:92)
            [junit]     at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitR
        R(TestMultiPassIndexSplitter.java:60)
            [junit]
            [junit]
            [junit] Testcase: testSplitSeq(org.apache.lucene.index.TestMultiPassIndexSpl
        itter): Caused an ERROR
            [junit] null
            [junit] java.lang.AssertionError
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM
        erger.java:600)
            [junit]     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge
        r.java:571)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:152)
            [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav
        a:128)
            [junit]     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja
        va:3367)
            [junit]     at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas
        sIndexSplitter.java:92)
            [junit]     at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitS
        eq(TestMultiPassIndexSplitter.java:102)
            [junit]
            [junit]
            [junit] Test org.apache.lucene.index.TestMultiPassIndexSplitter FAILED
            [junit] Testsuite: org.apache.lucene.index.TestTermVectorAccessor
            [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1,079 sec
        
        Show
        thetaphi Uwe Schindler added a comment - Test fails here (I applied the patch to contrib/misc, but that should be no difference): [junit] [junit] Testsuite: org.apache.lucene.index.TestMultiPassIndexSplitter [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 1,11 sec [junit] [junit] ------------- Standard Error ----------------- [junit] Writing part 1 ... [junit] Writing part 1 ... [junit] ------------- ---------------- --------------- [junit] Testcase: testSplitRR(org.apache.lucene.index.TestMultiPassIndexSpli tter): Caused an ERROR [junit] null [junit] java.lang.AssertionError [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM erger.java:600) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge r.java:571) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:152) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:128) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja va:3367) [junit] at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas sIndexSplitter.java:92) [junit] at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitR R(TestMultiPassIndexSplitter.java:60) [junit] [junit] [junit] Testcase: testSplitSeq(org.apache.lucene.index.TestMultiPassIndexSpl itter): Caused an ERROR [junit] null [junit] java.lang.AssertionError [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentM erger.java:600) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerge r.java:571) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:152) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.jav a:128) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.ja va:3367) [junit] at org.apache.lucene.index.MultiPassIndexSplitter.split(MultiPas sIndexSplitter.java:92) [junit] at org.apache.lucene.index.TestMultiPassIndexSplitter.testSplitS eq(TestMultiPassIndexSplitter.java:102) [junit] [junit] [junit] Test org.apache.lucene.index.TestMultiPassIndexSplitter FAILED [junit] Testsuite: org.apache.lucene.index.TestTermVectorAccessor [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1,079 sec
        Hide
        thetaphi Uwe Schindler added a comment -

        Here is my inline version without OpenBitSet. The test results are the same (does not pass at same position), but shows, what I meant.

        Patch contains contrib/misc path prefix.

        Show
        thetaphi Uwe Schindler added a comment - Here is my inline version without OpenBitSet. The test results are the same (does not pass at same position), but shows, what I meant. Patch contains contrib/misc path prefix.
        Hide
        jasonrutherglen Jason Rutherglen added a comment -

        I'm using the IndexSplitter to divide a 100GB index into roughly equals parts and deploying into production. Then will clean up the patch.

        I'm not sure why we'd want to use FSDir to copy files as the input parameters are filesystem paths?

        Show
        jasonrutherglen Jason Rutherglen added a comment - I'm using the IndexSplitter to divide a 100GB index into roughly equals parts and deploying into production. Then will clean up the patch. I'm not sure why we'd want to use FSDir to copy files as the input parameters are filesystem paths?
        Hide
        ab Andrzej Bialecki added a comment -

        Now, the mystery is why this test passed when executed in Eclipse - that assert should've tripped then as well. I remember now why I used bitsets - we actually need to know the number of deleted docs to return proper value in IR.numDocs(), and this value is not easy to calculate without actually doing this intersection. Your version looked more elegant, but it still tripped that assert (for good reason). I fixed my version so that it passes the tests when executed through ant (and it still passes in Eclipse, huh .. ).

        Show
        ab Andrzej Bialecki added a comment - Now, the mystery is why this test passed when executed in Eclipse - that assert should've tripped then as well. I remember now why I used bitsets - we actually need to know the number of deleted docs to return proper value in IR.numDocs(), and this value is not easy to calculate without actually doing this intersection. Your version looked more elegant, but it still tripped that assert (for good reason). I fixed my version so that it passes the tests when executed through ant (and it still passes in Eclipse, huh .. ).
        Hide
        thetaphi Uwe Schindler added a comment -

        Ah ok, I didn't look into the test failure yesterday (was too late in the evening), I only wanted to make a quick design and if it would generally work.
        But you are right, the numDocs() return value is incorrect, leading to a failure in this test. But as the test pass in your test environment, the assertion in the SegmentMerger seems not important for functionality. So in general my code and your first code would work correct. I do not know how costly the initial building of the BitSet used for the input reader's deleted docs is, but one possibility would be to only build/use the additional bitset, if hasDeletions() on the original index returns true.

        Thanks for clarifying.

        Show
        thetaphi Uwe Schindler added a comment - Ah ok, I didn't look into the test failure yesterday (was too late in the evening), I only wanted to make a quick design and if it would generally work. But you are right, the numDocs() return value is incorrect, leading to a failure in this test. But as the test pass in your test environment, the assertion in the SegmentMerger seems not important for functionality. So in general my code and your first code would work correct. I do not know how costly the initial building of the BitSet used for the input reader's deleted docs is, but one possibility would be to only build/use the additional bitset, if hasDeletions() on the original index returns true. Thanks for clarifying.
        Hide
        ab Andrzej Bialecki added a comment -

        The test passed in Eclipse only - "ant test" ran from cmdline didn't pass without this fix, so I suspect my Eclipse is to blame for hiding the problem. Re: lazy allocation of bitset - good point, I'll make this change.

        Show
        ab Andrzej Bialecki added a comment - The test passed in Eclipse only - "ant test" ran from cmdline didn't pass without this fix, so I suspect my Eclipse is to blame for hiding the problem. Re: lazy allocation of bitset - good point, I'll make this change.
        Hide
        ab Andrzej Bialecki added a comment -

        As suggested by Uwe, don't allocate the old deletions bitset if there are no deletions.

        Show
        ab Andrzej Bialecki added a comment - As suggested by Uwe, don't allocate the old deletions bitset if there are no deletions.
        Hide
        mikemccand Michael McCandless added a comment -

        Good progress! Andrzej, how about you go ahead & commit yourself?

        Show
        mikemccand Michael McCandless added a comment - Good progress! Andrzej, how about you go ahead & commit yourself?
        Hide
        ab Andrzej Bialecki added a comment -

        I moved the files in this patch to contrib/misc and updated the contrib/CHANGES.txt. If there are no objections I'll commit it soon.

        Show
        ab Andrzej Bialecki added a comment - I moved the files in this patch to contrib/misc and updated the contrib/CHANGES.txt. If there are no objections I'll commit it soon.
        Hide
        koji Koji Sekiguchi added a comment - - edited

        I added small fix. If we have 13 docs (docid=0,1,2,...,12) and numParts=3, 12th doc is missing with -seq mode. I changed this:

        // above range
        for (int j = hi; j < maxDoc; j++) {
           input.deleteDocument(j);
        }
        

        to:

        // above range
        if( i < numParts - 1 ){
          for (int j = hi; j < maxDoc; j++) {
             input.deleteDocument(j);
          }
        }
        
        Show
        koji Koji Sekiguchi added a comment - - edited I added small fix. If we have 13 docs (docid=0,1,2,...,12) and numParts=3, 12th doc is missing with -seq mode. I changed this: // above range for ( int j = hi; j < maxDoc; j++) { input.deleteDocument(j); } to: // above range if ( i < numParts - 1 ){ for ( int j = hi; j < maxDoc; j++) { input.deleteDocument(j); } }
        Hide
        ab Andrzej Bialecki added a comment -

        Indeed, thanks for the fix - I'll commit this.

        Show
        ab Andrzej Bialecki added a comment - Indeed, thanks for the fix - I'll commit this.
        Hide
        ab Andrzej Bialecki added a comment -

        Committed revision 824798.

        Show
        ab Andrzej Bialecki added a comment - Committed revision 824798.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Andrzej!

        Show
        mikemccand Michael McCandless added a comment - Thanks Andrzej!

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            jasonrutherglen Jason Rutherglen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development