Lucene - Core
  1. Lucene - Core
  2. LUCENE-6386

TestIndexWriterForceMerge still unreliable in NIGHTLY

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Discovered by ryan beasting (trunk):

      ant test -Dtestcase=TestIndexWriterForceMerge -Dtests.method=testForceMergeTempSpaceUsage -Dtests.seed=DC9ADB74850A581B -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=sr__#Latn -Dtests.timezone=Indian/Chagos -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

        [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterForceMerge -Dtests.method=testForceMergeTempSpaceUsage -Dtests.seed=DC9ADB74850A581B -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=sr__#Latn -Dtests.timezone=Indian/Chagos -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
         [junit4] FAILURE 1.20s | TestIndexWriterForceMerge.testForceMergeTempSpaceUsage <<<
         [junit4]    > Throwable #1: java.lang.AssertionError: forceMerge used too much temporary space: starting usage was 291570 bytes; final usage was 262469 bytes; max temp usage was 1079501 but should have been 874710 (= 3X starting usage), BEFORE=
         [junit4]    > _u.scf              146329
         [junit4]    > _u.si               635
         [junit4]    >  |- (inside compound file) _u.fld              2214
         [junit4]    >  |- (inside compound file) _u.inf              392
         [junit4]    >  |- (inside compound file) _u.len              2381
         [junit4]    >  |- (inside compound file) _u.pst              36758
         [junit4]    >  |- (inside compound file) _u.vec              104144
         [junit4]    > _s.pst              1338
         [junit4]    > _s.inf              392
         [junit4]    > _s.fld              94
         [junit4]    > _s.len              221
         [junit4]    > _s.vec              3744
         [junit4]    > _s.si               624
         [junit4]    > _t.fld              94
         [junit4]    > _t.len              221
         [junit4]    > _t.pst              1338
         [junit4]    > _t.inf              392
         [junit4]    > _t.vec              3744
         [junit4]    > _t.si               624
         [junit4]    > _v.fld              94
         [junit4]    > _v.pst              1338
         [junit4]    > _v.inf              392
         [junit4]    > _v.vec              3744
         [junit4]    > _v.si               624
         [junit4]    > _v.len              221
         [junit4]    > _w.len              221
         [junit4]    > _w.pst              1338
         [junit4]    > _w.inf              392
         [junit4]    > _w.fld              94
         [junit4]    > _w.si               624
         [junit4]    > _w.vec              3744
         [junit4]    > _x.vec              3744
         [junit4]    > _x.inf              392
         [junit4]    > _x.pst              1338
         [junit4]    > _x.fld              94
         [junit4]    > _x.si               624
         [junit4]    > _x.len              221
         [junit4]    > _y.fld              94
         [junit4]    > _y.pst              1338
         [junit4]    > _y.inf              392
         [junit4]    > _y.si               624
         [junit4]    > _y.vec              3744
         [junit4]    > _y.len              221
         [junit4]    > _z.fld              94
         [junit4]    > _z.pst              1338
         [junit4]    > _z.inf              392
         [junit4]    > _z.len              221
         [junit4]    > _z.vec              3744
         [junit4]    > _z.si               624
         [junit4]    > _10.si              630
         [junit4]    > _10.fld             94
         [junit4]    > _10.pst             1338
         [junit4]    > _10.inf             392
         [junit4]    > _10.vec             3744
         [junit4]    > _10.len             221
         [junit4]    > _11.len             221
         [junit4]    > _11.si              630
         [junit4]    > _11.vec             3744
         [junit4]    > _11.pst             1338
         [junit4]    > _11.inf             392
         [junit4]    > _11.fld             94
         [junit4]    > _12.vec             3744
         [junit4]    > _12.si              630
         [junit4]    > _12.len             221
         [junit4]    > _12.fld             94
         [junit4]    > _12.pst             1338
         [junit4]    > _12.inf             392
         [junit4]    > _13.fld             94
         [junit4]    > _13.vec             3744
         [junit4]    > _13.si              630
         [junit4]    > _13.pst             1338
         [junit4]    > _13.inf             392
         [junit4]    > _13.len             221
         [junit4]    > _14.fld             94
         [junit4]    > _14.pst             1338
         [junit4]    > _14.inf             392
         [junit4]    > _14.si              630
         [junit4]    > _14.vec             3744
         [junit4]    > _14.len             221
         [junit4]    > _15.len             221
         [junit4]    > _15.vec             3744
         [junit4]    > _15.si              630
         [junit4]    > _15.pst             1338
         [junit4]    > _15.inf             392
         [junit4]    > _15.fld             94
         [junit4]    > _16.vec             3744
         [junit4]    > _16.len             221
         [junit4]    > _16.fld             94
         [junit4]    > _16.si              630
         [junit4]    > _16.pst             1338
         [junit4]    > _16.inf             392
         [junit4]    > _17.vec             3744
         [junit4]    > _17.pst             1338
         [junit4]    > _17.inf             392
         [junit4]    > _17.len             221
         [junit4]    > _17.si              630
         [junit4]    > _17.fld             94
         [junit4]    > _18.pst             1338
         [junit4]    > _18.inf             392
         [junit4]    > _18.len             221
         [junit4]    > _18.vec             3744
         [junit4]    > _18.si              630
         [junit4]    > _18.fld             94
         [junit4]    > _19.fld             94
         [junit4]    > _19.si              630
         [junit4]    > _19.len             221
         [junit4]    > _19.vec             3744
         [junit4]    > _19.pst             1338
         [junit4]    > _19.inf             392
         [junit4]    > _1a.fld             94
         [junit4]    > _1a.pst             1338
         [junit4]    > _1a.inf             392
         [junit4]    > _1a.len             221
         [junit4]    > _1a.vec             3744
         [junit4]    > _1a.si              630
         [junit4]    > _1b.fld             94
         [junit4]    > _1b.si              630
         [junit4]    > _1b.pst             1338
         [junit4]    > _1b.inf             392
         [junit4]    > _1b.vec             3744
         [junit4]    > _1b.len             221
         [junit4]    > _1c.vec             3744
         [junit4]    > _1c.pst             1338
         [junit4]    > _1c.inf             392
         [junit4]    > _1c.fld             94
         [junit4]    > _1c.len             221
         [junit4]    > _1c.si              630
         [junit4]    > _1d.pst             1338
         [junit4]    > _1d.inf             392
         [junit4]    > _1d.fld             94
         [junit4]    > _1d.len             221
         [junit4]    > _1d.vec             3744
         [junit4]    > _1d.si              630
         [junit4]    > _1e.fld             94
         [junit4]    > _1e.pst             1338
         [junit4]    > _1e.inf             392
         [junit4]    > _1e.len             221
         [junit4]    > _1e.vec             3744
         [junit4]    > _1e.si              630
         [junit4]    > _1f.fld             40
         [junit4]    > _1f.pst             195
         [junit4]    > _1f.inf             392
         [junit4]    > _1f.len             149
         [junit4]    > _1f.vec             405
         [junit4]    > _1f.si              629
         [junit4]    > AFTER=
         [junit4]    > _1j.si              635
         [junit4]    > _1j.scf             261701
         [junit4]    >  |- (inside compound file) _1j.fld             3982
         [junit4]    >  |- (inside compound file) _1j.inf             392
         [junit4]    >  |- (inside compound file) _1j.len             4149
         [junit4]    >  |- (inside compound file) _1j.pst             66151
         [junit4]    >  |- (inside compound file) _1j.vec             186577
         [junit4]    > 	at __randomizedtesting.SeedInfo.seed([DC9ADB74850A581B:C6581887EB1ABBDC]:0)
         [junit4]    > 	at org.apache.lucene.index.TestIndexWriterForceMerge.testForceMergeTempSpaceUsage(TestIndexWriterForceMerge.java:199)
         [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
         [junit4]   2> NOTE: test params are: codec=SimpleText, sim=RandomSimilarityProvider(queryNorm=false,coord=no): {id=IB SPL-D2, content=DFR I(F)B1}, locale=sr__#Latn, timezone=Indian/Chagos
         [junit4]   2> NOTE: Linux 3.13.0-45-generic amd64/Oracle Corporation 1.8.0_40-ea (64-bit)/cpus=8,threads=1,free=208092000,total=253231104
         [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterForceMerge]
         [junit4] Completed in 1.54s, 1 test, 1 failure <<< FAILURES!
      
      1. LUCENE-6386.patch
        3 kB
        Michael McCandless

        Activity

        Hide
        Robert Muir added a comment -

        See attached log in the description. When i fought with this test last night, i tried to make it easier to debug (print inner CFS content sizes etc).

        But its still hard, maybe someone else should have a look at the test. We could try to summarize bytes by extension, that might help.

        In this case its not the ending size after forceMerge that is the problem, but the peak disk usage as tracked by MDW.

        Maybe it only happens in NIGHTLY because both of the two fields involved randomly got term vector options set (which is rare):
        [junit4] 1> NOTE: LuceneTestCase: upgrade name=content type=indexed,tokenized,termVector,termVectorPosition,termVectorPayloads
        [junit4] 1> NOTE: LuceneTestCase: upgrade name=id type=stored,indexed,tokenized,termVector,termVectorOffsets

        Show
        Robert Muir added a comment - See attached log in the description. When i fought with this test last night, i tried to make it easier to debug (print inner CFS content sizes etc). But its still hard, maybe someone else should have a look at the test. We could try to summarize bytes by extension, that might help. In this case its not the ending size after forceMerge that is the problem, but the peak disk usage as tracked by MDW. Maybe it only happens in NIGHTLY because both of the two fields involved randomly got term vector options set (which is rare): [junit4] 1> NOTE: LuceneTestCase: upgrade name=content type=indexed,tokenized,termVector,termVectorPosition,termVectorPayloads [junit4] 1> NOTE: LuceneTestCase: upgrade name=id type=stored,indexed,tokenized,termVector,termVectorOffsets
        Hide
        Michael McCandless added a comment -

        I'm digging here...

        Show
        Michael McCandless added a comment - I'm digging here...
        Hide
        Michael McCandless added a comment -

        I think this was caused by LUCENE-5941, where we managed to reason to ourselves that forceMerge only requires up to 2X your index size in additional free space, but I think in fact it requires up to 3X in the worst case (as was documented/tested before LUCENE-5941), which this seed hits.

        The worst case is CFS is enabled, forceMerge to 1 segment, there are enough segments that a first pass of N merges must be done first, followed by a final merge to produce the 1 segment, no deletes.

        In this case the first round of level 1 merges use up 1X spare space, then the level 2 merge down to 1 segment uses up another 1X, then to build the CFS for that segment uses another 1X = total 3X, like we documented/enforced before LUCENE-5941.

        Am I missing something....?

        Show
        Michael McCandless added a comment - I think this was caused by LUCENE-5941 , where we managed to reason to ourselves that forceMerge only requires up to 2X your index size in additional free space, but I think in fact it requires up to 3X in the worst case (as was documented/tested before LUCENE-5941 ), which this seed hits. The worst case is CFS is enabled, forceMerge to 1 segment, there are enough segments that a first pass of N merges must be done first, followed by a final merge to produce the 1 segment, no deletes. In this case the first round of level 1 merges use up 1X spare space, then the level 2 merge down to 1 segment uses up another 1X, then to build the CFS for that segment uses another 1X = total 3X, like we documented/enforced before LUCENE-5941 . Am I missing something....?
        Hide
        Michael McCandless added a comment -

        Patch, I think it's ready.

        Show
        Michael McCandless added a comment - Patch, I think it's ready.
        Hide
        Robert Muir added a comment -

        +1

        Show
        Robert Muir added a comment - +1
        Hide
        Shai Erera added a comment -

        So if we have the index in the following state, according to what you describe we will use additional 3X space:

        source:               [s1 s2 s3]  [s4 s5 s6]  [s7 s8 s9]
        level1 (not CFS'd):      [s10]       [s11]      [s12]
        level2 (non CFS'd):                 [s13]
        result (CFS'd):                     [s14]
        

        When we create level2, we indeed use additional 2X space. But is there a reason why when that's done we don't delete level1's intermediate files? Nobody references them correct?

        If we don't, then what happens when that forceMerge() hits 10 levels of merges? Will all the intermediate files be kept in the index? If so, then during forceMerge the additional space required can be much higher than just 3X.

        Show
        Shai Erera added a comment - So if we have the index in the following state, according to what you describe we will use additional 3X space: source: [s1 s2 s3] [s4 s5 s6] [s7 s8 s9] level1 (not CFS'd): [s10] [s11] [s12] level2 (non CFS'd): [s13] result (CFS'd): [s14] When we create level2, we indeed use additional 2X space. But is there a reason why when that's done we don't delete level1's intermediate files? Nobody references them correct? If we don't, then what happens when that forceMerge() hits 10 levels of merges? Will all the intermediate files be kept in the index? If so, then during forceMerge the additional space required can be much higher than just 3X.
        Hide
        Robert Muir added a comment -

        When we create level2, we indeed use additional 2X space. But is there a reason why when that's done we don't delete level1's intermediate files? Nobody references them correct?

        IFD will delete level1 files after level2 is finally done. So if it proceeded to level3, it would still be 3x. its always 3x because at some time you can have N, N cfs packing, and N-1.

        Show
        Robert Muir added a comment - When we create level2, we indeed use additional 2X space. But is there a reason why when that's done we don't delete level1's intermediate files? Nobody references them correct? IFD will delete level1 files after level2 is finally done. So if it proceeded to level3, it would still be 3x. its always 3x because at some time you can have N, N cfs packing, and N-1.
        Hide
        Shai Erera added a comment -

        Right, therefore I still think we consume up to 2X additional space? When we build level2, we have source + level1 + level2. When level2 is complete, we delete level1, and so we have source + level2. When we build level3 (the CFS), we have source + level2 + level3. No?

        Show
        Shai Erera added a comment - Right, therefore I still think we consume up to 2X additional space? When we build level2, we have source + level1 + level2. When level2 is complete, we delete level1, and so we have source + level2. When we build level3 (the CFS), we have source + level2 + level3. No?
        Hide
        Michael McCandless added a comment -

        This is hard to think about

        In your example I'm assuming "result" is just the same as "level2", but converted into CFS (ie, s14 really should have been s13.cfs, but I'll just call it s14 here to try not to lose my sanity).

        So, at the exact moment when IW finishes building the final (level3) CFS [s14], you have:

        • original source segments -> 1X
        • level 1 segments -> 1X
        • level 2 segment [s13] non-CFS -> 1X
        • result segment [s14] CFS -> 1X

        We do NOT delete level 1 segments after merging to the level 2 non-CFS segment before creating the result s14 (we used to do this, but it caused complexity/problems because a non-CFS file can unexpectedly "sneak into" an IW commit or an NRT reader even when you demanded CFS).

        So at that moment, the peak temp disk usage is 3X?

        Show
        Michael McCandless added a comment - This is hard to think about In your example I'm assuming "result" is just the same as "level2", but converted into CFS (ie, s14 really should have been s13.cfs, but I'll just call it s14 here to try not to lose my sanity). So, at the exact moment when IW finishes building the final (level3) CFS [s14] , you have: original source segments -> 1X level 1 segments -> 1X level 2 segment [s13] non-CFS -> 1X result segment [s14] CFS -> 1X We do NOT delete level 1 segments after merging to the level 2 non-CFS segment before creating the result s14 (we used to do this, but it caused complexity/problems because a non-CFS file can unexpectedly "sneak into" an IW commit or an NRT reader even when you demanded CFS). So at that moment, the peak temp disk usage is 3X?
        Hide
        Shai Erera added a comment -

        s14 really should have been s13.cfs

        You're right, my mistake. The example should have listed s13.cfs.

        We do NOT delete level 1 segments after merging to the level 2 non-CFS segment before creating the result s14

        I see. That explains the 4X. But then what happens if there are multi-level merges? Do we delete any of these files? Let me give a concrete example:

        source: [s1,s2] [s3,s4] [s5,s6] [s7,s8]
        level1: [s9,s10], [s11,s12]
        level2: [s13,s14]
        level3: [s15]
        final:  [s15.cfs]
        

        Would we take 5X disk space in that case??

        Show
        Shai Erera added a comment - s14 really should have been s13.cfs You're right, my mistake. The example should have listed s13.cfs . We do NOT delete level 1 segments after merging to the level 2 non-CFS segment before creating the result s14 I see. That explains the 4X. But then what happens if there are multi-level merges? Do we delete any of these files? Let me give a concrete example: source: [s1,s2] [s3,s4] [s5,s6] [s7,s8] level1: [s9,s10], [s11,s12] level2: [s13,s14] level3: [s15] final: [s15.cfs] Would we take 5X disk space in that case??
        Hide
        Michael McCandless added a comment -

        If there are more levels we don't use any more space, because we will fully delete the intermediate levels after they are merged.

        So in your example, once s13 is done (and made into .cfs if necessary) we delete s9 and s10, as long as no NRT reader was opened holding a reference to them (or, a commit).

        Show
        Michael McCandless added a comment - If there are more levels we don't use any more space, because we will fully delete the intermediate levels after they are merged. So in your example, once s13 is done (and made into .cfs if necessary) we delete s9 and s10, as long as no NRT reader was opened holding a reference to them (or, a commit).
        Hide
        Shai Erera added a comment -

        I thought we only write the final segments as .cfs if needed? Does what you say mean the following happens:

        source: [s1,s2] [s3,s4] [s5,s6] [s7,s8]
        level1: [s9,s10], [s11,s12]
        level1: [s9.cfs,s10.cfs], [s11.cfs,s12.cfs]
        level2: [s13,s14]
        level2: [s13.cfs,s14.cfs]
        level3: [s15]
        final:  [s15.cfs]
        

        So when it gets to create level3, with the level2.cfs written, the index contains source (1X), level2.cfs (1X), level3 (1X) and level3.cfs (1X) – that's 4X indeed.

        Is there a reason why we don't delete level2.cfs after creating level3? If not, can we fix that and reduce 1X here?

        Show
        Shai Erera added a comment - I thought we only write the final segments as .cfs if needed? Does what you say mean the following happens: source: [s1,s2] [s3,s4] [s5,s6] [s7,s8] level1: [s9,s10], [s11,s12] level1: [s9.cfs,s10.cfs], [s11.cfs,s12.cfs] level2: [s13,s14] level2: [s13.cfs,s14.cfs] level3: [s15] final: [s15.cfs] So when it gets to create level3, with the level2.cfs written, the index contains source (1X), level2.cfs (1X), level3 (1X) and level3.cfs (1X) – that's 4X indeed. Is there a reason why we don't delete level2.cfs after creating level3 ? If not, can we fix that and reduce 1X here?
        Hide
        Robert Muir added a comment -

        Looking at the code, I think that is too aggressive/complicated to do right now. The CFS interaction with merging is already absolutely horrible.

        We should fix the documentation bug at the moment. But IMO we should not do this optimization until this stuff is cleaned up (LUCENE-5988)

        Show
        Robert Muir added a comment - Looking at the code, I think that is too aggressive/complicated to do right now. The CFS interaction with merging is already absolutely horrible. We should fix the documentation bug at the moment. But IMO we should not do this optimization until this stuff is cleaned up ( LUCENE-5988 )
        Hide
        Michael McCandless added a comment -

        I thought we only write the final segments as .cfs if needed?

        No: every segment that's written (flush or merge) decides whether to be cfs or not and then builds the cfs immediately after writing...

        Is there a reason why we don't delete level2.cfs after creating level3? If not, can we fix that and reduce 1X here?

        In the past we did that, by exposing level3 (not yet a CFS) into IW's SIS, and then deleting level2.cfs before building level3.cfs ... but this proved tricky (I think we had ref count bugs around this, and it also allowed commit / NRT readers to see a non-CFS segment when such external usage expected the segments to be CFS).

        Show
        Michael McCandless added a comment - I thought we only write the final segments as .cfs if needed? No: every segment that's written (flush or merge) decides whether to be cfs or not and then builds the cfs immediately after writing... Is there a reason why we don't delete level2.cfs after creating level3? If not, can we fix that and reduce 1X here? In the past we did that, by exposing level3 (not yet a CFS) into IW's SIS, and then deleting level2.cfs before building level3.cfs ... but this proved tricky (I think we had ref count bugs around this, and it also allowed commit / NRT readers to see a non-CFS segment when such external usage expected the segments to be CFS).
        Hide
        Shai Erera added a comment -

        OK, I understand. So in theory we could reduce it back to 3X (which is good!), but we'd need to do it carefully and make sure we don't reintroduce bugs. I'm fine if it's done one day and not here.

        Show
        Shai Erera added a comment - OK, I understand. So in theory we could reduce it back to 3X (which is good!), but we'd need to do it carefully and make sure we don't reintroduce bugs. I'm fine if it's done one day and not here.
        Hide
        ASF subversion and git services added a comment -

        Commit 1670959 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1670959 ]

        LUCENE-6386: correct javadocs about temp disk space required for forceMerge(1)

        Show
        ASF subversion and git services added a comment - Commit 1670959 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1670959 ] LUCENE-6386 : correct javadocs about temp disk space required for forceMerge(1)
        Hide
        ASF subversion and git services added a comment -

        Commit 1670960 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670960 ]

        LUCENE-6386: correct javadocs about temp disk space required for forceMerge(1)

        Show
        ASF subversion and git services added a comment - Commit 1670960 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1670960 ] LUCENE-6386 : correct javadocs about temp disk space required for forceMerge(1)
        Hide
        ASF subversion and git services added a comment -

        Commit 1670963 from Michael McCandless in branch 'dev/branches/lucene_solr_5_1'
        [ https://svn.apache.org/r1670963 ]

        LUCENE-6386: correct javadocs about temp disk space required for forceMerge(1)

        Show
        ASF subversion and git services added a comment - Commit 1670963 from Michael McCandless in branch 'dev/branches/lucene_solr_5_1' [ https://svn.apache.org/r1670963 ] LUCENE-6386 : correct javadocs about temp disk space required for forceMerge(1)
        Hide
        Timothy Potter added a comment -

        Bulk close after 5.1 release

        Show
        Timothy Potter added a comment - Bulk close after 5.1 release

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development