Solr / SOLR-4926

I am seeing RecoveryZkTest and ChaosMonkeySafeLeaderTest fail often on trunk.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: SolrCloud
    • Labels: None


        Activity

        Mark Miller added a comment -

        I've got some theories, but nothing hard. I do think this was caused very recently. I'm going to use fullmetaljenkins tomorrow to track down the issue.

        Mark Miller added a comment -

        It looks like replication thinks it's successful, then buffered replays are done - but only the buffered replays work - the replication is adding no docs. Somehow I think the compound file format stuff affected this, but no clue how yet.

        Mark Miller added a comment -

        Hmm...that may have just been what that one case looked like - looking at another case now that may not match. More digging...

        Yonik Seeley added a comment - edited

        In some of the fails, I'm seeing some errors of this form:

          2> Caused by: org.apache.solr.common.SolrException: Error opening Reader
          2>    at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:174)
          2>    at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:185)
          2>    at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:181)
          2>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1487)
          2>    ... 15 more
          2> Caused by: java.lang.AssertionError: liveDocs.count()=4 info.docCount=6 info.getDelCount()=6
          2>    at org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:92)
        

        edit: it looks like this type of error is appearing in about 20% of my fails.

        Yonik Seeley added a comment -

        I hacked the lucene IWC and MergePolicy classes to never use compound format, and then started ChaosMonkeySafeLeaderTest tests in a loop.
        11 passes in a row so far, so it definitely looks like these failures are related to the compound file format.
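
        For context, a minimal sketch of the kind of hack described here, assuming the Lucene 4.x APIs (the class and method names below are illustrative; the actual local hack is not attached):

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.index.TieredMergePolicy;
          import org.apache.lucene.util.Version;

          class NoCfsConfig {
            // Build an IndexWriterConfig that avoids the compound (.cfs/.cfe) format
            // both for newly flushed segments and for merged segments.
            static IndexWriterConfig noCompoundFiles(Analyzer analyzer) {
              IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
              iwc.setUseCompoundFile(false);   // flushed segments: no CFS (setting moved to IWC by LUCENE-5038)
              TieredMergePolicy mp = new TieredMergePolicy();
              mp.setNoCFSRatio(0.0);           // merged segments: never build a compound file
              iwc.setMergePolicy(mp);
              return iwc;
            }
          }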

        Uwe Schindler added a comment - edited

        How does this test depend on CFS or not? So it looks like replication does not work correctly with CFS, which is a serious bug!

        Yonik Seeley added a comment -

        How does this test depend on CFS or not?

        That's the million dollar question. It does not, explicitly, but it seems like the use of CFS somehow causes replication to fail.

        Mark Miller added a comment - edited

        the use of CFS somehow causes replication to fail

        Yeah, this is what I'm seeing - I just caught a really good sample case with decent logging.

        The recovering replica commits on the leader and that leader then has 126 docs to replicate.

        16 documents end up on the replica after the replication - 110 short.

        Before the replication, the leader is on gen 3, the replica on gen 1.

        Perhaps a red herring, but in the many cases of this I've looked at, oddly, no buffered docs are ever replayed after that - though I have seen buffered docs replayed in those same runs when the replication did not fail. Weird observation.

        Anyway, I think I need to turn on more replication-level logging.

        Mark Miller added a comment -

        Reviewing some more sample fails of RecoveryZkTest:

        It actually looks like after the replication we end up one commit point back - e.g. we are trying to replicate gen 3 and the replica moves from gen 1 to gen 2.

        Mark Miller added a comment -

        In the case where the slave is on gen 2, it did just download the files for gen 3 - so it seems we are not picking up the latest commit point somehow.

        Hoss Man added a comment -

        Somehow I think the compound file format stuff affected this, but no clue how yet.

        FYI: LUCENE-5038 inadvertently caused the default compound file behavior in Solr to change: it's supposed to be that compound files are not used in Solr unless explicitly configured. SOLR-4941 will fix this, so once it's committed you will probably start getting "false success" from some of these tests again.

        I strongly suggest that the cloud & replication tests start randomizing <useCompoundFile> in the solrconfig.xml via a sys property.
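
        For illustration, one way that randomization could be wired up in a test solrconfig.xml, using Solr's system property substitution (the property name here is hypothetical):

          <indexConfig>
            <!-- hypothetical hook: defaults to false; tests could pass -DuseCompoundFile=true -->
            <useCompoundFile>${useCompoundFile:false}</useCompoundFile>
          </indexConfig>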

        Mark Miller added a comment -

        I've been focusing on the RecoveryZkTest case.

        Every fail I've looked at has used the RAM dir. Odd, because the safe leader test that fails is hard-coded to not use the RAM dir, I think. RecoveryZkTest also uses the mock dir, but I don't think the safe leader test does, because of the hard-coding to the standard dir.

        Anyway, more on what I'm seeing from the RecoveryZkTest fails:

        We replicate the gen 3 files, then reopen the writer and the searcher using that writer - and we get an index of gen 2: the files from the searcher's directory don't contain the newly replicated files, just the gen 2 index files.
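
        To illustrate the distinction in play here with plain Lucene calls (not the Solr code paths involved): a reader opened from an IndexWriter reflects that writer's view of the index, while a reader opened on the Directory sees the newest commit actually on disk, so the two can disagree when the writer's view is stale.

          import java.io.IOException;
          import org.apache.lucene.index.DirectoryReader;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.store.Directory;

          class ReaderViews {
            // Illustration only: the NRT reader follows the writer, the plain reader follows the directory.
            static void compare(IndexWriter writer, Directory dir) throws IOException {
              DirectoryReader fromWriter = DirectoryReader.open(writer, true); // the writer's view
              DirectoryReader fromDisk = DirectoryReader.open(dir);            // newest segments_N on disk
              System.out.println("writer view maxDoc=" + fromWriter.maxDoc()
                  + ", on-disk commit maxDoc=" + fromDisk.maxDoc());
              fromWriter.close();
              fromDisk.close();
            }
          }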

        Mark Miller added a comment - edited

        Still digging into RecoveryZkTest fails with compound file format...

        Still only seems to happen with RamDir.

        Oddly, it seems that the searcher opened after the replication is somehow using a different Directory than the one the replication goes to and the writer is reopened on. Even though they are directories for the same path. I don't know how this can happen yet - and it would seem this may not be the same issue that affected the safe leader test considering that test looks hard coded to standard directory.

        Yonik Seeley added a comment - edited

        I started looking for failed assertions, and ran across one not related to index corruption. This is probably not related to our current issues though.

          2> 68103 T325 oasc.SolrException.log ERROR java.lang.AssertionError
          2>            at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:62)
          2>            at org.apache.solr.core.SolrCore.close(SolrCore.java:1059)
          2>            at org.apache.solr.handler.admin.CoreAdminHandler$2.run(CoreAdminHandler.java:814)
          2>    
          2> 68104 T206 C40 P47070 oasc.SolrException.log ERROR java.lang.AssertionError
          2>            at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:62)
          2>            at org.apache.solr.core.SolrCore.close(SolrCore.java:1059)
          2>            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:389)
        

        edit: I first hit this 5 times within a 30 minute period... and then I added logging to try and nail it down and it hasn't happened yet after over an hour of running. I'm going to give up soon and get back to the main task at hand (chaos compound-related failures).

        edit: lack of reproducibility was caused by a cut-n-paste bug - it's now back to being very reproducible. Tracked it down to SOLR-4960.

        Mark Miller added a comment -

        Just to update:

        is somehow using a different Directory

        I was confused - better logging shows it's the same directory, just with completely different files in it. So I'm still confused.

        I see the index getting replicated into the directory and the right files in the directory.
        I see the IndexWriter open on the directory and all the right files in the directory.
        I see the IndexReader reopened with that IndexWriter - and boom, it's got a different set of files. It started with an index of gen 1, replicated an index of gen 3, and ends up with a gen 2 index with totally different files.

        Mark Miller added a comment -

        It turns out this was all being caused by:

        [junit4:junit4]   2> org.apache.lucene.index.CorruptIndexException: invalid deletion count: 38 (resource: ChecksumIndexInput(RAMInputStream(name=segments_3)))
        [junit4:junit4]   2> 	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
        [junit4:junit4]   2> 	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:374)
        

        That causes the opening IW to roll back to gen 2 and remove the gen 3 files.

        Yonik Seeley added a comment -

        So one thing I see is that when the old IndexWriter is closed, a new segments file is written out.

          2> 58339 T257 C69 P58203 oasu.DefaultSolrCoreState.newIndexWriter Closing old IndexWriter... core=collection1 dir=org.apache.lucene.store.MMapDirectory@/opt/code/lusolr_clean5/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1372252751336/jetty12/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@42062bad
          2> 58341 T257 C69 P58203 oasu.DefaultSolrCoreState.newIndexWriter PREFILES=[_0.fdx, segments_1, _0.cfs, _0.si, _0.cfe, _0.fdt, segments.gen, segments_4, _0_1.del, write.lock]

          2> 58622 T257 C69 P58203 oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: commits DEBUG_INFO: num=2
          2>            commit{dir=/opt/code/lusolr_clean5/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1372252751336/jetty12/index,segFN=segments_1,generation=1,filenames=[segments_1]}
          2>            commit{dir=/opt/code/lusolr_clean5/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1372252751336/jetty12/index,segFN=segments_2,generation=2,filenames=[_0.cfs, _0.cfe, _0_1.del, segments_2, _0.si]}

          2> 58630 T257 C69 P58203 oasu.DefaultSolrCoreState.newIndexWriter POSTFILES=[segments_2, _0.cfs, _0.si, _0.cfe, segments.gen, segments_4, _0_1.del]

        It looks like this new segments file actually references all of the files we just replicated, which is of course wrong!

        Mark Miller added a comment -

        Cool - this fits what I am seeing as well. I also checked that you can open a writer on the index we replicated, no problem. I was struggling to find what could cause the corruption from there - I suspected the copy to the other dir somehow, but the writer close makes a lot of sense.

        And it explains the mysterious segments_2 file when I only expected segments_1 and segments_3.
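
        A rough sketch of that kind of sanity check on the replicated files, using Lucene's CheckIndex rather than whatever was actually run (the method and variable names are illustrative):

          import java.io.File;
          import java.io.IOException;
          import org.apache.lucene.index.CheckIndex;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.FSDirectory;

          class VerifyReplicatedIndex {
            // Open the replicated index on its own and ask CheckIndex whether it is intact.
            static boolean isClean(File replicatedIndexDir) throws IOException {
              Directory dir = FSDirectory.open(replicatedIndexDir);
              try {
                CheckIndex checker = new CheckIndex(dir);
                CheckIndex.Status status = checker.checkIndex(); // reads every segment, verifies doc/del counts
                return status.clean;
              } finally {
                dir.close();
              }
            }
          }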

        Yonik Seeley added a comment -

        OK, looks like the close of the old IW is actually changing some of the segment files (the number in parens is the file size).
        Now the question is why...

          2> 57648 T256 C30 P58775 oasu.DefaultSolrCoreState.showFiles PREFILES _0.cfe(266)_0.cfs(4375)_0.fdt(0)_0.fdx(0)_0.si(239)_0_1.del(32)segments.gen(20)segments_1(45)segments_4(99)write.lock(0)
        
          2> 57887 T256 C30 P58775 oasu.DefaultSolrCoreState.showFiles POSTFILES _0.cfe(266)_0.cfs(2993)_0.si(239)_0_1.del(31)segments.gen(20)segments_2(70)segments_4(99)
        
        Yonik Seeley added a comment -

        Running with a local modification that does a commit at the start of snappuller (as well as getting the latest commit point from the deletion policy). Crossing my fingers... 12 successes in a row so far.
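
        Conceptually (a sketch against plain Lucene types, not the actual SnapPuller/SolrDeletionPolicy change; the helper name and list parameter are illustrative): commit first so the local index sits on a well-defined commit point, then take the newest commit the deletion policy is already tracking instead of re-reading segments files from the directory.

          import java.io.IOException;
          import java.util.List;
          import org.apache.lucene.index.IndexCommit;
          import org.apache.lucene.index.IndexWriter;

          class PrePullCommit {
            // Sketch: flush pending changes, then pick the highest-generation commit the
            // deletion policy knows about (may be null if nothing was ever committed).
            static IndexCommit latestCommitAfterFlush(IndexWriter writer, List<IndexCommit> trackedCommits) throws IOException {
              writer.commit();
              IndexCommit latest = null;
              for (IndexCommit c : trackedCommits) {
                if (latest == null || c.getGeneration() > latest.getGeneration()) {
                  latest = c;
                }
              }
              return latest;
            }
          }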

        Mark Miller added a comment -

        I have this competing approach attached - all tests pass and my RecoveryZkTest loop is holding up pretty well (22 passes so far).

        Yonik Seeley added a comment - edited

        I have this competing approach attached

        Yeah, I like the rollback right before... safer in the event that somehow the IW acquires more changes. I was initially worried about code that uses the IW to reopen the reader, but that code (openSearcher) doesn't force open a writer and has a fallback to reopen on the reader.

        I'll start looping the chaos test with your patch and see how it goes.
        edit: 20 passes so far... commit it!
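
        For reference, a minimal sketch of the rollback idea being discussed (illustrative names, not Solr's actual DefaultSolrCoreState code): rollback() closes the old writer without committing, so it cannot write a fresh segments_N that shadows the files just replicated into the directory, whereas close() would commit whatever state the writer still holds.

          import java.io.IOException;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.util.Version;

          class ReopenAfterReplication {
            // Sketch: discard the old writer's in-memory state instead of committing it,
            // then reopen on whatever commit replication left in the directory.
            static IndexWriter reopen(IndexWriter oldWriter, Directory dir, Analyzer analyzer) throws IOException {
              oldWriter.rollback(); // closes the writer and releases the lock without writing a new commit
              return new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_44, analyzer));
            }
          }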

        Mark Miller added a comment -

        Thanks Yonik!

        Mark Miller added a comment -

        The commit bot got knocked out because JIRA started responding that it was not authorized.

        Committed:
        5x: r1497020
        4x: r1497022

        Mark Miller added a comment -

        Yonik followed up with:

        4x: r1497055 - get latest commit from deletion policy
        5x: r1497054 - get latest commit from deletion policy

        ASF subversion and git services added a comment -

        Commit 1498554 from Yonik Seeley
        [ https://svn.apache.org/r1498554 ]

        SOLR-4926: SolrDeletionPolicy.getLatestCommit can be null before IW is used

        ASF subversion and git services added a comment -

        Commit 1498559 from Yonik Seeley
        [ https://svn.apache.org/r1498559 ]

        SOLR-4926: SolrDeletionPolicy.getLatestCommit can be null before IW is used

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 0
          • Watchers: 6
