Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-6228

TestReplicationHandler.doTestIndexAndConfigReplication failure on jenkins

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud, Tests
    • Labels:
      None

      Description

      I recently saw a failure on jenkins:

      http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/4077/

      REGRESSION:  org.apache.solr.handler.TestReplicationHandler.doTestIndexAndConfigReplication
      
      Error Message:
      expected:<0> but was:<1404646041002>
      
      Stack Trace:
      java.lang.AssertionError: expected:<0> but was:<1404646041002>
              at __randomizedtesting.SeedInfo.seed([7B532BE2B6B38F19:6F1B70B795B43207]:0)
              at org.junit.Assert.fail(Assert.java:93)
              at org.junit.Assert.failNotEquals(Assert.java:647)
              at org.junit.Assert.assertEquals(Assert.java:128)
              at org.junit.Assert.assertEquals(Assert.java:147)
              at org.apache.solr.handler.TestReplicationHandler.assertVersions(TestReplicationHandler.java:987)
              at org.apache.solr.handler.TestReplicationHandler.doTestIndexAndConfigReplication(TestReplicationHandler.java:559)
      

        Activity

        Show
        mkhludnev Mikhail Khludnev added a comment - follow up https://issues.apache.org/jira/browse/SOLR-11673
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        When test pass.

        18457 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 1
        18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 0
        18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 1
        18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 0
        18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher New index in Master. Deleting mine...

        19397 INFO (qtp254677725-59) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0
        19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 2
        19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 1511707221183
        19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 2
        19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 1511707221131
        19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Starting replication process
        19441 INFO (qtp254677725-65) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={generation=2&qt=/replication&tlogFiles=false&wt=javabin&version=2&command=filelist} status=0 QTime=38
        19442 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Number of files in latest index in master: 10

        Failed test logs

        [junit4] 2> 2033760 INFO (qtp323196139-22495) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={qt=/replication&wt=javabin&version=2&command=indexver
        sion} status=0 QTime=0
        [junit4] 2> 2033760 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 1
        [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 0
        [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 1
        [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 0
        [junit4] 2> 2033761 INFO (qtp323196139-22495) [ x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/solr path=/update params={wt=javabin&version=2}{add=[2000]} 0 0
        [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher New index in Master. Deleting mine...
        [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.u.DirectUpdateHandler2 start commit

        Unknown macro: {,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,so ftCommit=false,prepareCommit=false}

        [junit4] 2> 2034757 INFO (qtp323196139-22497) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params=

        Unknown macro: {qt=/replication&wt=javabin&version=2&command=indexversion}

        status=0 QTime=0
        [junit4] 2> 2034757 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 2
        [junit4] 2> 2034757 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 1507332180815
        [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 2
        [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 1507332180815
        [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave in sync with master.
        [junit4] 2> 2034780 INFO (qtp2142078059-22517) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/select params=

        Unknown macro: {q=*}

        hits=0 status=0 QTime=0
        [junit4] 2> 2034781 INFO (TEST-TestReplicationHandler.doTestIndexAndConfigReplication-seed#[23368D6C872259F3]) [ ] o.a.s.h.TestReplicationHandler Waiting for 1 docs

        If index version is a timestamp, It seems like slave nuked its' index at the same moment as master commits. Thus, we can just postpone master commit until currentTimeMillis() has changed. WDYT?

        Show
        mkhludnev Mikhail Khludnev added a comment - When test pass. 18457 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 1 18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 0 18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 1 18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 0 18462 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher New index in Master. Deleting mine... 19397 INFO (qtp254677725-59) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 2 19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 1511707221183 19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 2 19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 1511707221131 19401 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Starting replication process 19441 INFO (qtp254677725-65) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={generation=2&qt=/replication&tlogFiles=false&wt=javabin&version=2&command=filelist} status=0 QTime=38 19442 INFO (indexFetcher-47-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Number of files in latest index in master: 10 Failed test logs [junit4] 2> 2033760 INFO (qtp323196139-22495) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params={qt=/replication&wt=javabin&version=2&command=indexver sion} status=0 QTime=0 [junit4] 2> 2033760 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 1 [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 0 [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 1 [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 0 [junit4] 2> 2033761 INFO (qtp323196139-22495) [ x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/solr path=/update params={wt=javabin&version=2}{add= [2000] } 0 0 [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher New index in Master. Deleting mine... [junit4] 2> 2033761 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.u.DirectUpdateHandler2 start commit Unknown macro: {,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,so ftCommit=false,prepareCommit=false} [junit4] 2> 2034757 INFO (qtp323196139-22497) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/replication params= Unknown macro: {qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 [junit4] 2> 2034757 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's generation: 2 [junit4] 2> 2034757 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Master's version: 1507332180815 [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's generation: 2 [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave's version: 1507332180815 [junit4] 2> 2034758 INFO (indexFetcher-9020-thread-1) [ x:collection1] o.a.s.h.IndexFetcher Slave in sync with master. [junit4] 2> 2034780 INFO (qtp2142078059-22517) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/select params= Unknown macro: {q=*} hits=0 status=0 QTime=0 [junit4] 2> 2034781 INFO (TEST-TestReplicationHandler.doTestIndexAndConfigReplication-seed# [23368D6C872259F3] ) [ ] o.a.s.h.TestReplicationHandler Waiting for 1 docs If index version is a timestamp, It seems like slave nuked its' index at the same moment as master commits. Thus, we can just postpone master commit until currentTimeMillis() has changed. WDYT?
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        aha. I noticed

        o.a.s.h.TestReplicationHandler Waited for 30000ms and found 0 docs

        Show
        mkhludnev Mikhail Khludnev added a comment - aha. I noticed o.a.s.h.TestReplicationHandler Waited for 30000ms and found 0 docs
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        I just tried to launch solr tests got TestReplicationHandler.doTestIndexAndConfigReplicationfailure and reproduced it easily with beast. TestReplicationHandler-beast-failure.txt is it known one?

        Show
        mkhludnev Mikhail Khludnev added a comment - I just tried to launch solr tests got TestReplicationHandler.doTestIndexAndConfigReplicationfailure and reproduced it easily with beast. TestReplicationHandler-beast-failure.txt is it known one?
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I haven't seen this fail since my commit on 8th July. We can re-open it if necessary.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I haven't seen this fail since my commit on 8th July. We can re-open it if necessary.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1610450 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1610450 ]

        SOLR-6228: Fixed bug in TestReplicationHandler.doTestIndexAndConfigReplication

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1610450 from shalin@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1610450 ] SOLR-6228 : Fixed bug in TestReplicationHandler.doTestIndexAndConfigReplication
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1610448 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1610448 ]

        SOLR-6228: Fixed bug in TestReplicationHandler.doTestIndexAndConfigReplication

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1610448 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1610448 ] SOLR-6228 : Fixed bug in TestReplicationHandler.doTestIndexAndConfigReplication
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I think this is a timing issue in the test:

         [junit4]   2> 1250539 T5542 C2538 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=getDetails&qt=/replication&wt=javabin&version=2&command=details} status=0 QTime=20 
           [junit4]   2> 1250539 T5550 oash.SnapPuller.fetchLatestIndex Configuration files are modified, core will be reloaded
           [junit4]   2> 1250539 T5550 oasu.DefaultSolrCoreState.openIndexWriter Creating new IndexWriter...
           [junit4]   2> 1250540 T5550 oasu.RandomMergePolicy.<init> RandomMergePolicy wrapping class org.apache.lucene.index.LogByteSizeMergePolicy: [LogByteSizeMergePolicy: minMergeSize=1677721, mergeFactor=20, maxMergeSize=2147483648, maxMergeSizeForForcedMerge=9223372036854775807, calibrateSizeByDeletes=false, maxMergeDocs=2147483647, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.4184992878575281]
           [junit4]   2> 1250541 T5550 oasc.SolrDeletionPolicy.onInit SolrDeletionPolicy.onInit: commits: num=2
           [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@3859e74f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7ef8b356),segFN=segments_1,generation=1}
           [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@3859e74f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7ef8b356),segFN=segments_2,generation=2}
           [junit4]   2> 1250541 T5550 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
           [junit4]   2> 1250542 T5550 oasu.DefaultSolrCoreState.openIndexWriter New IndexWriter is ready to be used.
           [junit4]   2> 1250542 T5553 oasc.SolrResourceLoader.<init> new SolrResourceLoader for directory: '.\solr.handler.TestReplicationHandler-7B532BE2B6B38F19-001\solr-instance-002\collection1\'
           [junit4]   2> 1250592 T5553 oasc.SolrConfig.<init> Using Lucene MatchVersion: LUCENE_4_10
           [junit4]   2> ASYNC  NEW_CORE C2539 name=collection1 org.apache.solr.core.SolrCore@6ecac580
           [junit4]   2> 1250622 T5528 C2539 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=assertVersions&qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 
           [junit4]   2> 1250628 T5553 oasc.SolrConfig.<init> Loaded SolrConfig: solrconfig.xml
           [junit4]   2> 1250628 T5553 oass.IndexSchema.readSchema Reading Solr Schema from schema.xml
           [junit4]   2> 1250632 T5553 oass.IndexSchema.readSchema [collection1] Schema name=test
           [junit4]   2> ASYNC  NEW_CORE C2540 name=collection1 org.apache.solr.core.SolrCore@6efe46f0
           [junit4]   2> 1250682 T5544 C2540 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=assertVersions&qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 
           [junit4]   2> 1250683 T5255 oas.SolrTestCaseJ4.tearDown ###Ending doTestIndexAndConfigReplication
        

        In this log:

        1. 1250539 - the getDetails call on the slave happens
        2. The slave core reloads
        3. 1250682 - the indexversion call on the slave happens

        The test should have failed earlier if assertVersions did not have:

        if (maxVersionClient1 > 0 && maxVersionClient2 > 0) {
              assertEquals(maxVersionClient1, maxVersionClient2);
            }
        

        I think a simple fix is to move the assertVersions call after the rQuery call on the slave to make sure that the replication is complete.

        I have committed this fix to trunk and branch_4x but it is not recorded automatically here because Jira was down for maintenance at that time.

        http://svn.apache.org/r1608601
        http://svn.apache.org/r1608602

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I think this is a timing issue in the test: [junit4] 2> 1250539 T5542 C2538 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=getDetails&qt=/replication&wt=javabin&version=2&command=details} status=0 QTime=20 [junit4] 2> 1250539 T5550 oash.SnapPuller.fetchLatestIndex Configuration files are modified, core will be reloaded [junit4] 2> 1250539 T5550 oasu.DefaultSolrCoreState.openIndexWriter Creating new IndexWriter... [junit4] 2> 1250540 T5550 oasu.RandomMergePolicy.<init> RandomMergePolicy wrapping class org.apache.lucene.index.LogByteSizeMergePolicy: [LogByteSizeMergePolicy: minMergeSize=1677721, mergeFactor=20, maxMergeSize=2147483648, maxMergeSizeForForcedMerge=9223372036854775807, calibrateSizeByDeletes= false , maxMergeDocs=2147483647, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.4184992878575281] [junit4] 2> 1250541 T5550 oasc.SolrDeletionPolicy.onInit SolrDeletionPolicy.onInit: commits: num=2 [junit4] 2> commit{dir=MockDirectoryWrapper(RAMDirectory@3859e74f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7ef8b356),segFN=segments_1,generation=1} [junit4] 2> commit{dir=MockDirectoryWrapper(RAMDirectory@3859e74f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7ef8b356),segFN=segments_2,generation=2} [junit4] 2> 1250541 T5550 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2 [junit4] 2> 1250542 T5550 oasu.DefaultSolrCoreState.openIndexWriter New IndexWriter is ready to be used. [junit4] 2> 1250542 T5553 oasc.SolrResourceLoader.<init> new SolrResourceLoader for directory: '.\solr.handler.TestReplicationHandler-7B532BE2B6B38F19-001\solr-instance-002\collection1\' [junit4] 2> 1250592 T5553 oasc.SolrConfig.<init> Using Lucene MatchVersion: LUCENE_4_10 [junit4] 2> ASYNC NEW_CORE C2539 name=collection1 org.apache.solr.core.SolrCore@6ecac580 [junit4] 2> 1250622 T5528 C2539 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=assertVersions&qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 [junit4] 2> 1250628 T5553 oasc.SolrConfig.<init> Loaded SolrConfig: solrconfig.xml [junit4] 2> 1250628 T5553 oass.IndexSchema.readSchema Reading Solr Schema from schema.xml [junit4] 2> 1250632 T5553 oass.IndexSchema.readSchema [collection1] Schema name=test [junit4] 2> ASYNC NEW_CORE C2540 name=collection1 org.apache.solr.core.SolrCore@6efe46f0 [junit4] 2> 1250682 T5544 C2540 oasc.SolrCore.execute [collection1] webapp=/solr path=/replication params={_trace=assertVersions&qt=/replication&wt=javabin&version=2&command=indexversion} status=0 QTime=0 [junit4] 2> 1250683 T5255 oas.SolrTestCaseJ4.tearDown ###Ending doTestIndexAndConfigReplication In this log: 1250539 - the getDetails call on the slave happens The slave core reloads 1250682 - the indexversion call on the slave happens The test should have failed earlier if assertVersions did not have: if (maxVersionClient1 > 0 && maxVersionClient2 > 0) { assertEquals(maxVersionClient1, maxVersionClient2); } I think a simple fix is to move the assertVersions call after the rQuery call on the slave to make sure that the replication is complete. I have committed this fix to trunk and branch_4x but it is not recorded automatically here because Jira was down for maintenance at that time. http://svn.apache.org/r1608601 http://svn.apache.org/r1608602

          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            shalinmangar Shalin Shekhar Mangar
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development