SOLR-7421

RecoveryAfterSoftCommitTest fails frequently on Jenkins due to full index replication taking longer than 30 seconds

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels: None

      Description

      RecoveryAfterSoftCommitTest is failing frequently on Jenkins because the test gives the replica only 30 seconds to recover after the partition is healed. It looks like replicating the full index from the leader is taking more than 30 seconds (the test is designed so that peer sync can't work). It seems bad that it takes more than 30 seconds to replicate an index with only 115 documents in it. I wonder if there is cruft lying around from other tests. I've run beast on this test locally and it always passes. What's weird is that I see log messages like:

         [junit4]   2> 1436627 T4242 N:127.0.0.1:63274_ecb%2Fay C476 oash.IndexFetcher.fetchLatestIndex Number of files in latest index in master: 263
      

      263 files for an index with 115 docs? Doesn't seem right!

      1. RecoveryAfterSoftCommitTest_failure.log
        347 kB
        Timothy Potter
      2. SOLR-7421.patch
        4 kB
        Shalin Shekhar Mangar
      3. SOLR-7421.patch
        4 kB
        Shalin Shekhar Mangar

          Activity

          Timothy Potter added a comment -

          Full log for the failing test from a recent Jenkins failure - https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/2969/

          ASF subversion and git services added a comment -

          Commit 1674512 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1674512 ]

          SOLR-7421: Marking test as a BadApple for now until we can figure out what is causing replication to take so long for a small index

          ASF subversion and git services added a comment -

          Commit 1674516 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1674516 ]

          SOLR-7421: Marking test as a BadApple for now until we can figure out what is causing replication to take so long for a small index

          Shalin Shekhar Mangar added a comment -

          Thanks for looking into this, Tim.

          This test was written to force flush a segment, then perform a softCommit, and then trigger a full replication. To force flush a segment, the test sets maxBufferedDocs to 2, which is why it creates so many files. Now that we can change the peer sync limit via SOLR-6359, we should set the limit to a small value (say n=4) and add only n+1 documents to simulate the same behavior without creating so many index files.
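
The n+1 idea in the comment above can be sketched as follows. This is a deliberately simplified model, assuming a replica can peer sync only while the number of updates it missed does not exceed the peer sync limit. The class and method names are hypothetical, and this is not Solr's actual PeerSync logic.

```java
// Simplified, hypothetical model of the peer sync vs. full replication decision.
// This is NOT Solr's PeerSync implementation; it only illustrates the n+1 idea.
public final class PeerSyncLimitSketch {

    private PeerSyncLimitSketch() {}

    /**
     * Returns true when the replica missed more updates than the peer sync limit,
     * so (in this model) it has to replicate the full index from the leader.
     */
    public static boolean needsFullReplication(int missedUpdates, int peerSyncLimit) {
        if (missedUpdates < 0 || peerSyncLimit < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return missedUpdates > peerSyncLimit;
    }

    public static void main(String[] args) {
        int n = 4; // small peer sync limit, as suggested in the comment
        System.out.println(needsFullReplication(n, n));     // prints false
        System.out.println(needsFullReplication(n + 1, n)); // prints true
    }
}
```

Under this model, a limit of n=4 means five missed documents are already enough to rule out peer sync, so the test would not need to index a large number of documents.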

          Shalin Shekhar Mangar added a comment -

          Also, for this particular seed, the merge policy chosen is LogDocMergePolicy with minMergeSize=1000, so no merges happen at all.

             [junit4]   2> 752435 T2587 N:127.0.0.1:42328_ c:collection1 oasu.RandomMergePolicy.<init> RandomMergePolicy wrapping class org.apache.lucene.index.LogDocMergePolicy: [LogDocMergePolicy: minMergeSize=1000, mergeFactor=28, maxMergeSize=9223372036854775807, maxMergeSizeForForcedMerge=9223372036854775807, calibrateSizeByDeletes=true, maxMergeDocs=2147483647, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.4225057358391613]
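
As a rough sanity check on the file count reported in the description, here is a back-of-the-envelope sketch. It assumes every flush produces exactly one segment of maxBufferedDocs documents and that no merges run, as stated above for this seed. The class and method names are hypothetical.

```java
import java.util.Locale;

// Back-of-the-envelope estimate only. Assumes one segment per flush and no merges.
public final class SegmentEstimateSketch {

    private SegmentEstimateSketch() {}

    /** Number of flushed segments if a flush happens every maxBufferedDocs documents. */
    public static int flushedSegments(int numDocs, int maxBufferedDocs) {
        if (numDocs < 0 || maxBufferedDocs <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and maxBufferedDocs > 0");
        }
        return (numDocs + maxBufferedDocs - 1) / maxBufferedDocs; // ceiling division
    }

    /** Average number of index files per segment. */
    public static double filesPerSegment(int totalFiles, int segments) {
        if (segments <= 0) {
            throw new IllegalArgumentException("segments must be > 0");
        }
        return (double) totalFiles / segments;
    }

    public static void main(String[] args) {
        int segments = flushedSegments(115, 2); // 115 docs, maxBufferedDocs=2
        System.out.println(segments);           // prints 58
        // 263 is the file count from the IndexFetcher log line in the description
        System.out.println(String.format(Locale.ROOT, "%.1f", filesPerSegment(263, segments))); // prints 4.5
    }
}
```

With about 58 unmerged segments, 263 files works out to roughly four to five files per segment, which is plausible for segments that are not written as compound files.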
          
          Shalin Shekhar Mangar added a comment -
          1. numRecordsToKeep is set to 2 instead of the default 100, so that we don't need to add so many records to trigger a full replication.
          2. solr.cloud.wait-for-updates-with-stale-state-pause is set to 500ms for this test, because the pause needlessly adds a 7 second delay and is not useful here.
          3. I changed the BadApple to AwaitsFix so that the test isn't skipped on Jenkins.
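
The first two settings in the list above could be applied as JVM system properties before the test cluster starts. A minimal sketch follows, with loud assumptions: the property name `solr.ulog.numRecordsToKeep` is a guess at how the test solrconfig.xml exposes numRecordsToKeep, the pause value is assumed to be a plain millisecond count, and the class itself is hypothetical rather than taken from the patch.

```java
// Hypothetical helper, not code from SOLR-7421.patch.
public final class RecoveryTestSettingsSketch {

    // ASSUMPTION: the test solrconfig.xml reads numRecordsToKeep from this property.
    static final String NUM_RECORDS_TO_KEEP_PROP = "solr.ulog.numRecordsToKeep";

    // Property name as given in the comment above; value assumed to be milliseconds.
    static final String STALE_STATE_PAUSE_PROP = "solr.cloud.wait-for-updates-with-stale-state-pause";

    private RecoveryTestSettingsSketch() {}

    /** Sets both properties; intended to run before any Solr node is started. */
    public static void apply(int numRecordsToKeep, int pauseMillis) {
        System.setProperty(NUM_RECORDS_TO_KEEP_PROP, Integer.toString(numRecordsToKeep));
        System.setProperty(STALE_STATE_PAUSE_PROP, Integer.toString(pauseMillis));
    }

    /** Removes both properties so later tests in the same JVM see the defaults again. */
    public static void clear() {
        System.clearProperty(NUM_RECORDS_TO_KEEP_PROP);
        System.clearProperty(STALE_STATE_PAUSE_PROP);
    }

    public static void main(String[] args) {
        apply(2, 500);
        try {
            System.out.println(System.getProperty(NUM_RECORDS_TO_KEEP_PROP));
            System.out.println(System.getProperty(STALE_STATE_PAUSE_PROP));
        } finally {
            clear();
        }
    }
}
```

Clearing the properties afterwards matters in a shared test JVM, since a leftover value could change the behavior of unrelated tests.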
          Shalin Shekhar Mangar added a comment -
          1. Use compoundFile=true to cut down on the number of files created.
          2. AwaitsFix also prevents Jenkins from running this test, so I removed it.
          3. I checked out the revision before SOLR-6640 was fixed, ran the modified test (with compoundFile=true), and verified that the test still reproduces the corrupt index exception.
          4. I beasted the test overnight on slow and fast hardware and was not able to get it to fail.

          I'll commit this shortly.

          ASF subversion and git services added a comment -

          Commit 1674733 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1674733 ]

          SOLR-7421: RecoveryAfterSoftCommitTest fails frequently on Jenkins due to full index replication taking longer than 30 seconds

          ASF subversion and git services added a comment -

          Commit 1674734 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1674734 ]

          SOLR-7421: RecoveryAfterSoftCommitTest fails frequently on Jenkins due to full index replication taking longer than 30 seconds

          Shalin Shekhar Mangar added a comment -

          Last failure was 8 days ago.

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Timothy Potter