SOLR-6227: ChaosMonkeySafeLeaderTest failures on jenkins

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud, Tests
    • Labels: None

      Description

      This is happening very frequently.

      1 tests failed.
      REGRESSION:  org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.testDistribSearch
      
      Error Message:
      shard1 is not consistent.  Got 143 from https://127.0.0.1:36610/xvv/collection1lastClient and got 142 from https://127.0.0.1:33168/xvv/collection1
      
      Stack Trace:
      java.lang.AssertionError: shard1 is not consistent.  Got 143 from https://127.0.0.1:36610/xvv/collection1lastClient and got 142 from https://127.0.0.1:33168/xvv/collection1
              at __randomizedtesting.SeedInfo.seed([3C1FB6EADDDDFE71:BDF938F2AA829E4D]:0)
              at org.junit.Assert.fail(Assert.java:93)
              at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1139)
              at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1118)
              at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:150)
              at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
      

          Activity

          Mark Miller added a comment -

          Yeah, this started months ago now, though it's become more frequent since I've been away the past couple of months.

          Shalin Shekhar Mangar added a comment -

          I can reproduce it locally by setting -Dsolr.tests.cloud.cm.runlength to a high value.
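
          A minimal sketch of what that knob does, assuming the test honors the property roughly like this (the class name and fallback bounds are hypothetical, not the actual test code):

              import java.util.Random;

              public class RunLengthSketch {
                public static void main(String[] args) throws InterruptedException {
                  // -Dsolr.tests.cloud.cm.runlength overrides how long the monkey
                  // may run before the shard-consistency check; -1 means "not set".
                  long runLength = Long.parseLong(
                      System.getProperty("solr.tests.cloud.cm.runlength", "-1"));
                  if (runLength == -1) {
                    // No override supplied: fall back to a short random run
                    // (the bounds here are an assumption).
                    runLength = 1000 + new Random().nextInt(9000);
                  }
                  // A longer run gives the monkey more kills and restarts, so a
                  // rare inconsistency becomes much easier to reproduce.
                  Thread.sleep(runLength);
                }
              }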

          Mark Miller added a comment -

          I've seen it locally over that same period with no changes - probably 1 out of 10 or 1 out of 20 runs.

          A thousand things could be broken or off; there's no way to know without digging into the logs. After this much time, it's often multiple things. I have not had a chance to dig in yet, but eventually I will if no one else does.

          Mark Miller added a comment -

          When I start up my local jenkins machine again (it's been off while I've been traveling), I will get some better logs (I have the logging tuned differently for CM tests) and attach them.

          Shalin Shekhar Mangar added a comment -

          I haven't seen this test fail since SOLR-6235 was committed. It is possible that the underlying issue was the same in both failures. My local jenkins is chugging along nicely but I haven't been able to reproduce this. I'll keep this open for a couple more days and then close it if I still can't reproduce the failure.

          Mark Miller added a comment -

          I've had my jenkins running all week too (with some CM-specific jobs as well); I just haven't checked up on them yet. I'll look and report back soon.

          Shalin Shekhar Mangar added a comment -

          I haven't seen the failure mentioned in the issue description but my jenkins found the following failure yesterday:

          java.lang.AssertionError: expected:<0> but was:<1>
          	at __randomizedtesting.SeedInfo.seed([2D7931A1F137DAA5:AC9FBFB98668BA99]:0)
          	at org.junit.Assert.fail(Assert.java:93)
          	at org.junit.Assert.failNotEquals(Assert.java:647)
          	at org.junit.Assert.assertEquals(Assert.java:128)
          	at org.junit.Assert.assertEquals(Assert.java:472)
          	at org.junit.Assert.assertEquals(Assert.java:456)
          	at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:141)
          	at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:863)
          
          Mark Miller added a comment -

          Yeah, that failure has been around - I get it occasionally. It just means there was a failed update when you wouldn't expect one: we expect 0 failures because we don't kill leaders, yet a request can occasionally fail with org.apache.http.NoHttpResponseException: The target server failed to respond. It's not necessarily illegal behavior; it just shouldn't really happen.

          Anyway, that's a much less concerning issue. As for the inconsistency, which is totally illegal - like the one in your report - I no longer see it in my local jenkins jobs. Fantastic. That's always a scary failure.
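
          A minimal sketch of the invariant described above (the method and the per-thread bookkeeping are hypothetical, not the actual test code):

              import static org.junit.Assert.assertEquals;

              import java.util.List;

              public class SafeLeaderInvariantSketch {
                // With only non-leader replicas killed, every update should
                // succeed, so the summed per-thread failure count must be 0.
                static void checkNoFails(List<Integer> perThreadFailCounts) {
                  int totalFails = 0;
                  for (int fails : perThreadFailCounts) {
                    totalFails += fails;
                  }
                  // A transient NoHttpResponseException from a just-killed
                  // follower can bump one counter and trip this check,
                  // producing the expected:<0> but was:<1> failure above.
                  assertEquals(0, totalFails);
                }
              }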

          Shalin Shekhar Mangar added a comment -

          More failures from jenkins:

          java.lang.AssertionError: The Monkey ran for over 20 seconds and no jetties were stopped - this is worth investigating!
          	at __randomizedtesting.SeedInfo.seed([3F5FA11431DFAF47:B70B9ECE9F23C2BF]:0)
          	at org.junit.Assert.fail(Assert.java:93)
          	at org.apache.solr.cloud.ChaosMonkey.stopTheMonkey(ChaosMonkey.java:537)
          	at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.test(ChaosMonkeySafeLeaderTest.java:137)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          

          I did investigate: it happens only on nightly runs, and only when shardCount is equal to sliceCount. In that case each slice has just one replica and the ChaosMonkey has nothing it can kill. I'll fix it by making sure we create at least sliceCount + 1 jetties so that there's always at least one node to kill (see the sketch below).
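
          A minimal sketch of that guard (the method and parameter names are hypothetical; the real change belongs in the test's cluster setup):

              public class JettyCountSketch {
                // With shardCount == sliceCount, every slice has exactly one
                // replica - its leader - which the safe-leader monkey may not
                // stop. Clamping to sliceCount + 1 guarantees at least one
                // slice gets a non-leader replica the monkey can kill.
                static int safeShardCount(int sliceCount, int requestedShardCount) {
                  return Math.max(sliceCount + 1, requestedShardCount);
                }
              }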

          ASF subversion and git services added a comment -

          Commit 1657487 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1657487 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          ASF subversion and git services added a comment -

          Commit 1657488 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1657488 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          ASF subversion and git services added a comment -

          Commit 1657489 from shalin@apache.org in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1657489 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          Shalin Shekhar Mangar added a comment -

          I'm going to close this long-open issue now.

          Mark Miller added a comment -

          Cool, thanks Shalin.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Shalin Shekhar Mangar
            • Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0
            • Watchers:
              7
