Solr / SOLR-6227

ChaosMonkeySafeLeaderTest failures on jenkins

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud, Tests
    • Labels:
      None

      Description

      This is happening very frequently.

      1 tests failed.
      REGRESSION:  org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.testDistribSearch
      
      Error Message:
      shard1 is not consistent.  Got 143 from https://127.0.0.1:36610/xvv/collection1lastClient and got 142 from https://127.0.0.1:33168/xvv/collection1
      
      Stack Trace:
      java.lang.AssertionError: shard1 is not consistent.  Got 143 from https://127.0.0.1:36610/xvv/collection1lastClient and got 142 from https://127.0.0.1:33168/xvv/collection1
              at __randomizedtesting.SeedInfo.seed([3C1FB6EADDDDFE71:BDF938F2AA829E4D]:0)
              at org.junit.Assert.fail(Assert.java:93)
              at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1139)
              at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1118)
              at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:150)
              at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
      


          Activity

          anshumg Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          markrmiller@gmail.com Mark Miller added a comment -

          Cool, thanks Shalin.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I'm going to close this long open issue here.

          jira-bot ASF subversion and git services added a comment -

          Commit 1657489 from shalin@apache.org in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1657489 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          jira-bot ASF subversion and git services added a comment -

          Commit 1657488 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1657488 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          jira-bot ASF subversion and git services added a comment -

          Commit 1657487 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1657487 ]

          SOLR-6227: Avoid spurious failures of ChaosMonkeySafeLeaderTest by ensuring there's at least one jetty to kill

          shalinmangar Shalin Shekhar Mangar added a comment -

          More failures from jenkins:

          java.lang.AssertionError: The Monkey ran for over 20 seconds and no jetties were stopped - this is worth investigating!
          	at __randomizedtesting.SeedInfo.seed([3F5FA11431DFAF47:B70B9ECE9F23C2BF]:0)
          	at org.junit.Assert.fail(Assert.java:93)
          	at org.apache.solr.cloud.ChaosMonkey.stopTheMonkey(ChaosMonkey.java:537)
          	at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.test(ChaosMonkeySafeLeaderTest.java:137)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          

I did investigate: it happens only on nightly runs, and only when the shardCount is equal to the sliceCount. In that case each slice has just one replica and the ChaosMonkey has nothing to kill. I'll fix it by making sure that we create at least sliceCount + 1 jetties so that there's always at least one node to kill.
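The guard described above can be sketched as follows. This is a hypothetical illustration, not the committed patch: the class and method names are invented.

```java
// Hypothetical sketch of the jetty-count guard; names are invented for
// illustration and do not correspond to the actual Solr test code.
public class JettyCountSketch {

    // If shardCount == sliceCount, every slice has exactly one replica
    // (its leader), and a "safe leader" monkey has nothing legal to stop.
    // Bumping the total to at least sliceCount + 1 guarantees that some
    // slice gets a second, killable replica.
    static int safeShardCount(int sliceCount, int requestedShardCount) {
        return Math.max(requestedShardCount, sliceCount + 1);
    }

    public static void main(String[] args) {
        // The nightly collision described above: 3 slices, 3 jetties total.
        System.out.println(safeShardCount(3, 3)); // prints 4
        // Counts that already leave room are untouched.
        System.out.println(safeShardCount(3, 7)); // prints 7
    }
}
```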

          markrmiller@gmail.com Mark Miller added a comment -

Yeah, that failure has been around - I get it occasionally. It just means there was a fail when you wouldn't expect one - we expect 0 fails because we don't kill leaders - but a request can occasionally fail with org.apache.http.NoHttpResponseException: The target server failed to respond. It's not necessarily illegal behavior, it just shouldn't really happen.

Anyway, that's a much less concerning issue. As for the inconsistency, which is totally illegal and matches your report, I no longer see it in my local jenkins jobs. Fantastic. Always a scary fail.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I haven't seen the failure mentioned in the issue description but my jenkins found the following failure yesterday:

          java.lang.AssertionError: expected:<0> but was:<1>
          	at __randomizedtesting.SeedInfo.seed([2D7931A1F137DAA5:AC9FBFB98668BA99]:0)
          	at org.junit.Assert.fail(Assert.java:93)
          	at org.junit.Assert.failNotEquals(Assert.java:647)
          	at org.junit.Assert.assertEquals(Assert.java:128)
          	at org.junit.Assert.assertEquals(Assert.java:472)
          	at org.junit.Assert.assertEquals(Assert.java:456)
          	at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:141)
          	at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:863)
          
          markrmiller@gmail.com Mark Miller added a comment -

          I've had my jenkins running all week as well (with some CM specific jobs as well), just have not checked up on them yet. I'll look and report back soon.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I haven't seen this test fail ever since SOLR-6235 was committed. It is possible that the underlying issue was the same in both fails. My local jenkins is chugging along nicely but I haven't been able to reproduce this. I'll keep this open for a couple of days more and then close if I still can't reproduce the failure.

          markrmiller@gmail.com Mark Miller added a comment -

          When I start up my local jenkins machine again (it's been off while I've been traveling), I will get some better logs (I have the logging tuned differently for CM tests) and attach them.

          markrmiller@gmail.com Mark Miller added a comment -

I've seen it locally over that same period with no changes - probably 1 out of 10 or 1 out of 20 runs.

A thousand things could be broken or off; there's no way to know without digging into the logs. After this much time, it's often multiple things. I have not had a chance to dig in yet, but eventually I will if no one else does.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I can reproduce it locally by setting -Dsolr.tests.cloud.cm.runlength to a high value.

          markrmiller@gmail.com Mark Miller added a comment -

Yeah, this started months ago now, though it's become more frequent since I've been away the past couple of months.


            People

• Assignee: shalinmangar Shalin Shekhar Mangar
• Reporter: shalinmangar Shalin Shekhar Mangar
• Votes: 0
• Watchers: 8
