Solr / SOLR-7291

ChaosMonkey should create more mayhem with ZK availability

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Some things ChaosMonkey and its users can do:

      • It should stop, pause and restart ZK once in a while
      • Tests using CM should check whether indexing stalls when this happens
      1. SOLR-7291.patch
        31 kB
        Ramkumar Aiyengar
      2. SOLR-7291.patch
        20 kB
        Ramkumar Aiyengar

        Activity

        Ramkumar Aiyengar added a comment -

        Initial patch. This adds a ZK stop/pause/restart, while restricting the socket timeout (soTimeout) for the cloudClient to a value less than the pause. If indexing stalls, this should trigger socket timeouts and fail indexing.. (or so I think..)

        Ideally I would like to move this to chaosMonkey.. I am for some reason not able to reuse the zkServer object to stop/restart ZK, which would make this a lot simpler. Not sure why just

        zkServer.shutdown();
        zkServer.run();

        doesn't work..

        Also, did s/Stopable/Stoppable/ for a few classes..
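The socket-timeout idea in this comment can be illustrated with plain JDK sockets. This is only a minimal sketch under stated assumptions, not the patch itself: StalledServerDemo and readTimesOut are hypothetical names, and the "paused" service is simulated by a listening socket that never replies. No Solr or ZooKeeper API is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical stand-in for illustration; nothing here is Solr or ZooKeeper API.
class StalledServerDemo {

    /**
     * Connects to a server that completes the TCP handshake but never replies
     * (simulating a paused service) and reports whether a blocking read gave up
     * with a SocketTimeoutException after soTimeoutMs milliseconds.
     */
    static boolean readTimesOut(int soTimeoutMs) {
        // The listening socket queues the connection in its backlog, but
        // nothing ever calls accept() or writes a response.
        try (ServerSocket paused = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(paused.getInetAddress(), paused.getLocalPort())) {
            client.setSoTimeout(soTimeoutMs); // must be shorter than the pause
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks because the "paused" server sends nothing
                return false; // data or end-of-stream arrived: no stall detected
            } catch (SocketTimeoutException expected) {
                return true; // the stall surfaced as a timeout, as intended
            }
        } catch (IOException e) {
            throw new IllegalStateException("unexpected I/O failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The point of the sketch is the ordering constraint: the read timeout has to be shorter than the pause, otherwise the client simply waits the pause out and the stall goes unnoticed.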

        Ramkumar Aiyengar added a comment -

        Does anyone know the intended difference between ChaosMonkeyNothingIsSafeTest and ChaosMonkeySafeLeaderTest? Is it just that in one case the leaders can be killed, while the leaders are safe in the other? If that's the case, the tests seem to have subtly diverged. In any case, given that leaders are only randomly killed, wouldn't the first be a superset of the other?

        Mark Miller added a comment -

        There are various differences between them. I think only one of them uses ConcurrentUpdateSolrClient, for example.

        The basic idea is that one will never kill leaders and one will. It's not really intended that they stay in sync.

        One is focused on finding and highlighting issues with recovery that should not involve leader election.

        The other one is for more full, anything goes stuff.

        Mark Miller added a comment -

        "In any case, given that leaders are only randomly killed, wouldn't the first be a superset of the other?"

        One superset test makes it very hard to debug and fix issues. I actually run variations of both of these tests on my Jenkins when hunting down failures so that I can narrow down what behavior things fail under. I'd have a lot more of them focused on more subsets, but even these two get so little time that it's just not worth it yet. Trying to separate out leader election at the high level has proved very helpful so far, though.

        Anyway, when the safe leader test fails and the leader kill test is not failing, you can bet you get to just focus on the recovery-from-leader path. When leaders go down in these tests, it's also often hard to catch an issue, as the leader sync sequence can repair and hide problems.

        The failures can be so infrequent that to hunt them you need a test beasting script and/or a Jenkins running just the ChaosMonkey tests. I run them in a few variations, nightly and regular. When my local Jenkins machine is up and running, that is.
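For readers unfamiliar with "beasting", the idea is simply to rerun one test many times until a rare failure shows up. The following is a minimal in-process sketch of that idea, assuming a hypothetical Beaster class. It is not the script referred to in this comment, which is not shown in the thread and may well work differently (for example by launching separate JVMs).

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not the beasting script from the thread.
class Beaster {

    /** Runs the test body the given number of times and returns how many runs failed. */
    static int beast(int iterations, Callable<Void> testBody) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                testBody.call();
            } catch (Exception | AssertionError e) {
                // Keep going after a failure so the failure rate can be measured.
                failures++;
                System.err.println("run " + i + " failed: " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Simulated flaky test body: fails on every 50th invocation.
        int[] counter = {0};
        int failures = beast(200, () -> {
            if (++counter[0] % 50 == 0) {
                throw new AssertionError("simulated rare failure");
            }
            return null;
        });
        System.out.println(failures + " of 200 runs failed");
    }
}
```

Counting failures rather than stopping at the first one makes it easier to compare failure rates between test variations, which is the narrowing-down approach described above.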

        Ramkumar Aiyengar added a comment -

        Thanks Mark, that explains why the two tests should run in the same test run. What that also means is that it should be possible to reuse the parts which should be the same.. From what you describe, there are two aspects to this test:

        • What gets done (indexing, searching, "full throttle" operations using CUSS, i.e. ConcurrentUpdateSolrServer)
        • What gets monkeyed with (slaves, leaders, ZK..)

        You need to pick a subset of stuff from each line and run the test. Currently we do two configurations, but there could be more if we need them..

        I will see if I can change the code to reflect that. Currently there are differences in how the tests are set up outside these params, which seems unintended.
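The two axes described in this comment could be modelled roughly as follows. This is a sketch only: ChaosConfigSketch, the enum constants and the two factory methods are hypothetical names, and the workloads assigned to each configuration are assumptions made for illustration rather than what the actual tests do.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the two axes; none of these names come from the Solr code base.
class ChaosConfigSketch {

    /** What gets done during the test. */
    enum Workload { INDEXING, SEARCHING, FULL_THROTTLE }

    /** What gets monkeyed with (NON_LEADERS is what the comment calls slaves). */
    enum Target { NON_LEADERS, LEADERS, ZK }

    static final class Config {
        final Set<Workload> workloads;
        final Set<Target> targets;

        Config(Set<Workload> workloads, Set<Target> targets) {
            // Defensive copies so a configuration cannot change after creation.
            this.workloads = EnumSet.copyOf(workloads);
            this.targets = EnumSet.copyOf(targets);
        }

        boolean leadersAreSafe() {
            return !targets.contains(Target.LEADERS);
        }
    }

    // Assumed subset: leaders are never touched.
    static Config safeLeader() {
        return new Config(EnumSet.of(Workload.INDEXING),
                          EnumSet.of(Target.NON_LEADERS, Target.ZK));
    }

    // Assumed subset: everything is fair game.
    static Config nothingIsSafe() {
        return new Config(EnumSet.allOf(Workload.class),
                          EnumSet.allOf(Target.class));
    }

    public static void main(String[] args) {
        System.out.println("safeLeader keeps leaders safe: " + safeLeader().leadersAreSafe());
        System.out.println("nothingIsSafe keeps leaders safe: " + nothingIsSafe().leadersAreSafe());
    }
}
```

With a shape like this, a further test variation would be one more factory method picking a different subset from each enum, while the shared setup stays in one place.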

        ASF subversion and git services added a comment -

        Commit 1669026 from Ramkumar Aiyengar in branch 'dev/trunk'
        [ https://svn.apache.org/r1669026 ]

        SOLR-7291: Test indexing on ZK disconnect with ChaosMonkey tests

        ASF subversion and git services added a comment -

        Commit 1669687 from Ramkumar Aiyengar in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1669687 ]

        SOLR-7291: Test indexing on ZK disconnect with ChaosMonkey tests

        Shalin Shekhar Mangar added a comment -

        This went into 5.1, right? Can we close this? Also, is this the reason why we're seeing more chaos monkey failures these days?

        Ramkumar Aiyengar added a comment -

        I think so. I was looking to push the ZK disconnect into the ChaosMonkey class itself as a part of its loop after the current failures had reduced, but that can go in as a separate issue.

        And no, I don't think this is the reason, because the ZK disconnect happens outside the chaos loop, after the 'we expect some failures' check is done. I did have to reduce the socket timeout and that could impact things, but again, the frequency of failures hasn't increased much as far as I can see after the change. The main increase was before..

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Ramkumar Aiyengar
          • Reporter:
            Ramkumar Aiyengar
          • Votes:
            0
          • Watchers:
            5
