Hadoop Common
  1. Hadoop Common
  2. HADOOP-8245

Fix flakiness in TestZKFailoverController

    Details

      Description

      When I loop TestZKFailoverController, I occasionally see two types of failures:
      1) the ZK JMXEnv issue (ZOOKEEPER-1438)
      2) TestZKFailoverController.testZooKeeperFailure fails with a timeout

      This JIRA is for fixes for these issues.

      1. hadoop-8245.txt
        21 kB
        Todd Lipcon

        Activity

        Hide
        Todd Lipcon added a comment -

        For problem #1, the solution is the same as is already done in some other test cases. We just need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a hack, but in the absense of a fix for ZOOKEEPER-1438, it's about all we can do.

        I spent some time investigating problem #2. The bug is as follows:

        • these test cases create a new ActiveStandbyElector, and call ActiveStandbyElector.ensureBaseNode() on it before running the main body of the tests. Although they don't call joinElection(), the creation of the elector does create a zkClient object with an associated Watcher.
        • in the testZookeeperFailure test case, we shut down and restart ZK. This causes the above Watcher instance to fire its Disconnected and then Connected events. There was a bug in the handling of the Connected event that would cause it to re-monitor the lock znode regardless of whether it was previously in the election.
        • So, when ZK comes back up, there was not two but three electors racing for the lock. However, two of the electors actually corresponded to the same dummy service. In some cases this race would be resolved in such a way that the test timed out.

        I don't think this is a problem in practice, since the "formatZK" call runs in its own JVM in the current code. However, it's worth fixing to get the tests to not be flaky, and to have a more reasonable behavior. There are several fixes to be done:

        • Add extra asserts for wantToBeInElection to catch cases where we might accidentally re-join the election when we weren't supposed to be in it.
        • Fix the handling of the "Connected" event to only re-join if the elector wants to be in the election
        • Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors

        Will post a patch momentarily.

        Show
        Todd Lipcon added a comment - For problem #1, the solution is the same as is already done in some other test cases. We just need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a hack, but in the absense of a fix for ZOOKEEPER-1438 , it's about all we can do. I spent some time investigating problem #2. The bug is as follows: these test cases create a new ActiveStandbyElector, and call ActiveStandbyElector.ensureBaseNode() on it before running the main body of the tests. Although they don't call joinElection() , the creation of the elector does create a zkClient object with an associated Watcher. in the testZookeeperFailure test case, we shut down and restart ZK. This causes the above Watcher instance to fire its Disconnected and then Connected events. There was a bug in the handling of the Connected event that would cause it to re-monitor the lock znode regardless of whether it was previously in the election. So, when ZK comes back up, there was not two but three electors racing for the lock. However, two of the electors actually corresponded to the same dummy service. In some cases this race would be resolved in such a way that the test timed out. I don't think this is a problem in practice, since the "formatZK" call runs in its own JVM in the current code. However, it's worth fixing to get the tests to not be flaky, and to have a more reasonable behavior. There are several fixes to be done: Add extra asserts for wantToBeInElection to catch cases where we might accidentally re-join the election when we weren't supposed to be in it. Fix the handling of the "Connected" event to only re-join if the elector wants to be in the election Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors Will post a patch momentarily.
        Hide
        Todd Lipcon added a comment -

        Attached patch fixes the flaky behavior for me. I looped TestZKFailoverController for 30+ minutes and didn't see failures after applying this.

        Show
        Todd Lipcon added a comment - Attached patch fixes the flaky behavior for me. I looped TestZKFailoverController for 30+ minutes and didn't see failures after applying this.
        Hide
        Aaron T. Myers added a comment -

        +1, the patch looks good to me. Good finds/fixes, Todd.

        Show
        Aaron T. Myers added a comment - +1, the patch looks good to me. Good finds/fixes, Todd.
        Hide
        Todd Lipcon added a comment -

        Committed to branch, thanks for reviewing. I verified all the HA tests passed on branch before committing.

        Show
        Todd Lipcon added a comment - Committed to branch, thanks for reviewing. I verified all the HA tests passed on branch before committing.

          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development