For problem #1, the solution is the same as is already done in some other test cases. We just need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a hack, but in the absense of a fix for ZOOKEEPER-1438, it's about all we can do.
I spent some time investigating problem #2. The bug is as follows:
- these test cases create a new ActiveStandbyElector, and call ActiveStandbyElector.ensureBaseNode() on it before running the main body of the tests. Although they don't call joinElection(), the creation of the elector does create a zkClient object with an associated Watcher.
- in the testZookeeperFailure test case, we shut down and restart ZK. This causes the above Watcher instance to fire its Disconnected and then Connected events. There was a bug in the handling of the Connected event that would cause it to re-monitor the lock znode regardless of whether it was previously in the election.
- So, when ZK comes back up, there was not two but three electors racing for the lock. However, two of the electors actually corresponded to the same dummy service. In some cases this race would be resolved in such a way that the test timed out.
I don't think this is a problem in practice, since the "formatZK" call runs in its own JVM in the current code. However, it's worth fixing to get the tests to not be flaky, and to have a more reasonable behavior. There are several fixes to be done:
- Add extra asserts for wantToBeInElection to catch cases where we might accidentally re-join the election when we weren't supposed to be in it.
- Fix the handling of the "Connected" event to only re-join if the elector wants to be in the election
- Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors
Will post a patch momentarily.