In our automated tests, we are seeing intermittent failures in TestZKFailoverController. I have been unable to reproduce the failures locally, but in examining the code, I found a difference that may explain the failures.
HDFS-6440 ( Support more than 2 NameNodes. Contributed by Jesse Yates.) was checked in before HADOOP-11149. TestZKFailoverController times out), which changed the test added in HDFS-6440.
In branch-2, the order was reversed, and the test that was added in
HDFS-6440 does not retain the fixes from HADOOP-11149.
Note that there was also a change from
HDFS-10985. (o.a.h.ha.TestZKFailoverController should not use fixed time sleep before assertions.) that was missed in the HDFS-6440 backport.
My proposal is to restore the changes from
HADOOP-11149. I made this change internally and it seems to have fixed the intermittent failures.