In ZKFailoverController.java, the Exception caught in the run() method is never logged. This causes a latent problem that only manifests during failover.
An Exception is thrown from the doRun() method during initHM() (caused by a configuration error). To reproduce it, set
"ha.health-monitor.connect-retry-interval.ms" to any nonsensical value.
The Exception is caught in the run() method.
Unfortunately, the Exception (which causes the process to shut down) is not logged at all. This leaves a latent error that only manifests during failover (because the ZKFC is dead). The tricky part is that everything else looks perfectly fine: the two NameNodes (active and standby) keep working, so nothing points to the failed ZKFC until a failover is needed.
We strongly suggest adding an error log to report the caught Exception, such as:
--- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
+++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
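
To illustrate the kind of change being proposed, here is a minimal, self-contained sketch of the "log before rethrowing" pattern. It is not the actual ZKFailoverController code or the patch itself: the class name ZkfcRunSketch, the simplified run() and doRun() signatures, the log message, and the use of java.util.logging are all assumptions made for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the controller; not the real Hadoop class.
public class ZkfcRunSketch {
    private static final Logger LOG =
        Logger.getLogger(ZkfcRunSketch.class.getName());

    // Simplified stand-in for doRun(): parsing a nonsensical retry
    // interval fails, much like a bad configuration value would.
    static int doRun(String retryIntervalMs) {
        // Throws NumberFormatException if the value is not an integer.
        int interval = Integer.parseInt(retryIntervalMs);
        return interval > 0 ? 0 : 1;
    }

    // Simplified stand-in for run(): the exception is still propagated,
    // but it is logged first so the cause of the shutdown is visible.
    static int run(String retryIntervalMs) {
        try {
            return doRun(retryIntervalMs);
        } catch (RuntimeException e) {
            // Illustrative message; the wording is an assumption.
            LOG.log(Level.SEVERE, "Failover controller failed to start", e);
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            run("not-a-number");
        } catch (RuntimeException e) {
            // The process still exits, but the log now records why.
            System.exit(1);
        }
    }
}
```

With this pattern the process still terminates on a configuration error, but the stack trace appears in the log at the time of the failure rather than being discovered only when a failover is attempted.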