Details
Bug
Status: Resolved
Major
Resolution: Fixed
2.5.1
None
Reviewed
Description
In ZKFailoverController.java, the Exception caught by the run() method is never logged. This leaves a latent problem that only manifests during failover.
The problem we encountered
An Exception is thrown from the doRun() method during initHM(), caused by a configuration error. To reproduce it, set "ha.health-monitor.connect-retry-interval.ms" to any nonsensical value.
    private int doRun(String[] args) ...
      initRPC();
      initHM();
      startRPC();
      ....
    }
The Exception is caught in the run() method, as follows,
    public int run(final String[] args) throws Exception {
      ...
      try {
        ...
          @Override
          public Integer run() {
            try {
              return doRun(args);
            } catch (Exception t) {
              throw new RuntimeException(t);
            } finally {
              if (elector != null) {
                elector.terminateConnection();
              }
            }
          }
        });
      } catch (RuntimeException rte) {
        throw (Exception)rte.getCause();
      }
    }
Unfortunately, the Exception (which causes the shutdown of the process) is not logged at all. The result is a latent error that only manifests during failover, because ZKFC is dead. The tricky thing here is that everything looks perfectly fine: the jps command shows a running DFSZKFailoverController process, and the two NameNodes (active and standby) work fine.
Patch
We strongly suggest adding an error log for the caught exception, for example:
--- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
+++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
         }
       });
     } catch (RuntimeException rte) {
+      LOG.fatal("The failover controller encounters runtime error: " + rte);
       throw (Exception)rte.getCause();
     }
   }
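For anyone who wants to try the pattern outside Hadoop, here is a minimal standalone sketch of the same log-then-rethrow idea. It is not the ZKFailoverController code: the class name LoggedRethrowSketch, the simplified doRun() that only parses an interval string, and the use of java.util.logging and a Supplier in place of Hadoop's LOG and the privileged action are all assumptions made for illustration. It also rethrows the RuntimeException itself when there is no Exception cause, which the patch above does not do.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only; names and structure are hypothetical.
class LoggedRethrowSketch {
    private static final Logger LOG =
        Logger.getLogger(LoggedRethrowSketch.class.getName());

    // Hypothetical stand-in for doRun(): fails on a nonsensical interval value,
    // similar in spirit to a configuration error surfacing during initHM().
    static int doRun(String retryIntervalMs) throws Exception {
        try {
            return Integer.parseInt(retryIntervalMs);
        } catch (NumberFormatException e) {
            throw new Exception("invalid retry interval: " + retryIntervalMs, e);
        }
    }

    // Mirrors the shape of run(): the inner action cannot throw checked
    // exceptions, so they are wrapped in a RuntimeException and unwrapped here.
    static int run(final String retryIntervalMs) throws Exception {
        try {
            Supplier<Integer> action = () -> {
                try {
                    return doRun(retryIntervalMs);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            };
            return action.get();
        } catch (RuntimeException rte) {
            // The point of the patch: record the failure before rethrowing.
            LOG.log(Level.SEVERE,
                "The failover controller encountered a runtime error", rte);
            Throwable cause = rte.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            // Assumption beyond the patch: no usable cause, so rethrow as is.
            throw rte;
        }
    }

    public static void main(String[] args) {
        try {
            run("not-a-number");
        } catch (Exception e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Passing the exception object to the logger, rather than concatenating it into the message, also records the stack trace, which should make a configuration error like the one described above easier to locate.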
Thanks!