Details
-------------------------------------------------
Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.1, 2.5.1, 2.6.0
Fix Version/s: None
Description
============================
Problem
-------------------------------------------------
Currently, the auto-failover feature of HDFS checks only the settings of the parameter "dfs.ha.fencing.methods", not the settings of the other "dfs.ha.fencing.*" parameters.
The values of these other "dfs.ha.fencing.*" parameters are not checked or verified at initialization; they are parsed and applied directly at runtime, so any configuration error in them prevents auto-failover.
Because these values are only used during failure handling (auto-failover), the errors go unnoticed until the active NameNode fails and the fencing procedure of the auto-failover process is triggered.
============================
Parameters
-------------------------------------------------
In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java:
"dfs.ha.fencing.ssh.connect-timeout"
"dfs.ha.fencing.ssh.private-key-files"
Both are used in the tryFence() function for auto-fencing.
An erroneous setting of either parameter results in an uncaught exception that prevents fencing and therefore auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode: the passive NameNode cannot take over.
For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include ill-formatted integers and negative integers (the value is passed to Thread.join(), which rejects negative timeouts).
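As a minimal standalone illustration (not Hadoop code), Thread.join(long) throws an unchecked IllegalArgumentException when given a negative timeout, which is exactly how a negative connect-timeout value surfaces at fence time instead of at startup:

```java
// Minimal sketch: Thread.join(long) rejects negative timeouts with an
// unchecked IllegalArgumentException. A negative
// dfs.ha.fencing.ssh.connect-timeout therefore only fails when fencing runs.
public class NegativeJoinDemo {
    // Returns true if the join completed, false if the timeout was rejected.
    static boolean joinWithTimeout(long timeoutMs) throws InterruptedException {
        Thread t = new Thread(() -> {});
        t.start();
        try {
            t.join(timeoutMs);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for any negative timeout value
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("join(-1000) ok? " + joinWithTimeout(-1000L));
        System.out.println("join(10) ok?    " + joinWithTimeout(10L));
    }
}
```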
For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a non-existent private-key file path or wrong file permissions; either causes jsch.addIdentity() to fail in the createSession() method.
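A startup-time sanity check on the key file path, sketched below with illustrative names (this is not the actual Hadoop API), is enough to catch both failure modes before they ever reach jsch.addIdentity():

```java
import java.io.File;

// Illustrative sketch of a private-key file check; class and method names
// are hypothetical, not part of Hadoop.
public class KeyFileCheck {
    static boolean isUsableKeyFile(String path) {
        File f = new File(path);
        // Must be a regular, readable file; otherwise jsch.addIdentity()
        // would later fail (e.g., FileNotFoundException) at fence time.
        return f.isFile() && f.canRead();
    }

    public static void main(String[] args) {
        System.out.println(isUsableKeyFile("/home/hadoop/.ssh/id_rsax"));
    }
}
```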
The following log gives one example of the failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files" parameter.
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
	at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
	at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
	at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
	at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
	at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
	at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
	... 14 more
============================
Solution (the patch)
-------------------------------------------------
Check the configuration settings in the checkArgs() function. Currently, checkArgs() only checks the settings of the parameter "dfs.ha.fencing.methods", not those of the other "dfs.ha.fencing.*" parameters.
/**
 * Verify that the argument, if given, in the conf is parseable.
 */
@Override
public void checkArgs(String argStr) throws BadFencingConfigurationException {
  if (argStr != null) {
    new Args(argStr);
  }
  // <= Insert the checkers here (see the patch attached)
}
The detailed patch is shown below.
@@ -76,6 +77,23 @@
     if (argStr != null) {
       new Args(argStr);
     }
+
+    // The configuration could be empty (e.g., called from DFSHAAdmin)
+    if (getConf().size() > 0) {
+      // check ssh.connect-timeout
+      if (getSshConnectTimeout() <= 0)
+        throw new BadFencingConfigurationException(
+            CONF_CONNECT_TIMEOUT_KEY +
+            " property value should be positive and non-zero");
+
+      // check the settings of dfs.ha.fencing.ssh.private-key-files
+      for (String keyFilePath : getKeyFiles()) {
+        File keyFile = new File(keyFilePath);
+        if (!keyFile.isFile() || !keyFile.canRead())
+          throw new BadFencingConfigurationException(
+              "The configured private key file is invalid: " + keyFilePath);
+      }
+    }
   }

   @Override
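The two checks in the patch can be exercised outside of Hadoop with the following self-contained sketch; the class and exception below are re-declared for illustration only and are not the real Hadoop types:

```java
import java.io.File;

// Stand-in for org.apache.hadoop.ha.BadFencingConfigurationException,
// re-declared here so the sketch compiles on its own.
class BadFencingConfigurationException extends Exception {
    BadFencingConfigurationException(String msg) { super(msg); }
}

// Standalone version of the validation logic from the patch; in Hadoop the
// real checks live in SshFenceByTcpPort.checkArgs().
public class FencingConfCheck {
    static void check(long sshConnectTimeoutMs, String[] keyFilePaths)
            throws BadFencingConfigurationException {
        // Reject non-positive timeouts before they ever reach Thread.join()
        if (sshConnectTimeoutMs <= 0)
            throw new BadFencingConfigurationException(
                "dfs.ha.fencing.ssh.connect-timeout should be positive and non-zero");
        // Reject missing or unreadable private-key files before jsch.addIdentity()
        for (String path : keyFilePaths) {
            File f = new File(path);
            if (!f.isFile() || !f.canRead())
                throw new BadFencingConfigurationException(
                    "The configured private key file is invalid: " + path);
        }
    }

    public static void main(String[] args) throws BadFencingConfigurationException {
        check(30000L, new String[0]);
        System.out.println("valid configuration passes the check");
    }
}
```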
Thanks!