Hadoop HDFS
  HDFS-7727

Check and verify the auto-fence settings to prevent failures of auto-failover



      Currently, the auto-failover feature of HDFS validates only the "dfs.ha.fencing.methods" parameter, not the other "dfs.ha.fencing.*" parameters.

      The other "dfs.ha.fencing.*" settings are not checked at initialization; they are parsed and applied directly at runtime. Any configuration error in them prevents auto-failover.

      Because these values are used only when handling a failure, you won't notice the errors until the active NameNode fails and triggers the fencing procedure in the auto-failover process.


      In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java: "dfs.ha.fencing.ssh.connect-timeout" and "dfs.ha.fencing.ssh.private-key-files".

      They are used in the tryFence() function for auto-fencing.

      Any erroneous setting of these two parameters results in an uncaught exception that prevents fencing and auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode; the passive NameNode could not take over.

      For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include ill-formatted integers and negative integers (the value is used for Thread.join()).
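
      As a minimal sketch of these two failure modes, the standalone Java snippet below uses Integer.parseInt() as a stand-in for the configuration parsing (an assumption) and calls Thread.join() directly. The class and method names are hypothetical and are not part of Hadoop. An ill-formatted value fails with NumberFormatException, and a negative timeout makes Thread.join() throw IllegalArgumentException.

```java
// Hypothetical demo class; not part of Hadoop.
final class FenceTimeoutFailureDemo {

  /** Returns "ok" if raw parses as an int, else the exception's simple name. */
  static String parseOutcome(String raw) {
    try {
      Integer.parseInt(raw.trim());
      return "ok";
    } catch (NumberFormatException e) {
      return e.getClass().getSimpleName();
    }
  }

  /** Returns "ok" if Thread.join(timeoutMs) is accepted, else the exception's simple name. */
  static String joinOutcome(long timeoutMs) {
    Thread worker = new Thread(() -> { });  // finishes immediately
    worker.start();
    try {
      worker.join(timeoutMs);               // rejects negative timeouts
      return "ok";
    } catch (IllegalArgumentException e) {
      return e.getClass().getSimpleName();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    System.out.println("parse \"30000\": " + parseOutcome("30000"));
    System.out.println("parse \"30s\":   " + parseOutcome("30s"));
    System.out.println("join(30000):   " + joinOutcome(30000));
    System.out.println("join(-1):      " + joinOutcome(-1));
  }
}
```

      Both exceptions are unchecked, so nothing forces the caller to handle them at the point where the value is finally used.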

      For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include a non-existent private-key file path or wrong file permissions, either of which makes jsch.addIdentity() fail in the createSession() method.

      The following is an example of the failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files" parameter.

      2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
      2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
      2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
      com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
              at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
              at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
              at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
              at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
              at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
              at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
              at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
              at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
              at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
              at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
              at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
      Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
              at java.io.FileInputStream.open(Native Method)
              at java.io.FileInputStream.<init>(FileInputStream.java:146)
              at java.io.FileInputStream.<init>(FileInputStream.java:101)
              at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
              ... 14 more

      Solution (the patch)

      Check the configuration settings in the checkArgs() function. Currently, checkArgs() only validates the "dfs.ha.fencing.methods" parameter, not the other "dfs.ha.fencing.*" parameters.

        /**
         * Verify that the argument, if given, in the conf is parseable.
         */
        public void checkArgs(String argStr) throws BadFencingConfigurationException {
          if (argStr != null) {
            new Args(argStr);
          // <= Insert the checkers here (see the patch attached)

      The detailed patch is shown below.

      @@ -76,6 +77,23 @@
           if (argStr != null) {
             new Args(argStr);
      +    //The configuration could be empty (e.g., called from DFSHAAdmin)
      +    if(getConf().size() > 0) {
      +      //check ssh.connect-timeout
      +      if(getSshConnectTimeout() <= 0)
      +        throw new BadFencingConfigurationException(
      +            CONF_CONNECT_TIMEOUT_KEY +
      +            " property value should be positive and non-zero");
      +      //check the settings of dfs.ha.fencing.ssh.private-key-files
      +      for (String keyFilePath : getKeyFiles()) {
      +        File keyFile = new File(keyFilePath);
      +        if(!keyFile.isFile() || !keyFile.canRead())
      +            throw new BadFencingConfigurationException(
      +                "The configured private key file is invalid: " + keyFilePath);
      +      }
      +    }
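
      For readers who want to try the checks outside Hadoop, here is a self-contained sketch of the same two validations. It is only an illustration under stated assumptions: the class FencingConfigCheckSketch and its validate() method are hypothetical, the timeout is passed in as a raw string instead of being read through getConf(), and problems are collected into a list rather than thrown as BadFencingConfigurationException.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch; it mirrors the checks in the patch but is not Hadoop code.
final class FencingConfigCheckSketch {

  /**
   * Validates a raw connect-timeout value and zero or more private-key file paths.
   * Returns one human-readable message per problem found; empty means all checks passed.
   */
  static List<String> validate(String rawTimeout, String... keyFilePaths) {
    List<String> problems = new ArrayList<>();

    // The timeout must be a well-formed, strictly positive integer.
    try {
      if (Integer.parseInt(rawTimeout.trim()) <= 0) {
        problems.add("connect timeout must be positive and non-zero: " + rawTimeout);
      }
    } catch (NumberFormatException e) {
      problems.add("connect timeout is not a valid integer: " + rawTimeout);
    }

    // Each key file must exist as a regular file and be readable.
    for (String path : keyFilePaths) {
      File keyFile = new File(path);
      if (!keyFile.isFile() || !keyFile.canRead()) {
        problems.add("private key file is missing or unreadable: " + path);
      }
    }
    return problems;
  }

  public static void main(String[] args) {
    // Usage: java FencingConfigCheckSketch <timeout> [keyFile ...]
    String timeout = args.length > 0 ? args[0] : "30000";
    String[] keys = args.length > 1
        ? Arrays.copyOfRange(args, 1, args.length)
        : new String[0];
    List<String> problems = validate(timeout, keys);
    if (problems.isEmpty()) {
      System.out.println("fencing settings look valid");
    } else {
      problems.forEach(System.out::println);
    }
  }
}
```

      Collecting every problem instead of stopping at the first one is a design choice of this sketch, not of the patch; it lets an operator fix all misconfigurations in one pass.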





