  1. Hadoop HDFS
  2. HDFS-7727

Check and verify the auto-fence settings to prevent failures of auto-failover

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1, 2.6.0, 2.5.1
    • Fix Version/s: None
    • Component/s: auto-failover

    Description

      ============================
      Problem
      -------------------------------------------------

      Currently, the auto-failover feature of HDFS only checks the settings of the parameter "dfs.ha.fencing.methods" but not the settings of the other "dfs.ha.fencing.*" parameters.

      In other words, the other "dfs.ha.fencing.*" parameters are not checked or verified at initialization; they are parsed and applied directly at runtime. Any configuration error in them prevents auto-failover.

      Because these values are only used when handling a failure (auto-failover), the errors go unnoticed until the active NameNode fails and triggers the fencing procedure of the auto-failover process.

      ============================
      Parameters
      -------------------------------------------------

      In SSHFence, there are two configuration parameters, both defined in SshFenceByTcpPort.java:
      "dfs.ha.fencing.ssh.connect-timeout"
      "dfs.ha.fencing.ssh.private-key-files"

      They are used in the tryFence() function for auto-fencing.

      Any erroneous setting of these two parameters results in uncaught exceptions that prevent fencing and, in turn, auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode: the passive NameNode could not take over.

      For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include ill-formatted integers and negative integers (the value is used for Thread.join()).
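
      The two failure modes above can be reproduced outside Hadoop. The following is a minimal sketch, not the code from the patch: the class ConnectTimeoutCheck and its methods are hypothetical names introduced only for illustration. It shows that an ill-formatted value fails at parse time, that Thread.join() rejects a negative timeout with an IllegalArgumentException, and how both cases could be turned into one early, descriptive error. Zero is rejected as well, since Thread.join(0) waits indefinitely.

```java
// Hypothetical standalone helper for illustration; not part of Hadoop.
final class ConnectTimeoutCheck {

  /** Parses a raw timeout value and rejects anything that is not a positive integer. */
  static int parseTimeout(String raw) {
    final int millis;
    try {
      millis = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("timeout is not an integer: " + raw, e);
    }
    if (millis <= 0) {
      throw new IllegalArgumentException("timeout must be positive: " + raw);
    }
    return millis;
  }

  /** Returns true if Thread.join() rejects the given timeout value. */
  static boolean joinRejects(long millis) {
    Thread t = new Thread(() -> { });
    t.start();
    try {
      t.join(millis);
      return false;
    } catch (IllegalArgumentException e) {
      return true;   // thrown for negative timeouts
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    for (String raw : new String[] {"30000", "30s", "-1"}) {
      try {
        System.out.println(raw + " -> ok, " + parseTimeout(raw) + " ms");
      } catch (IllegalArgumentException e) {
        System.out.println(raw + " -> rejected: " + e.getMessage());
      }
    }
    System.out.println("Thread.join(-1) rejected: " + joinRejects(-1));
  }
}
```

      Running the parse-and-check step at startup, rather than at the first fencing attempt, is what moves the error from failover time to initialization time.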

      For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a non-existent private-key file path or wrong file permissions, either of which makes jsch.addIdentity() fail in the createSession() method.
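
      The key-file condition can be checked with plain java.io.File calls before any SSH session is attempted. The sketch below is illustrative only: KeyFileCheck and findInvalidKeyFiles are hypothetical names, and the check covers only existence, file type, and readability by the current user, not whether the file actually contains a valid private key.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone helper for illustration; not part of Hadoop.
final class KeyFileCheck {

  /** Returns the configured paths that are not readable regular files. */
  static List<String> findInvalidKeyFiles(List<String> paths) {
    List<String> invalid = new ArrayList<>();
    for (String path : paths) {
      File keyFile = new File(path);
      if (!keyFile.isFile() || !keyFile.canRead()) {
        invalid.add(path);
      }
    }
    return invalid;
  }

  public static void main(String[] args) throws IOException {
    // A readable temporary file stands in for a correctly configured key.
    File good = Files.createTempFile("id_rsa", ".tmp").toFile();
    good.deleteOnExit();
    // A path that does not exist stands in for a mistyped key path.
    String missing = good.getPath() + ".missing";
    System.out.println("invalid key files: "
        + findInvalidKeyFiles(Arrays.asList(good.getPath(), missing)));
  }
}
```

      Collecting all invalid paths, instead of stopping at the first one, lets an operator fix every bad entry in a single pass.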

      The following is an example of the failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files" parameter.

      2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
      2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
      2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
      com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
              at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
              at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
              at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
              at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
              at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
              at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
              at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
              at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
              at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
              at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
              at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
      Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
              at java.io.FileInputStream.open(Native Method)
              at java.io.FileInputStream.<init>(FileInputStream.java:146)
              at java.io.FileInputStream.<init>(FileInputStream.java:101)
              at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
              ... 14 more
      

      ============================
      Solution (the patch)
      -------------------------------------------------

      Check the configuration settings in the checkArgs() function. Currently, checkArgs() only verifies the setting of the "dfs.ha.fencing.methods" parameter, not the settings of the other "dfs.ha.fencing.*" parameters.

      SshFenceByTcpPort.java
        /**
         * Verify that the argument, if given, in the conf is parseable.
         */
        @Override
        public void checkArgs(String argStr) throws BadFencingConfigurationException {
          if (argStr != null) {
            new Args(argStr);
          }
          // <= Insert the checkers here (see the attached patch)
        }
      

      The detailed patch is shown below.

      @@ -76,6 +77,23 @@
           if (argStr != null) {
             new Args(argStr);
           }
      +
      +    //The configuration could be empty (e.g., called from DFSHAAdmin)
      +    if(getConf().size() > 0) {
      +      //check ssh.connect-timeout
      +      if(getSshConnectTimeout() <= 0)
      +        throw new BadFencingConfigurationException(
      +            CONF_CONNECT_TIMEOUT_KEY +
      +            " property value should be positive and non-zero");
      +
      +      //check the settings of dfs.ha.fencing.ssh.private-key-files
      +      for (String keyFilePath : getKeyFiles()) {
      +        File keyFile = new File(keyFilePath);
      +        if(!keyFile.isFile() || !keyFile.canRead())
      +            throw new BadFencingConfigurationException(
      +                "The configured private key file is invalid: " + keyFilePath);
      +      }
      +    }
         }
       
         @Override
      

      Thanks!

          People

            tianyin Tianyin Xu