  1. Hadoop HDFS
  2. HDFS-12132

Both NameNodes become Standby because of a ZKFC exception

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: auto-failover
    • Labels: None

    Description

      During a rolling upgrade from Hadoop-2.6.5 to Hadoop-2.8.0, the Active NameNode became Standby because of a ZKFC exception, while the Standby NameNode remained Standby. This left HDFS unavailable. The exception handling logic in ZKFC seems to be problematic: ZKFC should guarantee that there is always an active NameNode.

      Before the upgrade, the cluster was deployed with HA: NN1 was active and NN2 was standby.
      The configuration before the upgrade was as follows:

      dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
      dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
      

      After the upgrade, the configuration adds separate service RPC and lifeline RPC addresses:

      dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
      dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
      dfs.namenode.servicerpc-address.nameservice.nn1 nn1:8021
      dfs.namenode.servicerpc-address.nameservice.nn2 nn2:8021
      dfs.namenode.lifeline.rpc-address.nameservice.nn1 nn1:8022
      dfs.namenode.lifeline.rpc-address.nameservice.nn2 nn2:8022
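
To illustrate why the new keys matter for ZKFC, here is a minimal sketch of the address selection, assuming ZKFC prefers the service RPC address when one is configured and falls back to the client RPC address otherwise. The class `NameNodeAddressSketch` and the helper `resolveServiceAddress` are hypothetical names for illustration, and a plain `Map` stands in for the Hadoop configuration; this is not the Hadoop source.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model, not Hadoop code: picks the address a failover controller
// would use to reach a NameNode, preferring the service RPC address.
class NameNodeAddressSketch {
    static final String RPC_KEY = "dfs.namenode.rpc-address";
    static final String SERVICE_RPC_KEY = "dfs.namenode.servicerpc-address";

    // Hypothetical helper: returns the servicerpc-address if set, else the rpc-address.
    static String resolveServiceAddress(Map<String, String> conf, String nsId, String nnId) {
        String suffix = "." + nsId + "." + nnId;
        String service = conf.get(SERVICE_RPC_KEY + suffix);
        return service != null ? service : conf.get(RPC_KEY + suffix);
    }

    public static void main(String[] args) {
        // Configuration before the upgrade: only the client RPC address is set.
        Map<String, String> before = new HashMap<>();
        before.put("dfs.namenode.rpc-address.nameservice.nn2", "nn2:8020");

        // Configuration after the upgrade: a separate service RPC address is added.
        Map<String, String> after = new HashMap<>(before);
        after.put("dfs.namenode.servicerpc-address.nameservice.nn2", "nn2:8021");

        System.out.println(resolveServiceAddress(before, "nameservice", "nn2")); // nn2:8020
        System.out.println(resolveServiceAddress(after, "nameservice", "nn2"));  // nn2:8021
    }
}
```

Under this assumption, the same NameNode resolves to port 8020 with the old configuration and to port 8021 with the new one, which matches the two ports that appear in the exception message.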
      

      The upgrade steps are as follows:
      1. Upgrade NN2: restart the NameNode process on NN2
      2. Upgrade NN1: restart the NameNode process on NN1, after which NN2 becomes active and NN1 becomes standby
      3. Restart the ZKFC on both NN1 and NN2

      After the ZKFCs were restarted, the Active NameNode became Standby and the Standby NameNode did not become Active. Both ZKFCs kept looping through the following steps and threw the same exception many times:

      createLockNodeAsync()  // create lock successfully
      becomeActive()  // return false
      terminateConnection()  // delete EPHEMERAL znode of 'ActiveStandbyElectorLock'  
      sleepFor(sleepTime)
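
The loop above can be sketched as a small in-memory model. It assumes that `becomeActive()` returns false whenever fencing the old active fails, and that the winner then releases the lock. `ElectionLoopSketch`, `LockNode`, and `runElection` are hypothetical names; this is an illustration of the reported behaviour, not the `ActiveStandbyElector` source.

```java
// Simplified model of the reported loop; not the ActiveStandbyElector source.
class ElectionLoopSketch {
    // Minimal stand-in for the ephemeral 'ActiveStandbyElectorLock' znode.
    static class LockNode {
        private String owner; // null means the lock znode does not exist

        boolean create(String who) {
            if (owner != null) return false;
            owner = who;
            return true;
        }

        void delete(String who) {
            if (who.equals(owner)) owner = null;
        }
    }

    // Returns the controller that became active, or null if none did
    // within the given number of rounds.
    static String runElection(String[] controllers, boolean fencingSucceeds, int rounds) {
        LockNode lock = new LockNode();
        for (int round = 0; round < rounds; round++) {
            for (String zkfc : controllers) {
                if (!lock.create(zkfc)) continue; // createLockNodeAsync(): lost the race
                if (fencingSucceeds) return zkfc; // becomeActive() returned true
                lock.delete(zkfc);                // terminateConnection(): lock znode removed
                // sleepFor(sleepTime) is omitted in this model
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] zkfcs = {"zkfc-nn1", "zkfc-nn2"};
        System.out.println(runElection(zkfcs, false, 5)); // null
        System.out.println(runElection(zkfcs, true, 5));  // zkfc-nn1
    }
}
```

In this model, as long as fencing keeps failing for every controller, each one wins the lock, gives it up, and retries, so no NameNode is ever made active.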
      

      After running the command 'hdfs zkfc -formatZK', ZKFC returned to normal.
      The ZKFC exception log is:

      2017-07-11 18:49:44,311 WARN [main-EventThread] org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
      java.lang.RuntimeException: Mismatched address stored in ZK for NameNode at nn2/xx.xxx.xx.xxx:8022: Stored protobuf was nameserviceId: "nameservice"
      namenodeId: "nn2"
      hostname: "nn2_hostname"
      port: 8020
      zkfcPort: 8019
      , address from our own configuration for this NameNode was nn2_hostname/xx.xxx.xx.xxx:8021
              at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.dataToTarget(DFSZKFailoverController.java:87)
              at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:506)
              at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
              at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:895)
              at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:985)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:882)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:467)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      2017-07-11 18:49:44,311 INFO [main-EventThread] org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
      2017-07-11 18:49:44,311 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper: Session: 0x15c3ada0ec319aa closed
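
Judging from the exception message, the check that fails compares the address stored in ZK before the upgrade (port 8020) with the address derived from the new configuration (port 8021). A minimal sketch of such a check follows. `StoredTargetCheckSketch` and `checkStoredAddress` are hypothetical names, and the comparison is an assumption inferred from the log, not a copy of `DFSZKFailoverController.dataToTarget`.

```java
// Simplified model of the address check implied by the log; not Hadoop code.
class StoredTargetCheckSketch {
    // Hypothetical check: throws if the address recorded in ZK differs from
    // the address derived from the local configuration.
    static void checkStoredAddress(String storedHost, int storedPort,
                                   String configuredHost, int configuredPort) {
        if (!storedHost.equals(configuredHost) || storedPort != configuredPort) {
            throw new RuntimeException("Mismatched address stored in ZK: stored "
                    + storedHost + ":" + storedPort + ", configured "
                    + configuredHost + ":" + configuredPort);
        }
    }

    public static void main(String[] args) {
        try {
            // Stored data was written with the old configuration (port 8020),
            // while the new configuration resolves to the service RPC port (8021).
            checkStoredAddress("nn2_hostname", 8020, "nn2_hostname", 8021);
            System.out.println("addresses match");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This would also be consistent with 'hdfs zkfc -formatZK' resolving the problem, since formatting clears the stale data from ZK so that it is rewritten using the new address.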
      

          People

            Assignee: Unassigned
            Reporter: Jiandan Yang (yangjiandan)
            Votes: 0
            Watchers: 8