Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-8721

ZKFC should not retry 45 times when attempting a graceful fence during a failover

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 2.0.0-alpha
    • 2.0.2-alpha
    • auto-failover, ha
    • None

    Description

      Scenario:
      Active NN on machine1
      Standby NN on machine2
      Machine1 is isolated from the network (machine1 network cable unplugged)
      After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there.
      ZKFC tries to failover NN2 as active.
      As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured)
      This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries)
      Also after that standby NN is not able to take over as active (because of fencing failure).
      Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network

      From ZKFC log:

      2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
      2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
      2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
      2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
      2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
      2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
      2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
      2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
      2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
      2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
      

      Attachments

        1. HDFS-3561.patch
          3 kB
          Vinayakumar B
        2. HDFS-3561-2.patch
          3 kB
          Vinayakumar B
        3. HDFS-3561-3.patch
          3 kB
          Vinayakumar B

        Issue Links

          Activity

            People

              vinayakumarb Vinayakumar B
              suja suja s
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: