  1. Hadoop HDFS
  2. HDFS-13760

Improve ZKFC fencing action when the ZKFC host's network is interrupted

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ha
    • Labels:
      None

      Description

      When the host running the Active NameNode and its ZKFC suffers a prolonged network fault, HDFS becomes unavailable: the ZKFC on the Standby NameNode host can never complete SSH fencing, because it cannot SSH to the Active NameNode host. In this situation a client cannot connect to the Active NameNode, so it fails over to the Standby, which still cannot serve READ/WRITE requests.

      2018-07-23 15:57:10,836 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 40 time(s); maxRetries=45
      2018-07-23 15:57:30,856 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 41 time(s); maxRetries=45
      2018-07-23 15:57:50,872 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 42 time(s); maxRetries=45
      2018-07-23 15:58:10,892 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 43 time(s); maxRetries=45
      2018-07-23 15:58:30,912 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 44 time(s); maxRetries=45
      2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ZKFailoverController: get old active state exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/ip:port remote=hostname]
      2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: old active is not healthy. need to create znode
      2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: Elector callbacks for NameNode at standbynn start create node, now time: 45179010079342817
      2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: CreateNode result: 0 code:OK for path: /hadoop-ha/ns/ActiveStandbyElectorLock connectionState: CONNECTED  for elector id=469098346 appData=0a07727a2d6e6e313312046e6e31331a1f727a2d646174612d6864702d6e6e31332e727a2e73616e6b7561692e636f6d20e83e28d33e cb=Elector callbacks for NameNode at standbynamenode
      2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
      2018-07-23 15:58:50,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a07727a2d6e6e313312046e6e31341a1f727a2d646174612d6864702d6e6e31342e727a2e73616e6b7561692e636f6d20e83e28d33e
      2018-07-23 15:58:50,939 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at activenamenode
      2018-07-23 15:59:10,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: activenamenode. Already tried 0 time(s); maxRetries=1
      2018-07-23 15:59:30,980 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at activenamenode standby (unable to connect)
      org.apache.hadoop.net.ConnectTimeoutException: Call From standbynamenode to activenamenode failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=ip:port remote=activenamenode]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
      

      I propose that when the Active NameNode host suffers a network fault, its local ZKFC force that NameNode to become Standby, so that the other ZKFC can hold the election ZNode and transition the other NameNode to Active even if SSH fencing fails.
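
      The proposed behavior can be sketched as a small simulation. This is only an illustration of the idea, not Hadoop code: the `NameNode` class, `self_demote_on_isolation`, `ssh_fence`, `try_failover`, and the `proceed_if_fence_fails` flag are all hypothetical names, and the sketch assumes the isolated ZKFC can still reach its local NameNode. Note that the remote ZKFC cannot observe the self-demotion, so proceeding after a failed fence is only safe if that local step is reliable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NameNode:
    """Toy model of a NameNode (hypothetical, not a Hadoop class)."""
    name: str
    state: str = "standby"   # "active" or "standby"
    reachable: bool = True   # False models the network fault on that host


def self_demote_on_isolation(nn: NameNode, zk_session_alive: bool) -> bool:
    """Proposed local step: a ZKFC that lost its ZooKeeper session
    forces its own (local) NameNode to standby. Returns True if it demoted."""
    if nn.state == "active" and not zk_session_alive:
        nn.state = "standby"
        return True
    return False


def ssh_fence(target: NameNode) -> bool:
    """Stand-in for SSH fencing: it can only succeed if the host is reachable."""
    if not target.reachable:
        return False
    target.state = "standby"
    return True


def try_failover(old_active: NameNode,
                 candidate: NameNode,
                 fence_methods: List[Callable[[NameNode], bool]],
                 proceed_if_fence_fails: bool) -> bool:
    """Try each fencing method in order; promote the candidate on success.
    With proceed_if_fence_fails=True (the proposal), promote anyway,
    relying on the old active having demoted itself."""
    for fence in fence_methods:
        if fence(old_active):
            candidate.state = "active"
            return True
    if proceed_if_fence_fails:
        candidate.state = "active"
        return True
    return False


if __name__ == "__main__":
    nn14 = NameNode("nn14", state="active", reachable=False)  # isolated host
    nn13 = NameNode("nn13")

    # Current behavior: fencing fails, so nn13 stays standby.
    print(try_failover(nn14, nn13, [ssh_fence], proceed_if_fence_fails=False))

    # Proposed behavior: nn14 demotes itself, then nn13 may become active.
    self_demote_on_isolation(nn14, zk_session_alive=False)
    print(try_failover(nn14, nn13, [ssh_fence], proceed_if_fence_fails=True))
    print(nn14.state, nn13.state)
```

      The first call models the situation in the log above, where the only fencing method fails and the Standby is never promoted. The second models the proposal, which trades the fencing guarantee for availability and therefore depends on the self-demotion step to avoid two Active NameNodes.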

      No patch is available yet, and suggestions are very welcome.

            People

            • Assignee:
              Unassigned
              Reporter:
              hexiaoqiao Xiaoqiao He
            • Votes:
              0
              Watchers:
              3
