Hadoop HDFS / HDFS-4858

HDFS DataNode to NameNode RPC should timeout


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.4.0
    • Component/s: datanode
    • Labels: None
    • Environment: Redhat/CentOS 6.4 64 bit Linux
    • Hadoop Flags: Reviewed

    Description

      The DataNode is configured with ipc.client.ping set to false and ipc.ping.interval set to 14000. With this configuration, the IPC client (the DataNode, in this case) should time out after 14000 milliseconds (14 seconds) if the Standby NameNode does not respond to a sendHeartbeat call.
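To make the expected relationship between the two settings concrete, here is a minimal sketch of the decision the reporter expects the IPC client to make. It is not Hadoop code: the class `RpcTimeoutSketch`, the method `effectiveTimeoutMillis`, the plain `Map` used in place of a Hadoop `Configuration`, the default values, and the use of -1 for "no timeout" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the client-side timeout decision; not Hadoop code. */
final class RpcTimeoutSketch {
    static final String PING_ENABLED_KEY = "ipc.client.ping";
    static final String PING_INTERVAL_KEY = "ipc.ping.interval";

    /**
     * Returns the RPC timeout in milliseconds implied by the two settings.
     * Assumption: when ping is enabled the client keeps the call alive with
     * pings, which this sketch reports as -1 ("no timeout").
     */
    static int effectiveTimeoutMillis(Map<String, String> conf) {
        // Defaults below are assumed for this sketch.
        boolean pingEnabled =
                Boolean.parseBoolean(conf.getOrDefault(PING_ENABLED_KEY, "true"));
        int pingIntervalMillis =
                Integer.parseInt(conf.getOrDefault(PING_INTERVAL_KEY, "60000"));
        // With ping disabled, the ping interval doubles as the call timeout.
        return pingEnabled ? -1 : pingIntervalMillis;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PING_ENABLED_KEY, "false");
        conf.put(PING_INTERVAL_KEY, "14000");
        System.out.println(effectiveTimeoutMillis(conf) + " ms"); // prints "14000 ms"
    }
}
```

Under these assumptions, the configuration in this report yields a 14000 ms timeout, which is the behavior the DataNode's heartbeat RPC fails to exhibit.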

      What we observe instead: if the Standby NameNode reboots for any reason, the DataNodes heartbeating to it block forever in sendHeartbeat (see the stack trace below). When the Standby NameNode comes back up, the DataNode never re-registers with it. After that, failover fails completely.

      The desired behavior is that the DataNode's sendHeartbeat times out after 14 seconds and keeps retrying until the Standby NameNode comes back up. Once it does, the DataNode should reconnect, re-register, and offer service.
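The desired time-out-and-retry behavior can be sketched in plain Java as follows. This only illustrates the control flow: it bounds each call with `Future.get(timeout)`, whereas a real fix would apply the timeout inside the RPC layer. The class `HeartbeatRetrySketch`, the method `heartbeatUntilAnswered`, and the simulated NameNode in `main` are hypothetical names, not part of Hadoop.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a heartbeat that times out and retries; not Hadoop code. */
final class HeartbeatRetrySketch {

    /**
     * Invokes the heartbeat until one attempt completes within timeoutMillis.
     * Returns the number of attempts that were made.
     */
    static int heartbeatUntilAnswered(Callable<Void> heartbeat,
                                      long timeoutMillis,
                                      long retryDelayMillis) throws InterruptedException {
        // Daemon threads so a hung attempt can never keep the JVM alive.
        ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable, "heartbeat-sketch");
            thread.setDaemon(true);
            return thread;
        });
        try {
            int attempts = 0;
            while (true) {
                attempts++;
                Future<Void> pending = executor.submit(heartbeat);
                try {
                    // Bounded wait: give up on this attempt after the timeout.
                    pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return attempts;
                } catch (TimeoutException | ExecutionException e) {
                    pending.cancel(true);           // interrupt the stuck attempt
                    Thread.sleep(retryDelayMillis); // back off, then retry
                }
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated Standby NameNode: the first two heartbeats never get a reply.
        Callable<Void> rebootingNameNode = () -> {
            if (calls.incrementAndGet() <= 2) {
                Thread.sleep(60_000);
            }
            return null;
        };
        int attempts = heartbeatUntilAnswered(rebootingNameNode, 200, 10);
        System.out.println("heartbeat answered after " + attempts + " attempts");
    }
}
```

The key contrast with the reported behavior is the bounded wait: each attempt is abandoned after the timeout, so the caller can retry once the peer is reachable again instead of waiting indefinitely on a single call.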

      Specifically, in DatanodeProtocolClientSideTranslatorPB.java, the createNamenode method should use RPC.getProtocolProxy rather than RPC.getProxy to create the DatanodeProtocolPB object.

      Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:

      Thread 25 (DataNode: file:///opt/hadoop/data heartbeating to vmhost6-vm1/10.10.10.151:8020):
      State: WAITING
      Blocked count: 23843
      Waited count: 45676
      Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
      Stack:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:485)
      org.apache.hadoop.ipc.Client.call(Client.java:1220)
      org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
      sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
      sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
      sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      java.lang.reflect.Method.invoke(Method.java:597)
      org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
      org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
      sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
      org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
      org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
      org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
      org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
      java.lang.Thread.run(Thread.java:662)

      DataNode RPC to Standby NameNode never times out.

      Attachments

        1. HDFS-4858.patch
          1 kB
          Jagane Sundar
        2. HDFS-4858.patch
          3 kB
          Henry Wang
        3. HDFS-4858.patch
          2 kB
          Henry Wang


          People

            Assignee: Henry Wang (henry.wang)
            Reporter: Jagane Sundar (jagane)