Hadoop HDFS / HDFS-14857

FS operations fail in HA mode: DataNode fails to connect to NameNode


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: datanode

    Description

      In an HA configuration, if the NameNodes are restarted and assigned new IP addresses, any client FS operation such as a copyFromLocal fails with a message like the following:

      2019-09-12 18:47:30,544 WARN hdfs.DataStreamer: DataStreamer Exception
      org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/init.sh.COPYING could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2211) ...

      Looking at the DataNode's stderr shows the following:

      • The heartbeat service detects the IP change and recovers (almost)
      • At this stage, an hdfs dfsadmin -report reports all datanodes correctly
      • Once the write begins, the following exception shows up in the DataNode log: no route to host

      2019-09-12 01:35:11,251 WARN datanode.DataNode: IOException in offerService
      java.io.EOFException: End of File Exception between local host is: "storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211"; destination host is: "nmnode-0-0.nmnode-0-svc.test.svc.cluster.local":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)
              at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
              at org.apache.hadoop.ipc.Client.call(Client.java:1491)
              at org.apache.hadoop.ipc.Client.call(Client.java:1388)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
              at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:166)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:516)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:646)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.EOFException
              at java.io.DataInputStream.readInt(DataInputStream.java:392)
              at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)
              at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)

      2019-09-12 01:41:12,273 WARN ipc.Client: Address change detected. Old: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 New: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.220:9000
      ...


      2019-09-12 01:41:12,482 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 beginning handshake with NN
      2019-09-12 01:41:12,534 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 successfully registered with NN

       

      NOTE: When the 'Address change detected' message appears, the printout correctly shows the old and the new address (10.244.0.220). However, when the registration with the NN completes, the old IP address (10.244.0.217) is still being printed, which shows how cached copies of the IP addresses linger on.
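      The lingering cached address comes down to how java.net.InetSocketAddress behaves: the resolving constructor performs the DNS lookup once, at construction time, and pins the resulting IP inside the object. A minimal sketch of the two behaviors (illustrative only, not Hadoop code — the class name StaleAddressDemo is made up):

      ```java
      import java.net.InetSocketAddress;

      public class StaleAddressDemo {
          public static void main(String[] args) {
              // new InetSocketAddress(host, port) performs the DNS lookup
              // immediately and pins the resulting IP in the object. Any
              // proxy caching this object keeps talking to the old IP even
              // after the NameNode comes back with a new address.
              InetSocketAddress pinned = new InetSocketAddress("localhost", 9000);
              System.out.println("pinned resolved at construction: "
                  + !pinned.isUnresolved());

              // createUnresolved() defers the lookup, so each later
              // connection attempt can resolve afresh and observe a new IP.
              InetSocketAddress deferred =
                  InetSocketAddress.createUnresolved("localhost", 9000);
              System.out.println("deferred still unresolved: "
                  + deferred.isUnresolved());
          }
      }
      ```

      In other words, whether a component survives a NameNode IP change depends on whether it ever constructs a fresh (re-resolved) InetSocketAddress after a failure, or keeps reusing the pinned one.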

       

      The following is where the actual error occurs, preventing any writes to the FS:

       

      2019-09-12 18:45:29,843 INFO retry.RetryInvocationHandler: java.net.NoRouteToHostException: No Route to Host from storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211 to nmnode-0-1.nmnode-0-svc:50200 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking InMemoryAliasMapProtocolClientSideTranslatorPB.read over nmnode-0-1.nmnode-0-svc/10.244.0.217:50200 after 3 failover attempts. Trying to failover after sleeping for 4452ms.
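      The heartbeat path evidently does re-resolve (hence the 'Address change detected' message), while the alias-map proxy keeps retrying the stale 10.244.0.217 address. A hedged sketch of the kind of re-resolution check such a client could apply on failure (the class and method names here are hypothetical, not the actual org.apache.hadoop.ipc.Client implementation):

      ```java
      import java.net.InetSocketAddress;

      public class ReresolveSketch {
          // On a connection failure, resolve the hostname again; if DNS now
          // returns a different IP, adopt the fresh address instead of
          // retrying forever against the stale one. (Illustrative only.)
          static InetSocketAddress updateAddressIfChanged(InetSocketAddress current) {
              InetSocketAddress fresh =
                  new InetSocketAddress(current.getHostName(), current.getPort());
              if (!fresh.isUnresolved() && current.getAddress() != null
                      && !fresh.getAddress().equals(current.getAddress())) {
                  // would log: Address change detected. Old: <current> New: <fresh>
                  return fresh;
              }
              return current; // unchanged, or still unresolvable
          }

          public static void main(String[] args) {
              InetSocketAddress addr = new InetSocketAddress("localhost", 9000);
              // localhost's mapping is stable, so no change is detected here.
              System.out.println("changed: "
                  + (updateAddressIfChanged(addr) != addr));
          }
      }
      ```

      Applying such a check on every retry path (not just the heartbeat) would let the failover loop above eventually reach the NameNode at its new address.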

       


            People

              Assignee: Jeff Saremi (jeffsaremi2)
              Reporter: Jeff Saremi (jeffsaremi2)
              Votes: 0
              Watchers: 5


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m