Hadoop YARN / YARN-4288

NodeManager restart should keep retrying to register with the RM when a connection exception occurs during RM failover.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: nodemanager
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When the NM is restarted, NodeStatusUpdaterImpl tries to register with the RM over RPC. If the RM is being restarted at the same time, the call can throw exceptions such as the following:

      2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - Unexpected error rebooting NodeStatusUpdater
      java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
              at org.apache.hadoop.ipc.Client.call(Client.java:1473)
              at org.apache.hadoop.ipc.Client.call(Client.java:1400)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
              at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
              at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
              at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
      Caused by: java.io.IOException: Connection reset by peer
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:197)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
              at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
              at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
              at java.io.DataInputStream.readInt(DataInputStream.java:387)
              at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
      2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager (NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater.
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
      Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025;
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
              at org.apache.hadoop.ipc.Client.call(Client.java:1473)
              at org.apache.hadoop.ipc.Client.call(Client.java:1400)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
              at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
              at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
              at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
              ... 1 more
      Caused by: java.io.IOException: Connection reset by peer
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:197)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
              at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
              at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
              at java.io.DataInputStream.readInt(DataInputStream.java:387)
              at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
      2015-08-17 14:35:59,445 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
      2015-08-17 14:35:59,547 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1439417357296_45357, application_1439417357296_45403, application_1439417357296_45355, application_1439417357296_45111, application_1439417357296_45452, application_1439417357296_45350, application_1439417357296_45499, application_1439417357296_45205, application_1439417357296_21009]
      2015-08-17 14:35:59,548 INFO  ipc.Server (Server.java:stop(2469)) - Stopping server on 45454
      2015-08-17 14:35:59,551 INFO  ipc.Server (Server.java:run(717)) - Stopping IPC Server listener on 45454
      2015-08-17 14:35:59,551 INFO  logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
      2015-08-17 14:35:59,552 INFO  ipc.Server (Server.java:run(843)) - Stopping IPC Server Responder
      

      This causes the NM restart to fail. We should have a simple fix that allows the registration with the RM to be retried on connection failures.
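To illustrate the requested behavior, here is a minimal, standalone sketch of retrying a call on connection failures. It is not the attached patch and does not use Hadoop classes: `RegistrationRetry`, `IoCall`, and `callWithRetry`, along with the attempt count and sleep interval, are hypothetical names and values chosen for illustration only. The idea is that an `IOException` such as "Connection reset by peer" triggers another attempt after a fixed sleep instead of aborting the restart.

```java
import java.io.IOException;

/** Hypothetical helper for illustration; not part of Hadoop or the YARN-4288 patch. */
final class RegistrationRetry {

    /** An action that may fail with an IOException, for example an RPC call. */
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the action, retrying on IOException up to maxAttempts times with a
     * fixed sleep between attempts. Rethrows the last IOException if every
     * attempt fails.
     */
    static <T> T callWithRetry(IoCall<T> action, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated registration: the first two attempts fail with a connection reset.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection reset by peer");
            }
            return "registered";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the simulated call succeeds on the third attempt. A fixed sleep keeps the example short; a real fix would more likely plug into the retry policy already used by the RM proxy than wrap the call by hand.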

      Attachments

        1. YARN-4288-v3.patch
          8 kB
          Junping Du
        2. YARN-4288-v2.patch
          8 kB
          Junping Du
        3. YARN-4288.patch
          6 kB
          Junping Du

          People

            Assignee: Junping Du (junping_du)
            Reporter: Junping Du (junping_du)
            Votes: 0
            Watchers: 10
