Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
2.6.0
-
None
-
Reviewed
Description
When NM get restarted, NodeStatusUpdaterImpl will try to register to RM with RPC which could throw following exceptions when RM get restarted at the same time, like following exception shows:
2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - Unexpected error rebooting NodeStatusUpdater java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215) at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967) 2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager (NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater. org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223) at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215) ... 1 more Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967) 2015-08-17 14:35:59,445 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042 2015-08-17 14:35:59,547 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1439417357296_45357, application_1439417357296_45403, application_1439417357296_45355, application_1439417357296_45111, application_1439417357296_45452, application_1439417357296_45350, application_1439417357296_45499, application_1439417357296_45205, application_1439417357296_21009] 2015-08-17 14:35:59,548 INFO ipc.Server (Server.java:stop(2469)) - Stopping server on 45454 2015-08-17 14:35:59,551 INFO ipc.Server (Server.java:run(717)) - Stopping IPC Server listener on 45454 2015-08-17 14:35:59,551 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2015-08-17 14:35:59,552 INFO ipc.Server (Server.java:run(843)) - Stopping IPC Server Responder
It will make NM restart get failed. We should have a simple fix to allow this register to RM can retry with connection failures.
Attachments
Attachments
Issue Links
- relates to
-
YARN-4274 NodeStatusUpdaterImpl should register to RM again after a non-fatal exception happen before
- Open