Details
-
Bug
-
Status: Patch Available
-
Major
-
Resolution: Unresolved
-
3.0.0-alpha2
-
None
-
None
Description
Currently, node blacklisting upon AM failures only handles failures that happen after AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM, where the AM container is allocated, goes unresponsive. Because it is not handled, scheduler may continue to allocate AM containers on that same NM for the following app attempts.
Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1475) at org.apache.hadoop.ipc.Client.call(Client.java:1408) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy86.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy87.startContainers(Unknown Source) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) at org.apache.hadoop.ipc.Client.call(Client.java:1447) ... 15 more Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) ... 18 more . Failing the application.
Attachments
Attachments
Issue Links
- relates to
-
YARN-4576 Enhancement for tracking Blacklist in AM Launching
- Open