Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-6409

RM does not blacklist node for AM launch failures

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 3.0.0-alpha2
    • None
    • resourcemanager
    • None

    Description

      Currently, node blacklisting upon AM failures only handles failures that happen after AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM, where the AM container is allocated, goes unresponsive. Because it is not handled, scheduler may continue to allocate AM containers on that same NM for the following app attempts.

      Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; 
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 
      at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
      at org.apache.hadoop.ipc.Client.call(Client.java:1408) 
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) 
      at com.sun.proxy.$Proxy86.startContainers(Unknown Source) 
      at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) 
      at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) 
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
      at java.lang.reflect.Method.invoke(Method.java:497) 
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) 
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) 
      at com.sun.proxy.$Proxy87.startContainers(Unknown Source) 
      at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) 
      at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 
      Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] 
      at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) 
      at java.security.AccessController.doPrivileged(Native Method) 
      at javax.security.auth.Subject.doAs(Subject.java:422) 
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) 
      at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) 
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) 
      at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) 
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) 
      at org.apache.hadoop.ipc.Client.call(Client.java:1447) 
      ... 15 more 
      Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] 
      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
      at java.io.FilterInputStream.read(FilterInputStream.java:133) 
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
      at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
      at java.io.DataInputStream.readInt(DataInputStream.java:387) 
      at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) 
      at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) 
      at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) 
      at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) 
      at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) 
      at java.security.AccessController.doPrivileged(Native Method) 
      at javax.security.auth.Subject.doAs(Subject.java:422) 
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) 
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) 
      ... 18 more 
      . Failing the application.
      

      Attachments

        1. YARN-6409.00.patch
          14 kB
          Haibo Chen
        2. YARN-6409.01.patch
          14 kB
          Haibo Chen
        3. YARN-6409.02.patch
          15 kB
          Haibo Chen
        4. YARN-6409.03.patch
          19 kB
          Haibo Chen

        Issue Links

          Activity

            People

              kanwaljeets Kanwaljeet Sachdev
              haibochen Haibo Chen
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: