  1. Hadoop YARN
  2. YARN-8855

Application fails if one of the subclusters is down.

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: federation
    • Labels: None

    Description

      If one of the subclusters is down, the application keeps retrying and then fails. About 30 failover attempts appear in the logs. The detailed exception is below.

      2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container container_e03_1538297667953_0005_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
      2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing container_e03_1538297667953_0005_01_000001 from application application_1538297667953_0005 | ApplicationImpl.java:512
      2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping resource-monitoring for container_e03_1538297667953_0005_01_000001 | ContainersMonitorImpl.java:932
      2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering container container_e03_1538297667953_0005_01_000001 for log-aggregation | AppLogAggregatorImpl.java:538
      2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
      2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping container container_e03_1538297667953_0005_01_000001 | YarnShuffleService.java:295
      2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container container_e03_1538297667953_0005_01_000001 while processing FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
      2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed containers from NM context: [container_e03_1538297667953_0005_01_000001] | NodeStatusUpdaterImpl.java:696
      2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
      2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
      2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
      2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts. Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411
      2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
      2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
      2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
      2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts. Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411
      2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
      2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
      2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
      2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register application master: cluster2 Application: appattempt_1538297667953_0005_000001 | FederationInterceptor.java:1106
      java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1517)
        at org.apache.hadoop.ipc.Client.call(Client.java:1459)
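
To quantify the retry loop in a log like the one above, a small script can tally the failover attempts and the time spent sleeping per subcluster. This is a minimal sketch and not part of Hadoop: the function name `summarize_failovers` and the regular expression are assumptions, based only on the message format of the RetryInvocationHandler lines in this report.

```python
import re

# Assumed pattern, derived from the RetryInvocationHandler lines in this report:
# "... over cluster2 after 28 failover attempts. Trying to failover after
#  sleeping for 15261ms."
RETRY_RE = re.compile(
    r"over (?P<subcluster>\S+) after (?P<attempts>\d+) failover attempts\. "
    r"Trying to failover after sleeping for (?P<sleep_ms>\d+)ms"
)


def summarize_failovers(lines):
    """Return {subcluster: (highest attempt count seen, total sleep in ms)}."""
    summary = {}
    for line in lines:
        match = RETRY_RE.search(line)
        if match is None:
            continue  # not a failover retry line
        sub = match.group("subcluster")
        attempts = int(match.group("attempts"))
        sleep_ms = int(match.group("sleep_ms"))
        prev_attempts, prev_sleep = summary.get(sub, (0, 0))
        summary[sub] = (max(prev_attempts, attempts), prev_sleep + sleep_ms)
    return summary


if __name__ == "__main__":
    # Abbreviated copies of the two retry lines from the log in this report.
    sample = [
        "2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: "
        "Connection refused, while invoking "
        "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 "
        "after 28 failover attempts. Trying to failover after sleeping for 15261ms. "
        "| RetryInvocationHandler.java:411",
        "2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: "
        "Connection refused, while invoking "
        "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 "
        "after 29 failover attempts. Trying to failover after sleeping for 21175ms. "
        "| RetryInvocationHandler.java:411",
    ]
    print(summarize_failovers(sample))  # {'cluster2': (29, 36436)}
```

Run against the full NodeManager log, the same function would show how long the application was blocked on the unreachable subcluster before the final ERROR.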
      

      cc botong 

            People

              Assignee: Unassigned
              Reporter: Rahul Anand (rahulanand90)