Hadoop YARN / YARN-9064

Both Resource Managers stay in standby after connection to ZooKeeper was recovered


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: resourcemanager, yarn
    • Labels: None
    • Environment:
      • cluster of 31 nodes
      • each node is a VM with 60GB of RAM and 8 vcpus
      • each VM is running CentOS 7.2 with Hadoop 2.6.0
      • Hadoop cluster is secured with Kerberos
      • Hadoop cluster is configured with HA

      Description

      I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos and configured for HA. The first 3 nodes host both master and slave services:

      • Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History Server, DataNode, NodeManager, ZooKeeper and Kerberos
      • Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, ZooKeeper and Kerberos
      • Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
      • Node-4..Node-31: DataNode and NodeManager

      At one point there was a problem with the switch the nodes were connected to, and all the services started losing connectivity.

      1. At first, Kerberos stopped granting tickets.
      2. This broke the cluster, as Hadoop services could not authenticate to each other.
      3. At some point the ZooKeeper cluster lost its leader and started a re-election.
      4. This resulted in multiple ZooKeeper-related errors and warnings in the ResourceManager and ZKFC logs.
      5. After a while, when the issue with the switch was resolved, most services recovered automatically.
      6. "Most" means everything except YARN:
        1. both ResourceManagers were stuck in standby mode
        2. all NodeManagers were shut down
      7. I managed to recover YARN, but it required manually restarting both ResourceManagers (and starting all NodeManagers).

      I have all the logs from the incident, but these seem to be the most important:

      2018-11-16 03:21:16,420 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1539778834071_0622_000001
      2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1539778834071_0622_000001
      2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1539778834071_0622 State change from NEW to ACCEPTED on event = RECOVER
      2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 622 out of 622 applications
      2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 1
      2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended
      2018-11-16 03:21:16,425 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1539778834071_0622_000002
      2018-11-16 03:21:16,426 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on event = START
      2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
      2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
      2018-11-16 03:21:16,427 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
      2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 32
      2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey.
      2018-11-16 03:21:16,440 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
      2018-11-16 03:21:16,441 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
      2018-11-16 03:21:16,444 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 33
      2018-11-16 03:21:16,445 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey.
      2018-11-16 03:21:16,458 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1539778834071_0622 requests cleared
      2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler from user packer
      2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
      2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000
      2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
      		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
      		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
      		at java.security.AccessController.doPrivileged(Native Method)
      		at javax.security.auth.Subject.doAs(Subject.java:422)
      		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
      		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
      		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
      		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
      		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
      		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
      		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
      		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
      		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
      		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
      		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
      		... 20 more
      Caused by: java.net.SocketException: Unresolved address
      		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
      		at sun.nio.ch.Net.translateException(Net.java:157)
      		at sun.nio.ch.Net.translateException(Net.java:163)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
      		... 28 more
      Caused by: java.nio.channels.UnresolvedAddressException
      		at sun.nio.ch.Net.checkAddress(Net.java:101)
      		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
      		... 29 more
      2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
      		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
      		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
      		at java.security.AccessController.doPrivileged(Native Method)
      		at javax.security.auth.Subject.doAs(Subject.java:422)
      		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
      		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
      		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
      		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
      		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
      		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
      		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
      		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
      		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
      		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
      		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
      		... 20 more
      Caused by: java.net.SocketException: Unresolved address
      		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
      		at sun.nio.ch.Net.translateException(Net.java:157)
      		at sun.nio.ch.Net.translateException(Net.java:163)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
      		... 28 more
      Caused by: java.nio.channels.UnresolvedAddressException
      		at sun.nio.ch.Net.checkAddress(Net.java:101)
      		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
      		... 29 more
      2018-11-16 03:21:16,470 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException
      2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
      2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
      2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted
      2018-11-16 03:21:16,472 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
      2018-11-16 03:21:16,472 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
      2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
      2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
      2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
      2018-11-16 03:21:16,477 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
      2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3671a89731f0000 closed
      2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
      2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
      2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
      2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms
      2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms
      2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: Using RMStateStore implementation - class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
      2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
      2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
      2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
      2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
      2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system started
      2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
      2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
      2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already exists.
      2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo MBean
      2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: YARN system metrics publishing service is not enabled
      2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
      2018-11-16 03:21:16,496 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=transitionToActive    TARGET=RMHAProtocolService      RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   PERMISSIONS=
      2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
      org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
      		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134)
      		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
      		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
      		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
      		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
      		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311)
      		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
      		... 4 more
      Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
      		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
      		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
      		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
      		at java.security.AccessController.doPrivileged(Native Method)
      		at javax.security.auth.Subject.doAs(Subject.java:422)
      		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
      		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
      		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
      		... 5 more
      Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
      		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
      		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
      		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
      		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
      		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
      		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
      		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
      		... 20 more
      Caused by: java.net.SocketException: Unresolved address
      		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
      		at sun.nio.ch.Net.translateException(Net.java:157)
      		at sun.nio.ch.Net.translateException(Net.java:163)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
      		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
      		... 28 more
      Caused by: java.nio.channels.UnresolvedAddressException
      		at sun.nio.ch.Net.checkAddress(Net.java:101)
      		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
      		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
      		... 29 more
      2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
      2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 0x36681eb8c720002 closed
      2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b
      2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: Unable to connect to server: node-2.mydomain.com:2181
      java.net.UnknownHostException: node-2.mydomain.com
      		at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
      		at java.net.InetAddress.getAllByName(InetAddress.java:1192)
      		at java.net.InetAddress.getAllByName(InetAddress.java:1126)
      		at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
      		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
      		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
      		at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630)
      		at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774)
      		at org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749)
      		at org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660)
      		at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688)
      		at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530)
      		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484)
      		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
      		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt to authenticate using SASL (unknown error)
      2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.242.1.105:46773, server: node-3.mydomain.com/10.242.1.106:2181
      2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-3.mydomain.com/10.242.1.106:2181, sessionid = 0x3671a89731f0003, negotiated timeout = 10000
      2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x36681eb8c720002
      2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
      2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml
      2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
      2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state
      2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
      2018-11-16 03:30:57,669 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
      2018-11-16 03:31:16,496 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
      2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
      2018-11-19 13:35:39,357 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
      2018-11-19 13:35:45,785 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
      2018-11-21 08:29:19,995 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
      2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
      2018-11-21 08:29:23,666 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
      2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
      2018-11-21 08:31:37,258 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
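
      For context, the innermost cause in the traces above (java.nio.channels.UnresolvedAddressException raised from Net.checkAddress during a bind) is what the JDK throws when a server channel is asked to bind to an InetSocketAddress whose host name was not resolved. The sketch below reproduces only that JDK behaviour outside Hadoop, and shows that an InetSocketAddress does not re-resolve by itself: a new object has to be built to trigger another lookup. The class UnresolvedBindDemo and its helpers tryBind and reResolve are hypothetical names used for illustration, and "localhost" stands in for the real host name. Whether the ResourceManager actually holds on to such a stale address is an assumption that these logs do not confirm.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.UnresolvedAddressException;

// Hypothetical demo class; it does not use any Hadoop code.
final class UnresolvedBindDemo {

    /** Tries to bind a server channel to addr; returns false if addr is unresolved. */
    static boolean tryBind(InetSocketAddress addr) {
        try (ServerSocketChannel channel = ServerSocketChannel.open()) {
            // The JDK rejects unresolved addresses before touching the network stack.
            channel.bind(addr);
            return true;
        } catch (UnresolvedAddressException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a new address object, which performs a fresh host name lookup. */
    static InetSocketAddress reResolve(InetSocketAddress addr) {
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // Simulates an address captured while name resolution was unavailable.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("localhost", 0);
        System.out.println("stale is unresolved: " + stale.isUnresolved());
        System.out.println("bind with stale ok:  " + tryBind(stale));

        // A new object resolves again; port 0 asks for any free port.
        InetSocketAddress fresh = reResolve(stale);
        System.out.println("fresh is unresolved: " + fresh.isUnresolved());
        System.out.println("bind with fresh ok:  " + tryBind(fresh));
    }
}
```

      If the address used by ResourceTrackerService had been created while DNS was unreachable, this would be consistent with the "destination host is: (unknown):0" text in the log, but that remains a guess until confirmed against the 2.6.0 source.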
      

      I have found a few tickets about race conditions in YARN that surface when connections to ZooKeeper fail, but either they should already have been fixed in 2.6.0 or their logs don't match mine.

      People

      • Assignee: Unassigned
      • Reporter: Zbigniew Kostrzewa (kostrzewa)