Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 2.6.0, 2.7.2
- None
- Hadoop Flags: Reviewed
Description
A client submits a job that adds 10,000 files to the DistributedCache. When the job is submitted, the ResourceManager stores the ApplicationStateData into ZooKeeper, but the ApplicationStateData exceeds the znode size limit, so the RM exits with status 1.
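For context, a minimal reproduction sketch, assuming a plain MapReduce client; the class name, HDFS paths, and exact file count are hypothetical, not taken from the original job:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ManyCacheFilesJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "many-cache-files");
    job.setJarByClass(ManyCacheFilesJob.class);

    // Each cache file adds an entry (URI, size, timestamp, visibility) to the
    // application submission context, which the RM persists as part of
    // ApplicationStateData; on the order of 10,000 entries can push the
    // serialized record past the znode size limit. Paths are hypothetical.
    for (int i = 0; i < 10000; i++) {
      job.addCacheFile(new URI("hdfs:///tmp/cache/part-" + i));
    }

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}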
The related code in RMStateStore.java:

private static class StoreAppTransition
    implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
  @Override
  public void transition(RMStateStore store, RMStateStoreEvent event) {
    if (!(event instanceof RMStateStoreAppEvent)) {
      // should never happen
      LOG.error("Illegal event type: " + event.getClass());
      return;
    }
    ApplicationState appState = ((RMStateStoreAppEvent) event).getAppState();
    ApplicationId appId = appState.getAppId();
    ApplicationStateData appStateData = ApplicationStateData
        .newInstance(appState);
    LOG.info("Storing info for app: " + appId);
    try {
      store.storeApplicationStateInternal(appId, appStateData); // store the appStateData
      store.notifyApplication(new RMAppEvent(appId,
          RMAppEventType.APP_NEW_SAVED));
    } catch (Exception e) {
      LOG.error("Error storing app: " + appId, e);
      store.notifyStoreOperationFailed(e); // handle fail event, system exit
    }
  };
}
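The linked issues (YARN-6531, YARN-6743) point toward checking the serialized size before the store call rather than letting ZooKeeper reject the write and the RM exit. Below is a minimal sketch of such a guard, assuming the PB-backed record is serialized via getProto().toByteArray() as in ZKRMStateStore.storeApplicationStateInternal; the class and method names are illustrative, not the committed fix:

import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.records.ApplicationStateData;

// Hypothetical guard: fail only the offending application's store operation
// when its serialized state would not fit in a znode, instead of letting the
// whole ResourceManager exit with status 1.
final class ZnodeSizeGuard {
  private ZnodeSizeGuard() {
  }

  static void checkAppStateSize(ApplicationId appId,
      ApplicationStateData appStateData, int maxZnodeSizeBytes)
      throws IOException {
    // ZKRMStateStore serializes the record with getProto().toByteArray()
    // before writing it to ZooKeeper, so measure the same bytes here.
    int size = appStateData.getProto().toByteArray().length;
    if (size > maxZnodeSizeBytes) {
      throw new IOException("State of " + appId + " is " + size
          + " bytes, which exceeds the znode limit of " + maxZnodeSizeBytes
          + " bytes; rejecting this application instead of crashing the RM.");
    }
  }
}

The limit itself could be read from the yarn.resourcemanager.zk-max-znode-size.bytes property discussed in the linked YARN-6743, and StoreAppTransition could then catch the failure and reject just the one application instead of calling notifyStoreOperationFailed.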
The Exception log:
...
2016-04-20 11:26:35,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
2016-04-20 11:26:35,732 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore AsyncDispatcher event handler: Error storing app: application_1461061795989_17671
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:724)
...
2016-04-20 11:26:45,613 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager AsyncDispatcher event handler: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:724)
2016-04-20 11:26:45,615 INFO org.apache.hadoop.util.ExitUtil AsyncDispatcher event handler: Exiting with status 1
2016-04-20 11:26:45,622 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-17,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2016-04-20 11:26:45,623 INFO org.mortbay.log Thread-1: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@10.0.0.1:9088
2016-04-20 11:26:45,623 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-21,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2016-04-20 11:26:45,624 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-19,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2016-04-20 11:26:45,724 INFO org.apache.hadoop.ipc.Server Thread-1: Stopping server on 9033
2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server listener on 9033: Stopping IPC Server listener on 9033
2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector Thread-1: Yielding from election
2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server Responder: Stopping IPC Server Responder
2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector Thread-1: Deleting bread-crumb of active node...
2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ZooKeeper Thread-1: Session: 0x2504c1df9409094 closed
2016-04-20 11:26:45,729 WARN org.apache.hadoop.ha.ActiveStandbyElector main-EventThread: Ignoring stale result from old client with sessionId 0x2504c1df9409094
2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ClientCnxn main-EventThread: EventThread shut down
Attachments
Issue Links
- breaks
  - YARN-6819 Application report fails if app rejected due to nodesize (Resolved)
- duplicates
  - YARN-6531 Check appStateData size before saving to Zookeeper (Resolved)
- is related to
  - YARN-6825 RM quit due to ApplicationStateData exceed the limit size of znode in zk (Open)
  - YARN-6743 yarn.resourcemanager.zk-max-znode-size.bytes description needs spaces in yarn-default.xml (Resolved)
- relates to
  - YARN-2368 ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB (Reopened)
  - YARN-9847 ZKRMStateStore will cause zk connection loss when writing huge data into znode (Resolved)