  1. Hadoop YARN
  2. YARN-5006

ResourceManager exits because ApplicationStateData exceeds the znode size limit in ZooKeeper

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 2.7.2
    • Fix Version/s: 2.9.0, 3.0.0-alpha4
    • Component/s: resourcemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A client submits a job that adds 10000 files to the DistributedCache. When the job is submitted, the ResourceManager stores the ApplicationStateData in ZooKeeper. The ApplicationStateData exceeds the znode size limit, and the RM exits with status 1.
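
To see why 10000 cached files can cross the limit, a rough estimate helps. The sketch below assumes each DistributedCache entry adds about 120 bytes to the serialized application state; that figure is an illustrative guess, not a measurement from this cluster. It compares the total with ZooKeeper's default `jute.maxbuffer` of 0xfffff bytes. The class and method names are hypothetical.

```java
final class CacheSizeEstimate {
    // ZooKeeper's default jute.maxbuffer: 0xfffff bytes, just under 1 MiB.
    static final long ZK_DEFAULT_MAX_BUFFER = 0xfffffL;

    // Estimated serialized size for fileCount entries of avgBytesPerEntry each.
    static long estimateBytes(long fileCount, long avgBytesPerEntry) {
        return fileCount * avgBytesPerEntry;
    }

    public static void main(String[] args) {
        // 120 bytes per entry is an assumed average (URI plus metadata).
        long estimate = estimateBytes(10_000, 120);
        System.out.println("estimated bytes: " + estimate);
        System.out.println("exceeds default limit: "
            + (estimate > ZK_DEFAULT_MAX_BUFFER));
    }
}
```

Under that assumption the state is about 1.2 MB, which is larger than the default buffer, so the write cannot succeed however often it is retried.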

      The related code in RMStateStore.java:

        private static class StoreAppTransition
            implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
          @Override
          public void transition(RMStateStore store, RMStateStoreEvent event) {
            if (!(event instanceof RMStateStoreAppEvent)) {
              // should never happen
              LOG.error("Illegal event type: " + event.getClass());
              return;
            }
            ApplicationState appState = ((RMStateStoreAppEvent) event).getAppState();
            ApplicationId appId = appState.getAppId();
            ApplicationStateData appStateData = ApplicationStateData
                .newInstance(appState);
            LOG.info("Storing info for app: " + appId);
            try {
              // Persist the application state; for ZKRMStateStore this writes a znode.
              store.storeApplicationStateInternal(appId, appStateData);
              store.notifyApplication(new RMAppEvent(appId,
                     RMAppEventType.APP_NEW_SAVED));
            } catch (Exception e) {
              LOG.error("Error storing app: " + appId, e);
              // Any store failure is treated as fatal, which makes the RM exit.
              store.notifyStoreOperationFailed(e);
            }
          }
        }
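
One way to keep a single oversized application from taking down the whole RM is to check the serialized size before attempting the write and to reject only that application. The following is a minimal sketch of that idea, not the patch attached to this issue. `ZnodeSizeGuard`, `Outcome`, and `tryStore` are hypothetical names, and the limit used is ZooKeeper's default `jute.maxbuffer` value. Since `jute.maxbuffer` bounds the whole request rather than the znode data alone, a real limit would likely need some headroom.

```java
final class ZnodeSizeGuard {
    // ZooKeeper's default jute.maxbuffer: 0xfffff bytes, just under 1 MiB.
    static final int DEFAULT_LIMIT_BYTES = 0xfffff;

    /** Hypothetical outcome of a store attempt; not a Hadoop type. */
    enum Outcome { STORED, REJECTED_TOO_LARGE }

    /**
     * Decides whether serialized application state may be written.
     * Oversized state is rejected up front, so only that application
     * fails and the caller never reaches the fatal store-failure path.
     */
    static Outcome tryStore(byte[] serializedState, int limitBytes) {
        if (serializedState.length > limitBytes) {
            return Outcome.REJECTED_TOO_LARGE;
        }
        // A real store would write the znode here; this sketch only decides.
        return Outcome.STORED;
    }

    public static void main(String[] args) {
        System.out.println(tryStore(new byte[1024], DEFAULT_LIMIT_BYTES));
        System.out.println(tryStore(new byte[2 * 1024 * 1024], DEFAULT_LIMIT_BYTES));
    }
}
```

A caller in the position of `StoreAppTransition` could map `REJECTED_TOO_LARGE` to an application-level failure instead of calling `notifyStoreOperationFailed`.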
      

      The Exception log:

       ...
      2016-04-20 11:26:35,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
      
      2016-04-20 11:26:35,732 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore AsyncDispatcher event handler: Error storing app: application_1461061795989_17671
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
              at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
              at java.lang.Thread.run(Thread.java:724)
      
         ...
      2016-04-20 11:26:45,613 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager AsyncDispatcher event handler: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
              at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
              at java.lang.Thread.run(Thread.java:724)
      2016-04-20 11:26:45,615 INFO org.apache.hadoop.util.ExitUtil AsyncDispatcher event handler: Exiting with status 1
      2016-04-20 11:26:45,622 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-17,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2016-04-20 11:26:45,623 INFO org.mortbay.log Thread-1: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@10.0.0.1:9088
      2016-04-20 11:26:45,623 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-21,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2016-04-20 11:26:45,624 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager Thread[Thread-19,5,main]: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2016-04-20 11:26:45,724 INFO org.apache.hadoop.ipc.Server Thread-1: Stopping server on 9033
      2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server listener on 9033: Stopping IPC Server listener on 9033
      2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector Thread-1: Yielding from election
      2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server Responder: Stopping IPC Server Responder
      2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector Thread-1: Deleting bread-crumb of active node...
      2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ZooKeeper Thread-1: Session: 0x2504c1df9409094 closed
      2016-04-20 11:26:45,729 WARN org.apache.hadoop.ha.ActiveStandbyElector main-EventThread: Ignoring stale result from old client with sessionId 0x2504c1df9409094
      2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ClientCnxn main-EventThread: EventThread shut down
      
      

        Attachments

        1. YARN-5006.001.patch
          9 kB
          Bibin Chundatt
        2. YARN-5006.002.patch
          14 kB
          Bibin Chundatt
        3. YARN-5006.003.patch
          15 kB
          Bibin Chundatt
        4. YARN-5006.004.patch
          15 kB
          Bibin Chundatt
        5. YARN-5006.005.patch
          15 kB
          Bibin Chundatt
        6. YARN-5006-branch-2.005.patch
          15 kB
          Bibin Chundatt

              People

              • Assignee: Bibin Chundatt (bibinchundatt)
              • Reporter: dongtingting (dongtingting8877@163.com)
              • Votes: 1
              • Watchers: 14
