Hadoop YARN / YARN-5579

Resourcemanager should surface failed state store operation prominently

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.3
    • None
    • None

    Description

      I found the following in the ResourceManager log while trying to figure out why an application got stuck in the NEW_SAVING state.

      2016-08-29 18:14:23,486 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
      2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001
      org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
              at java.lang.Thread.run(Thread.java:745)
      2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed
      org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
      

      The ResourceManager should surface the above error prominently, rather than leaving it only in the log.

      Subsequent application submissions would likely encounter the same error.
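
One possible way to surface such failures, sketched here purely for illustration: keep a small in-memory record of state store failures that a status page or health check could query. The class `StateStoreFailureTracker` and all of its methods are hypothetical and are not part of YARN. The sketch assumes the code path that currently only logs "State store operation failed" could also call `recordFailure`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, NOT part of YARN: remembers state store failures
// so they can be reported somewhere more visible than the log.
final class StateStoreFailureTracker {
    private final AtomicLong failureCount = new AtomicLong();
    private final AtomicReference<String> lastFailure = new AtomicReference<>();

    // Assumed to be called where the failure is currently only logged.
    void recordFailure(String operation, Exception cause) {
        failureCount.incrementAndGet();
        lastFailure.set(operation + " failed: " + cause.getMessage());
    }

    // Clears the unhealthy flag once a store operation succeeds again.
    void recordSuccess() {
        lastFailure.set(null);
    }

    boolean isHealthy() {
        return lastFailure.get() == null;
    }

    long getFailureCount() {
        return failureCount.get();
    }

    // Text a web UI, metrics endpoint, or health check could display.
    String getHealthReport() {
        String failure = lastFailure.get();
        if (failure == null) {
            return "State store OK";
        }
        return "State store UNHEALTHY: " + failure
            + " (failures so far: " + failureCount.get() + ")";
    }

    public static void main(String[] args) {
        StateStoreFailureTracker tracker = new StateStoreFailureTracker();
        // Stand-in exception; the real one in the log is a ZooKeeper KeeperException.
        tracker.recordFailure("storeApplicationState",
            new RuntimeException("KeeperErrorCode = AuthFailed"));
        System.out.println(tracker.getHealthReport());
    }
}
```

With something along these lines, an operator would see the failure without searching the log, and a later successful store operation would clear the unhealthy state while the failure count is retained.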

          People

            Assignee: Unassigned
            Reporter: Ted Yu (yuzhihong@gmail.com)
