YARN-5579: ResourceManager should surface failed state store operations prominently

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.3
    • None
    • None

    Description

      I found the following in the ResourceManager log while trying to figure out why an application got stuck in the NEW_SAVING state.

      2016-08-29 18:14:23,486 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1242)) - Maxed out ZK retries. Giving up!
      2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:transition(205)) - Error storing app: application_1470517915158_0001
      org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:201)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:183)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
              at java.lang.Thread.run(Thread.java:745)
      2016-08-29 18:14:23,486 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(987)) - State store operation failed
      org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
              at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
      

      The ResourceManager should surface the above error prominently, rather than leaving it only in the log.

      Subsequent application submissions would likely encounter the same error.
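
One way the request above could be approached is sketched below, purely as an illustration. `StateStoreFailureTracker` and all of its methods are hypothetical names and are not part of Hadoop or of any patch on this issue. The sketch assumes the store code would call `recordFailure` at the point where the "State store operation failed" message is logged, and that a web UI, metric, or health check could then read `isUnhealthy()` and `summary()`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical helper (not a Hadoop class): remembers consecutive
 * state-store failures so they can be shown somewhere more visible
 * than the log file.
 */
final class StateStoreFailureTracker {
    // Number of consecutive failures after which the store is reported unhealthy.
    private final int unhealthyThreshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicReference<String> lastFailure = new AtomicReference<>("");

    StateStoreFailureTracker(int unhealthyThreshold) {
        if (unhealthyThreshold < 1) {
            throw new IllegalArgumentException("unhealthyThreshold must be >= 1");
        }
        this.unhealthyThreshold = unhealthyThreshold;
    }

    /** Called when a store operation fails after its retries are exhausted. */
    void recordFailure(String operation, Throwable cause) {
        consecutiveFailures.incrementAndGet();
        lastFailure.set(operation + " failed: " + cause);
    }

    /** Called when a store operation succeeds; clears the failure streak. */
    void recordSuccess() {
        consecutiveFailures.set(0);
        lastFailure.set("");
    }

    /** True once the failure streak has reached the configured threshold. */
    boolean isUnhealthy() {
        return consecutiveFailures.get() >= unhealthyThreshold;
    }

    /** One-line status suitable for a UI banner or health report. */
    String summary() {
        int failures = consecutiveFailures.get();
        if (failures == 0) {
            return "state store OK";
        }
        return "state store FAILING (" + failures + " consecutive): " + lastFailure.get();
    }
}
```

With a threshold of 1, a single AuthFailed like the one in the log would already flip the status, which matches the observation that later submissions are likely to hit the same error. A higher threshold would trade visibility for fewer false alarms on transient ZooKeeper problems.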

          People

            Assignee: Unassigned
            Reporter: Ted Yu (yuzhihong@gmail.com)
            Votes: 0
            Watchers: 4
