Hadoop YARN / YARN-3385

Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

    Details

    • Hadoop Flags: Reviewed

      Description

      A race condition during ZK node deletion (Op.delete) can raise KeeperException$NoNodeException, which causes the RM to shut down.
      The race condition is similar to the one in YARN-3023.
      Since the race condition exists for ZK node creation, it should also exist for ZK node deletion.
      We see this issue with the following stack trace:

      2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
      	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      	at java.lang.Thread.run(Thread.java:745)
      2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
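
      The trace shows the NoNodeException surfacing from a delete issued through runWithRetries. As a rough illustration only, and not the patch attached to this issue, the sketch below assumes the race comes from a retried delete whose first attempt was already applied before its reply was lost. Every class, method, and path in it (FlakyStore, deleteWithRetries, safeDeleteWithRetries, /rmstore/app_1) is a hypothetical stand-in, not part of ZooKeeper or ZKRMStateStore.

```java
import java.util.HashSet;
import java.util.Set;

public class SafeDeleteSketch {

    /** Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical stand-in for a connection error that triggers a retry. */
    static class ConnectionLossException extends Exception {}

    /** Toy in-memory store: the first successful delete is applied, but its reply is lost. */
    static class FlakyStore {
        final Set<String> nodes = new HashSet<>();
        boolean dropNextReply = true;

        void delete(String path) throws NoNodeException, ConnectionLossException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
            if (dropNextReply) {
                dropNextReply = false;
                // The node is already gone, but the caller never learns that.
                throw new ConnectionLossException();
            }
        }
    }

    /** Naive retry loop: the second attempt surfaces NoNodeException to the caller. */
    static void deleteWithRetries(FlakyStore store, String path, int maxAttempts)
            throws NoNodeException, ConnectionLossException {
        for (int attempt = 1; ; attempt++) {
            try {
                store.delete(path);
                return;
            } catch (ConnectionLossException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    /** Tolerant variant: a missing node is treated as a delete that already took effect. */
    static void safeDeleteWithRetries(FlakyStore store, String path, int maxAttempts)
            throws ConnectionLossException {
        try {
            deleteWithRetries(store, path, maxAttempts);
        } catch (NoNodeException e) {
            // The node is absent, which is the desired end state of a delete.
        }
    }

    public static void main(String[] args) throws Exception {
        String path = "/rmstore/app_1"; // made-up path for illustration

        FlakyStore first = new FlakyStore();
        first.nodes.add(path);
        try {
            deleteWithRetries(first, path, 3);
            System.out.println("naive delete: ok");
        } catch (NoNodeException e) {
            System.out.println("naive delete failed: " + e.getMessage());
        }

        FlakyStore second = new FlakyStore();
        second.nodes.add(path);
        safeDeleteWithRetries(second, path, 3);
        System.out.println("node still present after safe delete: " + second.nodes.contains(path));
    }
}
```

      The tolerant variant also swallows NoNode on the very first attempt. That is acceptable only when the caller cares about the node being absent afterwards, not about who removed it.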
      

        Attachments

        1. YARN-3385.004.patch
          10 kB
          zhihai xu
        2. YARN-3385.003.patch
          10 kB
          zhihai xu
        3. YARN-3385.002.patch
          10 kB
          zhihai xu
        4. YARN-3385.001.patch
          8 kB
          zhihai xu
        5. YARN-3385.000.patch
          8 kB
          zhihai xu

              People

              • Assignee:
                zxu zhihai xu
                Reporter:
                zxu zhihai xu
              • Votes:
                0
                Watchers:
                11
