  1. Hadoop YARN
  2. YARN-3385

Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

Details

    • Reviewed

    Description

      Race condition: KeeperException$NoNodeException causes the RM to shut down during ZK node deletion (Op.delete).
      The race condition is similar to the one in YARN-3023.
      Since the race condition exists for ZK node creation, it should also exist for ZK node deletion.
      We see this issue with the following stack trace:

      2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
      	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      	at java.lang.Thread.run(Thread.java:745)
      2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
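
To illustrate the failure mode in the trace above, here is a minimal sketch of a delete that treats "node already gone" as success instead of a fatal error. It does not use the ZooKeeper client and is not taken from the attached patches: `FakeStore`, the local `NoNodeException`, `deleteIfPresent` and the `/rmstore/app_1` path are all hypothetical names made up for this example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentDeleteSketch {

    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the ZK node tree. */
    static class FakeStore {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();

        void create(String path) {
            nodes.add(path);
        }

        /** Throws if the node is missing, like a plain delete would. */
        void delete(String path) throws NoNodeException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }
    }

    /**
     * Deletes the node and reports whether this call removed it.
     * A missing node means the desired end state is already reached
     * (for example, an earlier attempt succeeded), so it is not escalated.
     */
    static boolean deleteIfPresent(FakeStore store, String path) {
        try {
            store.delete(path);
            return true;
        } catch (NoNodeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.create("/rmstore/app_1");
        // First delete removes the node; the repeated delete is a no-op.
        System.out.println(deleteIfPresent(store, "/rmstore/app_1"));
        System.out.println(deleteIfPresent(store, "/rmstore/app_1"));
    }
}
```

In the sketch a repeated delete returns false rather than throwing, so a caller would not need to raise a fatal event for it. Whether the real fix catches the exception, checks for existence first, or takes another approach is not stated in this description.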
      

      Attachments

        1. YARN-3385.000.patch
          8 kB
          Zhihai Xu
        2. YARN-3385.001.patch
          8 kB
          Zhihai Xu
        3. YARN-3385.002.patch
          10 kB
          Zhihai Xu
        4. YARN-3385.003.patch
          10 kB
          Zhihai Xu
        5. YARN-3385.004.patch
          10 kB
          Zhihai Xu

            People

              zxu Zhihai Xu
              zxu Zhihai Xu
              Votes: 0
              Watchers: 11