Hadoop YARN / YARN-3023

Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes RM crash


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      A race condition in ZKRMStateStore#createWithRetries against ZooKeeper causes the RM to crash.

      The sequence of the race condition is the following:
      1. The RM stores the attempt state to ZK by calling createWithRetries.

      2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_000001 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_000001,
      

      2. Unluckily, a ConnectionLoss on the ZK session happened while the RM was storing the attempt state to ZK.
      The ZooKeeper server created the node and stored the data successfully, but because of the ConnectionLoss the RM did not know that the operation (createWithRetries) had succeeded.

      2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
      

      3. The RM retried storing the attempt state to ZK after one second.

      2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
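      For illustration, the retry behaviour in steps 1-3 boils down to a loop like the condensed sketch below. This is not the actual ZKRMStateStore code (the real implementation goes through runWithRetries/runWithCheck); the class name, retry budget, and interval constant are assumptions chosen to line up with the log timestamps above.

        import java.util.List;

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.ACL;

        // Illustrative sketch only: a blind create-with-retries loop.
        public class CreateWithRetriesSketch {
          private static final int NUM_RETRIES = 1;           // assumed retry budget
          private static final long RETRY_INTERVAL_MS = 1000; // matches the ~1s gap in the logs

          public static void createWithRetries(ZooKeeper zk, String path, byte[] data)
              throws KeeperException, InterruptedException {
            List<ACL> acl = ZooDefs.Ids.OPEN_ACL_UNSAFE;
            for (int retry = 0; ; retry++) {
              try {
                // The server may persist the znode even if the client never sees the reply.
                zk.create(path, data, acl, CreateMode.PERSISTENT);
                return;
              } catch (KeeperException.ConnectionLossException e) {
                if (retry >= NUM_RETRIES) {
                  throw e; // out of retries: give up and surface the error
                }
                Thread.sleep(RETRY_INTERVAL_MS);
                // The retry blindly re-issues create() for a node that may already
                // exist on the server, which is exactly the race described here.
              }
            }
          }
        }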
      

      4. During the one-second interval, the ZK session was reconnected.

      2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session
      2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 10000
      

      5. Because the node was already created successfully on ZooKeeper in the first try (runWithCheck),
      the second try fails with a NodeExists KeeperException.

      2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
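      One way the retry could tolerate this situation is to treat NodeExists as success when the existing node already holds the data that was being written. The helper below is only a hypothetical illustration of that idempotent-create idea; it is not the shipped fix (this issue was resolved as a duplicate).

        import java.util.Arrays;

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        // Hypothetical idempotent create: if the node already exists after a
        // ConnectionLoss retry, verify its contents instead of failing the
        // whole store operation (and, through that, the RM).
        public class IdempotentCreateSketch {
          public static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
              throws KeeperException, InterruptedException {
            try {
              zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
              // The first, seemingly lost create may have succeeded on the server.
              byte[] existing = zk.getData(path, false, new Stat());
              if (!Arrays.equals(existing, data)) {
                throw e; // genuinely conflicting node: surface the error
              }
              // Otherwise the node already holds what we tried to write; treat as success.
            }
          }
        }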
      

      6. This NodeExists KeeperException causes storing the AppAttempt to fail in RMStateStore.

      2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_000001
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      

      7. RMStateStore then sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager:

        protected void notifyStoreOperationFailed(Exception failureCause) {
          RMFatalEventType type;
          if (failureCause instanceof StoreFencedException) {
            type = RMFatalEventType.STATE_STORE_FENCED;
          } else {
            // Any other store failure, including the NodeExists above,
            // becomes a fatal STATE_STORE_OP_FAILED event.
            type = RMFatalEventType.STATE_STORE_OP_FAILED;
          }
          rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
        }
      

      8. The ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.

      2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
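      For context, the handler on the ResourceManager side presumably amounts to something like the condensed sketch below; only the exit-on-fatal behaviour implied by the FATAL log line and "Exiting with status 1" above is shown.

        import org.apache.hadoop.util.ExitUtil;
        import org.apache.hadoop.yarn.event.EventHandler;
        import org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent;

        // Condensed sketch of the RM's fatal-event handling as implied by the
        // log above: any RMFatalEvent ends with the process exiting with status 1.
        public class RMFatalEventDispatcherSketch implements EventHandler<RMFatalEvent> {
          @Override
          public void handle(RMFatalEvent event) {
            // Corresponds to: "Received a ...RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: ..."
            ExitUtil.terminate(1, String.valueOf(event.getCause()));
          }
        }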
      


            People

              Assignee: Zhihai Xu (zxu)
              Reporter: Zhihai Xu (zxu)