Hadoop YARN / YARN-3023

Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes RM crash


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      A race condition in ZKRMStateStore#createWithRetries against ZooKeeper causes the RM to crash.

      The sequence of the race condition is the following:
      1. The RM stores the attempt state to ZK by calling createWithRetries.

      2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_000001 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_000001,
      

      2. Unluckily, a ConnectionLoss on the ZK session happened while the RM was storing the attempt state to ZK.
      The ZooKeeper server created the node and stored the data successfully, but because of the ConnectionLoss the RM did not know that the operation (createWithRetries) had succeeded.

      2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
      

      3. The RM retried storing the attempt state to ZK after one second.

      2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
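      For illustration, the retry behaviour in steps 1-3 boils down to a loop like the condensed sketch below. This is not the actual ZKRMStateStore code (the real implementation goes through runWithRetries/runWithCheck); the class name, retry budget, and interval constant are assumptions chosen to line up with the log timestamps above.

        import java.util.List;

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.ACL;

        // Illustrative sketch only: a blind create-with-retries loop.
        public class CreateWithRetriesSketch {
          private static final int NUM_RETRIES = 1;           // assumed retry budget
          private static final long RETRY_INTERVAL_MS = 1000; // matches the ~1s gap in the logs

          public static void createWithRetries(ZooKeeper zk, String path, byte[] data)
              throws KeeperException, InterruptedException {
            List<ACL> acl = ZooDefs.Ids.OPEN_ACL_UNSAFE;
            for (int retry = 0; ; retry++) {
              try {
                // The server may persist the znode even if the client never sees the reply.
                zk.create(path, data, acl, CreateMode.PERSISTENT);
                return;
              } catch (KeeperException.ConnectionLossException e) {
                if (retry >= NUM_RETRIES) {
                  throw e; // out of retries: give up and surface the error
                }
                Thread.sleep(RETRY_INTERVAL_MS);
                // The retry blindly re-issues create() for a node that may already
                // exist on the server, which is exactly the race described here.
              }
            }
          }
        }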
      

      4. During the one-second interval, the ZK session was reconnected.

      2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session
      2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 10000
      

      5. Because the node was already created successfully on ZooKeeper in the first try (runWithCheck),
      the second try fails with a NodeExists KeeperException.

      2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
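      One way the retry could tolerate this situation is to treat NodeExists as success when the existing node already holds the data that was being written. The helper below is only a hypothetical illustration of that idempotent-create idea; it is not the shipped fix (this issue was resolved as a duplicate).

        import java.util.Arrays;

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        // Hypothetical idempotent create: if the node already exists after a
        // ConnectionLoss retry, verify its contents instead of failing the
        // whole store operation (and, through that, the RM).
        public class IdempotentCreateSketch {
          public static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
              throws KeeperException, InterruptedException {
            try {
              zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
              // The first, seemingly lost create may have succeeded on the server.
              byte[] existing = zk.getData(path, false, new Stat());
              if (!Arrays.equals(existing, data)) {
                throw e; // genuinely conflicting node: surface the error
              }
              // Otherwise the node already holds what we tried to write; treat as success.
            }
          }
        }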
      

      6. This NodeExists KeeperException causes storing the AppAttempt to fail in RMStateStore.

      2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_000001
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      

      7. RMStateStore then sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager:

        protected void notifyStoreOperationFailed(Exception failureCause) {
          RMFatalEventType type;
          if (failureCause instanceof StoreFencedException) {
            type = RMFatalEventType.STATE_STORE_FENCED;
          } else {
            // Any other store failure, including the NodeExists above,
            // becomes a fatal STATE_STORE_OP_FAILED event.
            type = RMFatalEventType.STATE_STORE_OP_FAILED;
          }
          rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
        }
      

      8. The ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.

      2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
      2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
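      For context, the handler on the ResourceManager side presumably amounts to something like the condensed sketch below; only the exit-on-fatal behaviour implied by the FATAL log line and "Exiting with status 1" above is shown.

        import org.apache.hadoop.util.ExitUtil;
        import org.apache.hadoop.yarn.event.EventHandler;
        import org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent;

        // Condensed sketch of the RM's fatal-event handling as implied by the
        // log above: any RMFatalEvent ends with the process exiting with status 1.
        public class RMFatalEventDispatcherSketch implements EventHandler<RMFatalEvent> {
          @Override
          public void handle(RMFatalEvent event) {
            // Corresponds to: "Received a ...RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: ..."
            ExitUtil.terminate(1, String.valueOf(event.getCause()));
          }
        }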
      


            People

              Assignee: Zhihai Xu (zxu)
              Reporter: Zhihai Xu (zxu)