Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 2.6.0
- Fix Version: None
- Component: None
Description
A race condition in ZKRMStateStore#createWithRetries, triggered by a ZooKeeper connection loss, causes the RM to crash.
The sequence of events in the race is the following:
1. The RM stored the attempt state to ZK by calling createWithRetries.
2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_000001 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_000001,
2. Unluckily, a ConnectionLoss on the ZK session happened at the same time as the RM stored the attempt state to ZK. The ZooKeeper server created the node and stored the data successfully, but because of the ConnectionLoss the RM did not know that the operation (createWithRetries) had succeeded.
2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
3. The RM retried storing the attempt state to ZK after one second.
2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
4. During the one-second interval, the ZK session was reconnected.
2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session
2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 10000
5. Because the node had already been created successfully on ZooKeeper in the first try (runWithCheck), the second try failed with a NodeExists KeeperException.
2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
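The failure mode in steps 1-5 can be reproduced with a small, self-contained simulation. Everything below (FakeZk, the exception classes, the retry loop) is a hypothetical stand-in for the real ZooKeeper client and the ZKRMStateStore retry logic, not the actual YARN code: the server applies the create, the reply is lost, and the naive retry then observes NodeExists.

```java
import java.util.HashMap;
import java.util.Map;

public class ZkCreateRace {
    static class ConnectionLoss extends Exception {}
    static class NodeExists extends Exception {}

    /** In-memory stand-in for a ZooKeeper ensemble (illustrative only). */
    static class FakeZk {
        final Map<String, byte[]> nodes = new HashMap<>();
        boolean loseNextReply;  // drop the client reply after a server-side write

        void create(String path, byte[] data) throws ConnectionLoss, NodeExists {
            if (nodes.containsKey(path)) throw new NodeExists();
            nodes.put(path, data);            // step 2: the server stores the node...
            if (loseNextReply) {
                loseNextReply = false;
                throw new ConnectionLoss();   // ...but the client never sees success
            }
        }
    }

    /** Naive retry: re-issues create() on ConnectionLoss. Returns the final outcome. */
    static String createWithRetries(FakeZk zk, String path) {
        for (int retry = 0; retry < 2; retry++) {
            try {
                zk.create(path, new byte[0]);
                return "OK";
            } catch (ConnectionLoss e) {
                // step 3: retry after the session reconnects
            } catch (NodeExists e) {
                return "NodeExists";          // step 5: reported as a store failure
            }
        }
        return "MaxedOut";
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.loseNextReply = true;
        System.out.println(createWithRetries(zk, "/rmstore/app_1/attempt_1"));
        // prints "NodeExists" even though the node was stored successfully
    }
}
```

The key point the sketch shows: create is not idempotent, so a retry wrapper that treats ConnectionLoss as "the operation did not happen" can turn a successful write into a fatal NodeExists.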
6. This NodeExists KeeperException caused the store of the AppAttempt to fail in RMStateStore.
2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_000001 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
7. RMStateStore then sent an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager:
protected void notifyStoreOperationFailed(Exception failureCause) {
  RMFatalEventType type;
  if (failureCause instanceof StoreFencedException) {
    type = RMFatalEventType.STATE_STORE_FENCED;
  } else {
    type = RMFatalEventType.STATE_STORE_OP_FAILED;
  }
  rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
}
8. The ResourceManager killed itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.
2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
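For contrast, one direction for making the create idempotent across a lost reply can be sketched as follows. This is an illustrative sketch with a hypothetical in-memory FakeZk standing in for ZooKeeper, not the fix actually committed upstream (this issue was resolved as a duplicate): treat a NodeExists that follows a ConnectionLoss as our own earlier success. A production version should additionally verify that the existing node's data matches what we tried to write.

```java
import java.util.HashMap;
import java.util.Map;

public class SafeCreateRetry {
    static class ConnectionLoss extends Exception {}
    static class NodeExists extends Exception {}

    /** In-memory stand-in for a ZooKeeper ensemble (illustrative only). */
    static class FakeZk {
        final Map<String, byte[]> nodes = new HashMap<>();
        boolean loseNextReply;  // drop the client reply after a server-side write

        void create(String path, byte[] data) throws ConnectionLoss, NodeExists {
            if (nodes.containsKey(path)) throw new NodeExists();
            nodes.put(path, data);
            if (loseNextReply) { loseNextReply = false; throw new ConnectionLoss(); }
        }
    }

    /**
     * Idempotent create: a NodeExists seen after at least one ConnectionLoss is
     * assumed to mean our own earlier create was applied server-side. A NodeExists
     * on a clean first attempt is still surfaced as a real error.
     */
    static boolean createIdempotent(FakeZk zk, String path, int maxRetries)
            throws NodeExists {
        boolean sawConnectionLoss = false;
        for (int retry = 0; retry <= maxRetries; retry++) {
            try {
                zk.create(path, new byte[0]);
                return true;
            } catch (ConnectionLoss e) {
                sawConnectionLoss = true;   // our create may already have been applied
            } catch (NodeExists e) {
                if (sawConnectionLoss) {
                    return true;            // likely our own earlier write landed
                }
                throw e;                    // genuinely pre-existing node
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        FakeZk zk = new FakeZk();
        zk.loseNextReply = true;
        System.out.println(createIdempotent(zk, "/rmstore/attempt", 2));
        // prints "true": the lost-reply create is recognized as a success
    }
}
```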