Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.7.1
-
None
-
Reviewed
Description
ActiveStandbyElector shouldn't call monitorLockNodeAsync before StatCallback for previous zkClient.exists is received.
We saw RM shutdown because ActiveStandbyElector retrying monitorLockNodeAsync exceeded limit. The following is the logs.
Based on the log, it looks like multiple monitorLockNodeAsync are called at the same time due to back-to-back SyncConnected event received.
The current code doesn't prevent zkClient.exists from being called before AsyncCallback.StatCallback for previous zkClient.exists is received.
So the retry for monitorLockNodeAsync doesn't work correctly sometimes.
2015-07-01 19:24:12,806 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6674ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect 2015-07-01 19:24:12,919 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... 2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) 2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43487, server: node-1.internal/192.168.123.3:2181 2015-07-01 19:24:14,707 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 2015-07-01 19:24:14,712 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2015-07-01 19:24:21,374 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6667ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect 2015-07-01 19:24:21,477 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... 2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) 2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43526, server: node-1.internal/192.168.123.3:2181 2015-07-01 19:24:22,641 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 2015-07-01 19:24:22,642 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2015-07-01 19:24:29,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect 2015-07-01 19:24:29,413 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... 2015-07-01 19:24:30,738 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) 2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43574, server: node-1.internal/192.168.123.3:2181 2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 2015-07-01 19:24:30,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2015-07-01 19:24:37,409 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6670ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect 2015-07-01 19:24:37,512 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... 2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) 2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43598, server: node-1.internal/192.168.123.3:2181 2015-07-01 19:24:38,980 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 2015-07-01 19:24:38,981 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2015-07-01 19:24:45,649 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect 2015-07-01 19:24:45,752 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-07-01 19:24:45,855 INFO org.apache.zookeeper.ZooKeeper: Session: 0x14e47693cc20007 closed 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 2015-07-01 19:25:07,932 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type EMBEDDED_ELECTOR_FAILED. Cause: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 2015-07-01 19:25:07,933 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 2015-07-01 19:25:07,933 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-07-01 19:25:08,036 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1