Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
Description
In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited for a bit, then restarted the RM. The AM is restarted, but running agents never connect to the new AM. The AM data is re-read from the ZK registry once if the heartbeat retry threshold is reached, at which point the agent tries re-registering with the AM. However, if the AM data is stale at that point, it never re-reads the data from the ZK registry, and retries registering with the nonexistent AM forever (until it is timed out due to heartbeat loss and killed by the new AM).
Note this happens when AM restart is delayed more than about a minute, which can occur if the RM is down or the RM is up but busy.