Seems like it's possible to avoid multiple AMs by tuning the AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS (6 minutes by default). A new AM should only be started after the existing AM is done.
That almost solves the problem, but there are some corner cases left unsolved. For example:
1) The AM is running on a node whose NM suddenly declares itself UNHEALTHY via its health-check script
2) The RM removes the node from the active nodes and kills all containers running on that node
3) A network cut occurs. The NM never receives the notification to kill the containers, and/or the NM crashes. The AM is unable to communicate with the RM.
4) The RM now thinks all containers on that node are dead and proceeds to launch a new AM attempt
5) Now for the next 6 minutes (or whatever the AM-to-RM expiry interval is) we have two app attempts running simultaneously. If the old AM attempt can still reach HDFS or whatever it needs to commit, we could end up committing twice.
Could add a check to ensure the window interval is greater than the AM-RM heartbeat interval.
Actually, that's not strictly necessary; the code functions correctly even if the commit window is smaller than the heartbeat interval. Job commit is woken up when a fresh heartbeat arrives, and task commit periodically polls whether a heartbeat has occurred recently. The interval between heartbeats doesn't have to be smaller than the commit window for a commit to proceed, but if it isn't, it's more likely a commit operation will stall waiting for a fresh heartbeat.
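To make the commit-window idea concrete, here is a minimal sketch of the check described above: a commit is allowed only if the RM has been heard from within the window, so an attempt that has lost contact (and may have been superseded) refuses to commit. All names here (HeartbeatSource, CommitWindowCheck, commitWindowMs) are illustrative, not the actual MapReduce implementation.

```java
public class CommitWindowCheck {
    /** Illustrative stand-in for whatever tracks RM heartbeats. */
    interface HeartbeatSource {
        long getLastHeartbeatTime(); // timestamp of the last successful RM heartbeat
    }

    private final HeartbeatSource heartbeats;
    private final long commitWindowMs;

    CommitWindowCheck(HeartbeatSource heartbeats, long commitWindowMs) {
        this.heartbeats = heartbeats;
        this.commitWindowMs = commitWindowMs;
    }

    /**
     * A commit may proceed only if a heartbeat was received within the
     * commit window, i.e. the RM has not had time to declare this attempt
     * dead and launch a replacement.
     */
    boolean canCommit(long now) {
        return now - heartbeats.getLastHeartbeatTime() < commitWindowMs;
    }
}
```

Note that this works even when the heartbeat interval exceeds the window: the commit simply stalls until the next heartbeat refreshes the timestamp.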
Does getClock() need to be part of RMHeartbeatHandler? Looks like the AppContext can provide this.
I put it in the interface so the caller can access the same clock used to timestamp the heartbeat, in case it differed from the AppContext clock or the caller didn't have access to the AppContext. But that's probably never going to be a real concern, so I'll take it out.
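For illustration, here is roughly what the slimmed-down handler interface would look like, with a toy in-memory implementation showing how a blocked job commit gets woken by a fresh heartbeat. The method names are assumptions based on this discussion, not the exact patch; getClock() is omitted since callers can use the AppContext clock.

```java
import java.util.ArrayList;
import java.util.List;

interface RMHeartbeatHandler {
    long getLastHeartbeatTime();
    void runOnNextHeartbeat(Runnable callback); // e.g. wakes a waiting job commit
    // Clock getClock();  // dropped: the AppContext clock suffices
}

class SimpleRMHeartbeatHandler implements RMHeartbeatHandler {
    private long lastHeartbeat;
    private final List<Runnable> callbacks = new ArrayList<>();

    public synchronized long getLastHeartbeatTime() {
        return lastHeartbeat;
    }

    public synchronized void runOnNextHeartbeat(Runnable callback) {
        callbacks.add(callback);
    }

    /** Invoked when a fresh heartbeat response arrives from the RM. */
    public synchronized void heartbeatReceived(long timestamp) {
        lastHeartbeat = timestamp;
        for (Runnable cb : callbacks) {
            cb.run();
        }
        callbacks.clear();
    }
}
```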
And to address Bikas' comment:
Independent of this change, this looks like a problem that needs to be solved in the platform rather than in the AM.
We might be able to close all the corner cases in the framework. For example, the above scenario could be solved if the RM were to wait for confirmation from the NM of the containers actually expiring before proceeding to launch another attempt. If the NM is unreachable before the confirmation is received, it could wait for the AM expiry interval before launching a new attempt. It could mean that we wait a lot longer than necessary, but at least we'd know with confidence that two attempts aren't running simultaneously.
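The proposed RM-side policy could be sketched as a simple decision function: launch a new attempt immediately only once the NM confirms the old containers are gone, and otherwise wait out the full AM expiry interval. This is a hedged sketch of the idea above; all names (AttemptRelaunchPolicy, onNodeLost, and both parameters) are hypothetical.

```java
enum Decision { LAUNCH_NEW_ATTEMPT, WAIT }

class AttemptRelaunchPolicy {
    private final long amExpiryMs;

    AttemptRelaunchPolicy(long amExpiryMs) {
        this.amExpiryMs = amExpiryMs;
    }

    Decision onNodeLost(boolean nmConfirmedContainersDead,
                        long msSinceAmLastHeard) {
        if (nmConfirmedContainersDead) {
            // Safe: the old attempt's containers are known to be gone.
            return Decision.LAUNCH_NEW_ATTEMPT;
        }
        // NM unreachable: wait out the full AM expiry interval so the old
        // attempt cannot still be running when the new one starts.
        return msSinceAmLastHeard >= amExpiryMs
                ? Decision.LAUNCH_NEW_ATTEMPT
                : Decision.WAIT;
    }
}
```

As noted above, this trades latency for safety: relaunch may be delayed well past the actual container deaths, but two attempts can never overlap.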