Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
0.23.5
None
Reviewed
Description
If an NM goes down while the AM is still trying to launch a container on it, the ContainerLauncherImpl can get stuck waiting on an RPC timeout. Meanwhile, the RM may notice that the NM has gone away and inform the AM, which triggers a TA_FAILMSG. If the TA_FAILMSG arrives at the TaskAttemptImpl before the TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try to kill the container, but the ContainerLauncherImpl never sends back a TA_CONTAINER_CLEANED event, so the attempt stays stuck.
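The ordering problem described above can be illustrated with a toy model. This is only a sketch, not the actual TaskAttemptImpl state machine: the event names come from this report, while the class name, the state names (LAUNCHING, WAITING_FOR_CLEANUP, FAILED) and the transition table are hypothetical simplifications chosen for illustration.

```java
// Toy model of the event ordering in this report. NOT Hadoop code:
// the states and transitions below are assumptions made for illustration.
class StuckAttemptSketch {
    // Hypothetical, simplified attempt states.
    enum State { LAUNCHING, WAITING_FOR_CLEANUP, FAILED }

    // Event names taken from the bug description.
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    // Assumed transition table for the sketch.
    static State transition(State state, Event event) {
        switch (state) {
            case LAUNCHING:
                if (event == Event.TA_FAILMSG) {
                    // The attempt asks for the container to be killed and
                    // then waits for TA_CONTAINER_CLEANED.
                    return State.WAITING_FOR_CLEANUP;
                }
                if (event == Event.TA_CONTAINER_LAUNCH_FAILED) {
                    return State.FAILED;
                }
                return state;
            case WAITING_FOR_CLEANUP:
                if (event == Event.TA_CONTAINER_CLEANED) {
                    return State.FAILED;
                }
                // Anything else, including a late launch failure, is ignored here.
                return state;
            default:
                return state;
        }
    }

    // Feeds the events to the toy state machine in the given order.
    static State run(Event... events) {
        State state = State.LAUNCHING;
        for (Event event : events) {
            state = transition(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Launch failure arrives first: the attempt reaches a terminal state.
        System.out.println(run(Event.TA_CONTAINER_LAUNCH_FAILED));
        // TA_FAILMSG arrives first and no TA_CONTAINER_CLEANED ever follows:
        // the attempt remains in WAITING_FOR_CLEANUP.
        System.out.println(run(Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED));
    }
}
```

In this model the attempt leaves WAITING_FOR_CLEANUP only when a TA_CONTAINER_CLEANED event is delivered, which is the event the description says never arrives in the failing order.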