I think we need to fix the RMContainerImpl ALLOCATED to KILLED transition, but I think there's another bug here. I believe the container was killed in the first place because the RMNodeImpl reconnect transition makes an assumption that is racy. When the node reconnects, it checks whether the node reports no applications running. If it has no applications then it sends a removed node event followed by an added node event to the scheduler. This will cause the scheduler to kill all containers allocated on that node. However, the node will only know about a container once the AM acquires the container and tries to launch it on the node. That can take minutes to transpire, so it's dangerous to assume that a node not reporting any applications means it doesn't have anything pending.
I think we'll have to revisit the solution to YARN-2561 to either eliminate this race or make it safe if it does occur. Ideally we shouldn't be sending a remove/add event to the scheduler if the node is reconnecting, but we need to make sure we cancel containers on the node that are no longer running. Since the node reports what containers it has when it reconnects, it seems like we can convey that information to the scheduler to correct anything that doesn't match up. Any container in the RUNNING state that no longer appears in the list of containers when registering can be killed by the scheduler, as it does when a node is removed, and I believe that will fix YARN-2561 and also avoid this race.
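To make the proposal concrete, here is a rough standalone sketch of the reconciliation I have in mind. It is not the actual scheduler code: the class ReconnectReconciler, the State enum, and the containersToKill method are hypothetical names made up for illustration, and container IDs are plain strings rather than real ContainerId objects. The point is only that containers the scheduler believes are RUNNING but the node did not report get killed, while ALLOCATED or ACQUIRED containers that have not reached the node yet are left alone.

```java
import java.util.*;

// Hypothetical illustration only; none of these names exist in YARN.
public class ReconnectReconciler {
    // Simplified stand-in for the scheduler's view of a container's state.
    enum State { ALLOCATED, ACQUIRED, RUNNING }

    // Returns the containers the scheduler should kill when a node reconnects:
    // those it believes are RUNNING that the node did not report.
    // ALLOCATED and ACQUIRED containers are skipped because the node may
    // simply not have heard about them yet.
    static Set<String> containersToKill(Map<String, State> schedulerView,
                                        Set<String> reportedByNode) {
        Set<String> toKill = new TreeSet<>();
        for (Map.Entry<String, State> e : schedulerView.entrySet()) {
            if (e.getValue() == State.RUNNING
                    && !reportedByNode.contains(e.getKey())) {
                toKill.add(e.getKey());
            }
        }
        return toKill;
    }

    public static void main(String[] args) {
        Map<String, State> schedulerView = new HashMap<>();
        schedulerView.put("c1", State.RUNNING);   // still reported by the node
        schedulerView.put("c2", State.RUNNING);   // missing from the node's report
        schedulerView.put("c3", State.ALLOCATED); // AM has not acquired it yet
        schedulerView.put("c4", State.ACQUIRED);  // AM has not launched it yet

        Set<String> reported = new HashSet<>(Arrays.asList("c1"));
        System.out.println(containersToKill(schedulerView, reported)); // prints [c2]
    }
}
```

With this shape the reconnect path never needs the blanket remove/add pair, so a container that is still in flight to the node survives the reconnect.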
cc: Junping Du as this also has potential ramifications for graceful decommission. If we try to gracefully decommission a node that isn't currently reporting applications, we may also need to verify that the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet.