Sorry for joining the discussion late, as I missed this originally. As I mentioned in YARN-5290, having the RM wait until the NM confirms container release can unnecessarily slow down subsequent allocations on other nodes due to scheduler limits (user limit, queue limit, etc.). We could leverage some form of NM queuing, but I agree it could be confusing when the AM launches a container and it doesn't appear to be active afterwards when querying the node.
We could have the RM wait until it receives hard confirmation from the NM before it releases the resources associated with a container, but that would needlessly slow down scheduling in some cases. For example, if a user is at the scheduler user limit but releases a container on node A, I don't see why we have to wait until that container is confirmed dead over two subsequent NM heartbeats (one to tell the NM to shoot it and another to confirm it's dead) before allowing the user to allocate another container of the same size on node B. However, I do think it's bad for us to allocate the new container on the same node as the released one, since we can accidentally overwhelm the node if the old container isn't cleaned up fast enough.
Therefore I propose that we go ahead and let the scheduler queues and user limit computations update immediately so other nodes can be scheduled, but we don't release the resources in the SchedulerNode itself until the node confirms a previously running container is dead. IMHO if the RM ever sees a container in the RUNNING state on a node, it should never think that node has freed the resources for that container until the node itself says that container has completed.
Here's a prototype patch against branch-2.7 that is similar to what we're using internally to work around this issue. It goes ahead and releases the resources for running containers in the scheduler bookkeeping (i.e., cluster resource, queues, user limits, etc.) but not in the SchedulerNode. So the RM could allocate those resources elsewhere, but not on the current node, until the node reports the container as completed.
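To make the split bookkeeping concrete, here is a minimal sketch of the idea, not the patch itself. The class and method names (DeferredReleaseSketch, releaseRunning, nodeConfirmedCompleted) are hypothetical and do not correspond to real YARN scheduler classes; a single per-user limit stands in for all queue and user limit accounting, and a plain per-node memory counter stands in for the SchedulerNode.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the proposal; not the real YARN scheduler. */
class DeferredReleaseSketch {
    // Stands in for queue / user-limit bookkeeping (assumption: one global user limit).
    private final Map<String, Integer> userUsedMb = new HashMap<>();
    // Stands in for per-node SchedulerNode bookkeeping.
    private final Map<String, Integer> nodeUsedMb = new HashMap<>();
    private final Map<String, Integer> nodeCapacityMb = new HashMap<>();
    private final int userLimitMb;

    DeferredReleaseSketch(int userLimitMb) {
        this.userLimitMb = userLimitMb;
    }

    void addNode(String node, int capacityMb) {
        nodeCapacityMb.put(node, capacityMb);
        nodeUsedMb.put(node, 0);
    }

    /** Allocation must pass both the user limit and the node's own accounting. */
    boolean allocate(String user, String node, int mb) {
        Integer capacity = nodeCapacityMb.get(node);
        if (capacity == null) {
            return false; // unknown node
        }
        int userUsed = userUsedMb.getOrDefault(user, 0);
        int nodeUsed = nodeUsedMb.get(node);
        if (userUsed + mb > userLimitMb || nodeUsed + mb > capacity) {
            return false;
        }
        userUsedMb.put(user, userUsed + mb);
        nodeUsedMb.put(node, nodeUsed + mb);
        return true;
    }

    /** A running container is released: update user/queue accounting immediately. */
    void releaseRunning(String user, int mb) {
        userUsedMb.merge(user, -mb, Integer::sum);
        // Deliberately leave nodeUsedMb untouched until the node confirms.
    }

    /** The node reports the container completed: only now free the node's resources. */
    void nodeConfirmedCompleted(String node, int mb) {
        nodeUsedMb.merge(node, -mb, Integer::sum);
    }
}
```

With a 4096 MB user limit and two 4096 MB nodes, a user who fills node A and then releases that container can immediately allocate on node B, while node A keeps rejecting new allocations until nodeConfirmedCompleted is called for it.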
NOTE: with any of these "wait until the node says the container is done" approaches, it's important to also pick up the fix for YARN-5197. Otherwise, if the NM ever skips sending a container completion event, the RM will leak those resources on the node.
There is an interesting corner case where the RM has handed out a container to an AM (i.e., the container is in the ACQUIRED state) but hasn't seen it running on a node yet. If the container is killed by the RM or AM, there's still a chance that the container could appear on the node after the RM has considered those resources freed. We'll have to decide how to handle that race. One way to solve it is to assume the container's resources could still be "used" until the RM has had a chance to tell the NM that the container token for that container is no longer valid, and has confirmed in a subsequent NM heartbeat that the container has not appeared since. Maybe there's a simpler/faster way to safely free the container's resources for that race condition?
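For illustration only, the two-heartbeat option described above could be modeled roughly as follows. This is a sketch under assumptions, not a proposed implementation: the class name, the Phase enum, and its values are all hypothetical, it tracks a single node, and it assumes the token revocation rides on the response to the first heartbeat after the kill.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the ACQUIRED-then-killed race for a single node. */
class AcquiredKillSketch {
    /** Hypothetical phases; these are not real YARN container states. */
    enum Phase { KILL_REQUESTED, REVOKE_SENT, FREED, SEEN_ON_NODE }

    private final Map<String, Phase> phases = new HashMap<>();

    /** The RM or AM kills a container that was ACQUIRED but never seen running. */
    void killAcquired(String containerId) {
        phases.put(containerId, Phase.KILL_REQUESTED);
    }

    /**
     * Process one NM heartbeat. reportedContainers holds the container IDs
     * the node says it currently has.
     */
    void onHeartbeat(Collection<String> reportedContainers) {
        for (Map.Entry<String, Phase> e : phases.entrySet()) {
            Phase phase = e.getValue();
            if (phase == Phase.FREED) {
                continue;
            }
            if (reportedContainers.contains(e.getKey())) {
                // The container showed up after all: keep its resources
                // charged to the node until the node reports completion.
                e.setValue(Phase.SEEN_ON_NODE);
            } else if (phase == Phase.KILL_REQUESTED) {
                // Assumption: this heartbeat's response tells the NM the
                // container token is no longer valid.
                e.setValue(Phase.REVOKE_SENT);
            } else if (phase == Phase.REVOKE_SENT) {
                // A later heartbeat still shows no such container, so the
                // node's resources can be considered free.
                e.setValue(Phase.FREED);
            }
        }
    }

    Phase phaseOf(String containerId) {
        return phases.get(containerId);
    }
}
```

The cost is the same two heartbeats discussed earlier, but only node-local resources are held back for that long; user and queue accounting could still be updated immediately.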