Updating the patch with the review comments.
To handle the case where a JVM is unregistered before it gets a task, we should remove it from launchedJVMs during unregister. Once we do this, we should think carefully about synchronization issues.
Good catch. Uploading a patch which removes the JVM from the launchedJVMs set in the unregister call, before removing it from jvmIDToActiveAttemptMap. The ordering of events between unregister and getTask should take care of synchronization issues.
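For illustration only, the ordering described above could be sketched roughly as follows. This is a simplified, hypothetical model and not the actual patch: the class name, the String-typed IDs, the KILL sentinel, and the method bodies are assumptions; only the launchedJVMs and jvmIDToActiveAttemptMap names and the unregister / getTask calls come from the discussion.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified, hypothetical model of the JVM bookkeeping discussed above.
class JvmRegistrySketch {
    // Hypothetical sentinel meaning "this JVM is unknown and should exit".
    static final String KILL = "KILL";

    private final ConcurrentMap<String, String> jvmIDToActiveAttemptMap =
        new ConcurrentHashMap<>();
    private final Set<String> launchedJVMs = ConcurrentHashMap.newKeySet();

    // A task has been assigned to the JVM, but the container is not yet launched.
    void registerPendingTask(String jvmId, String task) {
        jvmIDToActiveAttemptMap.put(jvmId, task);
    }

    // The container launch for this JVM has completed.
    void registerLaunchedTask(String jvmId) {
        launchedJVMs.add(jvmId);
    }

    // Remove from launchedJVMs first, then from the attempt map, so a JVM
    // that is unregistered before it asks for a task leaves no stale entry.
    void unregister(String jvmId) {
        launchedJVMs.remove(jvmId);
        jvmIDToActiveAttemptMap.remove(jvmId);
    }

    // Returns KILL for an unknown JVM, null if the JVM should retry later,
    // or the task assigned to the JVM.
    String getTask(String jvmId) {
        if (!jvmIDToActiveAttemptMap.containsKey(jvmId)) {
            return KILL;
        }
        if (!launchedJVMs.contains(jvmId)) {
            return null; // launch not yet recorded; caller retries
        }
        String task = jvmIDToActiveAttemptMap.remove(jvmId);
        launchedJVMs.remove(jvmId);
        // A concurrent unregister may have removed the entry after the check above.
        return task == null ? KILL : task;
    }
}
```

In this sketch, a getTask call that races with unregister ends up either with the task or with KILL, and neither collection retains an entry for the JVM afterwards.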
We went through a couple of iterations on this part of the code, so let us make sure things are fine by running the AMScalability benchmark (100K maps) once.
Have already run a sort benchmark with the previous patch and the patch from MAPREDUCE-3596, which passed. Can run AMScalability as well, but this issue has never been seen with AMScalability (it shows up primarily when shuffle starts and the startContainer calls slow down).
Another possible change is to have TaskAttemptListener / TaskHeartbeatHandler throw exceptions for calls from unregistered tasks. Currently the AM relies on the NM stopContainer call to kill these tasks. Opening a separate JIRA for this, and another for the NM startContainer calls slowing down.
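As a rough sketch of that proposal (not an existing implementation), a listener could reject calls from attempts it no longer tracks. The class name, the String attempt IDs, the statusUpdate method shape, and the choice of IllegalStateException are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: reject umbilical-style calls from unregistered attempts.
class UnregisteredTaskGuardSketch {
    private final Set<String> registeredAttempts = ConcurrentHashMap.newKeySet();

    void register(String attemptId) {
        registeredAttempts.add(attemptId);
    }

    void unregister(String attemptId) {
        registeredAttempts.remove(attemptId);
    }

    // Stand-in for any call a task makes to the AM (status update, ping, etc.).
    void statusUpdate(String attemptId) {
        if (!registeredAttempts.contains(attemptId)) {
            // The exception type is an assumption; the point is that the
            // caller learns immediately that it is no longer wanted.
            throw new IllegalStateException(
                "Call from unregistered task attempt: " + attemptId);
        }
        // Normal processing of the update would go here.
    }
}
```

With a guard like this, a stray task would get an error on its next call instead of running until the NM stopContainer call takes effect.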