Container localizers should already have the concept of heartbeating and killing themselves if they don't hear from the NM within X seconds, and likewise the NM should kill localizers that don't heartbeat in a timely fashion.
For the container localizer, I think the IPC timeout should work, if configured. I will check it.
On the NM side, it currently does not kill localizers. We can track the PID and kill the localizer, as discussed earlier, if no HB arrives within a configured period. We can do something similar to what we do now for containers: SIGTERM followed by SIGKILL, if required.
Should we add this then? This would mean that we will have to launch a new localizer if the container is still running. Or should we fail the container?
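To make the NM-side proposal concrete, here is a rough sketch of tracking localizer PIDs and heartbeat times, then doing a graceful kill followed by a forced kill. `LocalizerWatchdog` and all of its members are hypothetical names, not existing NM classes, and it assumes Java 9+ `ProcessHandle`, where `destroy()` usually maps to SIGTERM and `destroyForcibly()` to SIGKILL on POSIX platforms (the JDK leaves this implementation dependent).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not an existing NM class.
class LocalizerWatchdog {
  private final long hbTimeoutMs;
  private final long killGraceMs;
  // localizerId -> {pid, time of last heartbeat in ms}
  private final Map<String, long[]> tracked = new ConcurrentHashMap<>();

  LocalizerWatchdog(long hbTimeoutMs, long killGraceMs) {
    this.hbTimeoutMs = hbTimeoutMs;
    this.killGraceMs = killGraceMs;
  }

  // Record a heartbeat from a localizer.
  void heartbeat(String localizerId, long pid, long nowMs) {
    tracked.put(localizerId, new long[] {pid, nowMs});
  }

  // Localizers whose last heartbeat is older than the configured timeout.
  List<String> expired(long nowMs) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, long[]> e : tracked.entrySet()) {
      if (nowMs - e.getValue()[1] > hbTimeoutMs) {
        out.add(e.getKey());
      }
    }
    return out;
  }

  // Graceful termination request first; forced kill if the process is
  // still alive after the grace period.
  void kill(String localizerId) throws InterruptedException {
    long[] entry = tracked.remove(localizerId);
    if (entry == null) {
      return;
    }
    Optional<ProcessHandle> handle = ProcessHandle.of(entry[0]);
    if (!handle.isPresent()) {
      return; // process has already exited
    }
    ProcessHandle ph = handle.get();
    ph.destroy();
    try {
      ph.onExit().get(killGraceMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      ph.destroyForcibly();
    } catch (ExecutionException e) {
      ph.destroyForcibly();
    }
  }
}
```

Whether the NM then relaunches a localizer or fails the container is a separate decision that this sketch does not address.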
I'm also not sure we need deletion task cancellation. As you point out it's not really necessary.
Ok. Will remove it.
Also do we really need a flag to say whether we want it to ignore missing paths? Wondering if we should just ignore cases where the path doesn't exist.
Yes, we can simply ignore missing paths. I did not change it so as not to break the previous behavior. It doesn't seem like anyone is depending on this behavior, though.
What if we have the localizer register the temporary working directory (i.e.: the _tmp paths) as deleteOnExit paths?
Currently the _tmp paths are deleted in the finally block in FSDownload#call. Wouldn't that be enough to handle the case of a normal JVM exit?
With this I don't think we need to change the localizer protocol – DIE means try to cleanup, but NM will always cleanup anyway so no need to wait around and try too hard. It's actually more important that the localizer gets out of the way in a timely manner than it is for it to cleanup since the NM will be the backup in case the localizer fails.
For this, the only concern I see is the issue I mentioned finding in FSDownload: the download task keeps running even after cancel, because the code in FSDownload#call is uninterruptible in places.
In this case, we can never know when the cancelled task will complete and create files in the directory. There can be a race which leads, for instance, to the _tmp directory being renamed.
In this case we can list the _tmp directory before the real one in the deletion task, so that the _tmp directory is deleted first.
Also, we can add a few extra seconds to the localizer HB timeout when scheduling file deletion, so that the localizer is killed (assuming we adopt the approach mentioned in the first point above) before we attempt deletion.
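To illustrate the concern, here is a small standalone sketch (not FSDownload code) in which a task that swallows interrupts keeps running after `Future#cancel(true)` has returned. The class and method names are made up for the demo, and the interrupt-swallowing loop merely stands in for whatever uninterruptible section the real download hits.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; simulates an uninterruptible download task.
class UninterruptibleCancelDemo {
  // Returns true if the task was reported cancelled and still did its
  // "write" after the cancel call had returned.
  static boolean taskOutlivesCancel() throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    AtomicBoolean cancelReturned = new AtomicBoolean(false);
    AtomicBoolean wroteAfterCancel = new AtomicBoolean(false);

    Future<?> f = pool.submit(() -> {
      started.countDown();
      boolean done = false;
      while (!done) {
        try {
          release.await();
          done = true;
        } catch (InterruptedException ignored) {
          // Simulates uninterruptible code: the interrupt is swallowed.
        }
      }
      // Stands in for "creates files in the directory".
      wroteAfterCancel.set(cancelReturned.get());
    });

    started.await();
    f.cancel(true);           // interrupts the worker, which ignores it
    cancelReturned.set(true);
    release.countDown();      // let the simulated work finish
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    return f.isCancelled() && wroteAfterCancel.get();
  }
}
```

The future reports itself as cancelled, yet the side effect still happens afterwards, which is the window in which files can appear in the directory.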
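A minimal sketch of the two ideas above, ordering the _tmp directory ahead of the real one and padding the deletion delay beyond the HB timeout. `DeletionPlan`, its method names, and the example path and values in the usage note are placeholders rather than existing configuration or code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; names are not from the codebase.
class DeletionPlan {
  // Paths in the order the deletion task should process them:
  // the _tmp directory first, then the real directory.
  static List<String> orderedPaths(String realDir) {
    List<String> paths = new ArrayList<>();
    paths.add(realDir + "_tmp");
    paths.add(realDir);
    return paths;
  }

  // Delay before deletion runs: the localizer HB timeout plus some slack,
  // so that a stuck localizer has been killed before files are removed.
  static long deletionDelayMs(long hbTimeoutMs, long extraMs) {
    return hbTimeoutMs + extraMs;
  }
}
```

For example, with an assumed HB timeout of 60000 ms and 5000 ms of slack, deletion of `/local/cache/10_tmp` and then `/local/cache/10` would be scheduled 65000 ms out.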
We however would need to change the localizer protocol as well.
Currently the localizer deletes the entry from its pending resources map as soon as it sends a status. The NM will send a DIE in two cases: 1) the container has been killed, or 2) the NM processes the HB and finds one of the statuses to be FETCH_FAILED. In the second case we cannot know whether the resources for which the fetch succeeded were actually processed by the NM or not. So I maintain a separate list for resources reported to localizer. Hence the flag, so that even those resources can be deleted.
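Roughly, the bookkeeping I have in mind looks like the sketch below. `ResourceTracker` and its members are illustrative names, not the actual fields in the patch. Because some of the already-reported paths may have been handled by the NM and no longer exist, deletion of this list has to tolerate missing paths.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the localizer-side bookkeeping; not the patch code.
class ResourceTracker {
  // resource -> local path, for resources whose status has not been sent yet
  private final Map<String, String> pending = new LinkedHashMap<>();
  // local paths of resources whose status has already been sent
  private final List<String> reported = new ArrayList<>();

  void addPending(String resource, String localPath) {
    pending.put(resource, localPath);
  }

  // Called once a status for the resource has been sent in a heartbeat.
  void statusSent(String resource) {
    String path = pending.remove(resource);
    if (path != null) {
      reported.add(path);
    }
  }

  // On DIE: everything still pending plus everything already reported,
  // since we cannot tell whether the reported ones were processed.
  List<String> pathsToCleanOnDie() {
    List<String> paths = new ArrayList<>(pending.values());
    paths.addAll(reported);
    return paths;
  }
}
```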