I think this patch should also queue the system directory for the cleanup thread in the code path where job submission fails due to ACLs. In most cases, this alone will prevent the problem from occurring in the first place. It is an addition to the changes in the patch, not a replacement for them: those changes are still needed for cases where the JobTracker is restarted before the cleanup thread has had a chance to delete the system directory completely.
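A rough sketch of what I mean, to make the suggestion concrete. Every name here (`SubmitCleanupSketch`, `cleanupQueue`, `checkAcls`, `AclDeniedException`, the `/system/<jobId>` layout) is hypothetical and not taken from the patch or from the actual JobTracker code; the queue simply stands in for whatever the cleanup thread drains.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: none of these names are taken from the patch or from Hadoop.
class SubmitCleanupSketch {
    // Stand-in for the work queue that the cleanup thread drains.
    static final BlockingQueue<String> cleanupQueue = new LinkedBlockingQueue<>();

    // Stand-in for the exception raised when the ACL check rejects a user.
    static class AclDeniedException extends Exception {
        AclDeniedException(String message) {
            super(message);
        }
    }

    // Rejects any user that is not in the allowed set.
    static void checkAcls(String user, Set<String> allowedUsers) throws AclDeniedException {
        if (!allowedUsers.contains(user)) {
            throw new AclDeniedException("user " + user + " may not submit jobs");
        }
    }

    // Returns true if the job was accepted. On an ACL failure, the directory
    // is handed to the cleanup queue before the submission is rejected.
    static boolean submitJob(String user, String jobId, Set<String> allowedUsers) {
        String systemDir = "/system/" + jobId; // assumed layout, for illustration only
        try {
            checkAcls(user, allowedUsers);
            return true;
        } catch (AclDeniedException e) {
            cleanupQueue.add(systemDir);
            return false;
        }
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("alice"));
        submitJob("mallory", "job_1", allowed);
        System.out.println("queued for cleanup: " + cleanupQueue);
    }
}
```

The point is only that the failure path itself schedules the deletion, so nothing depends on a later restart noticing the leftover directory.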
Regarding cleanup, there seem to be two different cases here:
- The job was never submitted in the first place
- The job was already running, and after a restart it can no longer run because the ACLs were changed.
I think the patch handles the first case cleanly (once the comments are incorporated). In the second case, the JobTracker should ideally kill the job so that everything related to it (system directory, running tasks, cleanup task, etc.) is cleaned up properly. I think handling the second case, which should be rare, belongs in a separate JIRA. Thoughts?
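For illustration, the distinction between the two cases could be expressed roughly as below. This is only a sketch under my own assumptions: `RestartRecoverySketch`, `Action`, and `onRestart` are hypothetical names, and the `RESUME` branch (ACLs still allow the job) is my assumption about the normal path rather than something the patch defines.

```java
// Hypothetical sketch: these names do not exist in the patch or in Hadoop.
class RestartRecoverySketch {
    enum Action {
        RESUME,           // assumed normal path: ACLs still allow the job
        CLEAN_SYSTEM_DIR, // case 1: the job was never submitted
        KILL_JOB          // case 2: the job was running, but the ACLs changed
    }

    // Decides what to do with a job found in the system directory after a restart.
    static Action onRestart(boolean wasRunning, boolean aclsAllow) {
        if (aclsAllow) {
            return Action.RESUME;
        }
        // ACL check failed: a job that was running needs a full kill so its
        // tasks are torn down too; one that never ran only leaves files behind.
        return wasRunning ? Action.KILL_JOB : Action.CLEAN_SYSTEM_DIR;
    }

    public static void main(String[] args) {
        System.out.println(onRestart(false, false));
        System.out.println(onRestart(true, false));
    }
}
```

Only the `CLEAN_SYSTEM_DIR` branch corresponds to what this patch covers; the `KILL_JOB` branch is the part I am suggesting for the separate JIRA.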