Affects Version/s: None
Fix Version/s: 0.10.0
TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks which should be closed. This mechanism doesnt pass the information on whether the job is really finished or the task is being killed for some other reason( speculative instance succeeded). Since Tasktracker doesnt know this state it assumes job is finished and deletes local job dir, causing any subsequent tasks on the same task tracker for same job to fail with job.xml not found exception as reported in
HADOOP-546 and possibly in HADOOP-543. This causes my patch for HADOOP-76 to fail for a large number of reduce tasks in some cases.
Same causes extra exceptions in logs while a job is being killed, the first task that gets closed will delete local directories and any other tasks (if any) which are about to get launched will throw this exception. In this case it is less significant is as the job is killed anyways and only logs get extra exceptions.
Possible solutions :
1. Add an extra method in InetTrackerProtocol for checking for job status before deleting local directory.
2. Set TaskTracker.RunningJob.localized to false once the local directory is deleted so that new tasks don't look for it there.
There is clearly a race condition in this and logs may still get the exception while shutdown but in normal cases it would work.