Hadoop Common / HADOOP-737

TaskTracker's job cleanup loop should check for finished job before deleting local directories

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels: None

    Description

      TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks that should be closed. This mechanism does not convey whether the job has really finished or the task is being killed for some other reason (for example, a speculative instance succeeded). Because the TaskTracker does not know which case applies, it assumes the job is finished and deletes the local job directory. Any subsequent tasks for the same job on the same TaskTracker then fail with a "job.xml not found" exception, as reported in HADOOP-546 and possibly in HADOOP-543. In some cases this causes my patch for HADOOP-76 to fail for a large number of reduce tasks.

      The same problem produces extra exceptions in the logs while a job is being killed: the first task that gets closed deletes the local directories, and any other tasks that are about to be launched throw this exception. This case is less significant, as the job is being killed anyway and the only effect is extra exceptions in the logs.

      Possible solutions:
      1. Add an extra method to InterTrackerProtocol for checking the job status before deleting the local directory.
      2. Set TaskTracker.RunningJob.localized to false once the local directory is deleted so that new tasks don't look for it there.

      There is clearly a race condition in this, and the logs may still show the exception during shutdown, but in normal cases it would work.
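
      The two options above could be combined roughly as in the sketch below. It is only an illustration of the idea, not the TaskTracker code: JobCleanupSketch, JobStatusSource, isJobComplete(), localizeJob(), onTaskClosed() and isLocalized() are hypothetical names, and the actual directory deletion is left out. The methods are synchronized only to keep this toy class consistent; that does not remove the race with tasks that are launched concurrently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed cleanup check; none of these names
// are taken from the Hadoop source.
class JobCleanupSketch {

    /** Stand-in for the extra status query suggested in option 1. */
    interface JobStatusSource {
        boolean isJobComplete(String jobId);
    }

    /** Stand-in for the per-job state kept by the tracker (option 2). */
    static final class RunningJob {
        boolean localized = true;
    }

    private final Map<String, RunningJob> runningJobs = new HashMap<>();
    private final JobStatusSource statusSource;

    JobCleanupSketch(JobStatusSource statusSource) {
        this.statusSource = statusSource;
    }

    /** Records that the job's files (job.xml and so on) are present locally. */
    synchronized void localizeJob(String jobId) {
        runningJobs.computeIfAbsent(jobId, k -> new RunningJob()).localized = true;
    }

    /**
     * Called when a task of the given job is closed.
     * Returns true only if the local job directory would be removed.
     */
    synchronized boolean onTaskClosed(String jobId) {
        RunningJob job = runningJobs.get(jobId);
        if (job == null || !statusSource.isJobComplete(jobId)) {
            // The task was closed for another reason (for example, a
            // speculative instance succeeded), so keep the job directory.
            return false;
        }
        // Option 2: mark the job as no longer localized so that later tasks
        // re-localize instead of looking for a deleted job.xml.
        job.localized = false;
        runningJobs.remove(jobId);
        // A real implementation would delete the local job directory here.
        return true;
    }

    /** True while the job's local files are still considered present. */
    synchronized boolean isLocalized(String jobId) {
        RunningJob job = runningJobs.get(jobId);
        return job != null && job.localized;
    }
}
```

      With this shape, closing a task of a job that is still running leaves the local directory alone; cleanup happens only once the status source reports the job as complete.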

      Comments?

            People

              Assignee: Arun Murthy (acmurthy)
              Reporter: Sanjay Dahiya (sanjay.dahiya)