Hadoop Common / HADOOP-3370

failed tasks may stay forever in TaskTracker.runningJobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.2
    • Component/s: None
    • Labels:
      None

      Description

      The net effect is that on a long-running TaskTracker, ReduceTasks take a very long time to fetch map outputs: the TaskTracker requests map output locations for every reduce task in TaskTracker.runningJobs, including stale ReduceTasks. There is a 5-second delay between two requests, so with tens of stale ReduceTasks a running ReduceTask waits a long time for its map output locations. The stale entries also consume memory, but at this rate that is not a big problem.
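      As a rough illustration of the delay, the sketch below assumes that requests are issued one at a time, round-robin over every reduce task in runningJobs, with the 5-second gap mentioned above. The class and method names are hypothetical and are not Hadoop code. Under that assumption, 1 live and 30 stale ReduceTasks give the live task one request roughly every 155 seconds.

```java
// Hypothetical back-of-the-envelope model; not taken from the Hadoop source.
final class StaleFetchDelay {
    /** Seconds between two consecutive requests, as stated in the description. */
    static final int DELAY_SECONDS = 5;

    /**
     * Assumption: requests are issued sequentially, round-robin over all
     * reduce tasks listed in runningJobs, with DELAY_SECONDS between them.
     * Returns the time one task waits between two of its own requests.
     */
    static int secondsBetweenTurns(int liveReduceTasks, int staleReduceTasks) {
        return (liveReduceTasks + staleReduceTasks) * DELAY_SECONDS;
    }
}
```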

      I verified the bug by adding an HTML table for TaskTracker.runningJobs to the TaskTracker HTTP interface and running, on a 2-node machine, a job with a single mapper and a single reducer in which the mapper succeeds and the reducer fails. The ReduceTask is still listed in TaskTracker.runningJobs even though it no longer appears in the first two tables (TaskTracker.tasks and TaskTracker.runningTasks).

      Details:

      TaskRunner.run() will call TaskTracker.reportTaskFinished() when the task fails,
      which calls TaskTracker.TaskInProgress.taskFinished,
      which calls TaskTracker.TaskInProgress.cleanup(),
      which calls TaskTracker.tasks.remove(taskId).

      In short, it removes a failed task from TaskTracker.tasks, but not from TaskTracker.runningJobs.

      Then the failure is reported to JobTracker.

      JobTracker.heartbeat will call processHeartbeat,
      which calls updateTaskStatuses,
      which calls tip.getJob().updateTaskStatus,
      which calls JobInProgress.failedTask,
      which calls JobTracker.markCompletedTaskAttempt,
      which puts the task to trackerToMarkedTasksMap,

      and then JobTracker.heartbeat will call removeMarkedTasks,
      which calls removeTaskEntry,
      which removes it from trackerToTaskMap.

      JobTracker.heartbeat will also call JobTracker.getTasksToKill,
      which reads <tracker, task> pairs from trackerToTaskMap
      and asks the tracker to KILL the task or the job of the task.

      If a job has only one task on a given tracker and that task fails
      (NOTE: provided it is not the last failed attempt of the job; otherwise
      JobTracker.getTasksToKill picks it up before removeMarkedTasks removes
      it from trackerToTaskMap), the TaskTracker never receives the KILL task
      or KILL job message from the JobTracker. As a result, the task remains
      in TaskTracker.runningJobs forever.

      Solution:
      Remove the task from TaskTracker.runningJobs at the same time as it is removed from TaskTracker.tasks.
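      A minimal sketch of the idea, assuming a much-simplified TaskTracker in which tasks maps a task id to its job id and runningJobs maps a job id to its task ids. The class, its fields' types, and both cleanup methods are hypothetical; the attached patches are not reproduced here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, heavily simplified stand-in for the TaskTracker bookkeeping.
final class TaskTrackerModel {
    final Map<String, String> tasks = new HashMap<>();            // taskId -> jobId
    final Map<String, Set<String>> runningJobs = new HashMap<>(); // jobId -> taskIds

    void launch(String jobId, String taskId) {
        tasks.put(taskId, jobId);
        runningJobs.computeIfAbsent(jobId, k -> new HashSet<>()).add(taskId);
    }

    // Behaviour described in the report: only `tasks` is cleaned up.
    void cleanupBuggy(String taskId) {
        tasks.remove(taskId);
    }

    // Proposed behaviour: drop the task from both structures together.
    // The job entry itself is kept, even if it no longer has any tasks.
    void cleanupFixed(String taskId) {
        String jobId = tasks.remove(taskId);
        if (jobId == null) {
            return; // unknown or already cleaned up
        }
        Set<String> jobTasks = runningJobs.get(jobId);
        if (jobTasks != null) {
            jobTasks.remove(taskId);
        }
    }
}
```

      With cleanupBuggy, a failed task stays listed under its job in runningJobs; with cleanupFixed, it disappears from both structures in one step.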

      1. patch-3370-0.17.txt (2 kB, Amareshwari Sriramadasu)
      2. 3370-4.patch (2 kB, Zheng Shao)
      3. 3370-3.patch (2 kB, Zheng Shao)
      4. 3370-2.patch (2 kB, Zheng Shao)
      5. 3370-1.patch (3 kB, Zheng Shao)

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

          Arun C Murthy added a comment -

          I merged this into branch-0.17 also.

          Amareshwari Sriramadasu added a comment -

          All the tests passed on branch 0.17, on my machine

          Amareshwari Sriramadasu added a comment -

          Patch for 0.17

          Arun C Murthy added a comment -

          I just committed this. Thanks, Zheng!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12382024/3370-4.patch
          against trunk revision 656270.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2466/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2466/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2466/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2466/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment - - edited

          Looks like the failure is due to logger initialization warnings.

          -------------------- DEBUG OUT---------------------
          Test Script
          Bailing out
          log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.TaskRunner).
          log4j:WARN Please initialize the log4j system properly.
          
          Arun C Murthy added a comment -

          Looks like org.apache.hadoop.mapred.TestMiniMRMapRedDebugScript.testMapDebugScript failed:

          junit.framework.ComparisonFailure: expected:<...> but was:<...
          

          Looks like it might have broken the feature where you can add a debug script for your map/reduce tasks; see TestMiniMRMapRedDebugScript for an example.
          Does it succeed on your machine?

          Zheng Shao added a comment -

          Can somebody help me restart the Hudson test? There seem to be some transient errors.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12382014/3370-3.patch
          against trunk revision 656122.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2462/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2462/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2462/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2462/console

          This message is automatically generated.

          Zheng Shao added a comment -

          My fault. I regenerated the patch.

          dhruba borthakur added a comment -

          Thanks Arun for reviewing this one. Really appreciate it.

          I think the HadoopQA patch process had some problem applying and merging this patch with trunk. Maybe I will cancel and re-submit this issue to trigger another Hudson test.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12382012/3370-2.patch
          against trunk revision 656122.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2460/console

          This message is automatically generated.

          Zheng Shao added a comment -

          1. removed commented code;
          2. removed extra "this." where not necessary.

          I will put the "KILLJOB" change in a separate issue; it helps clean out the job's local dirs in corner cases.

          Arun C Murthy added a comment -

          Zheng, apologies for being late to get to this. A couple of comments:

          1. Please do not comment out code which is no longer required; just delete it.
          2. HADOOP-3297 changed the way we get TaskCompletionEvents; it is no longer once in 5s. Just FYI.
          3. If you don't mind, please do not use this.<func> whenever calling <func> suffices.
          4. As you mentioned, the other option is to send a KillJobAction to all trackers on which tasks ran at the end of the job. This is a really useful feature and it would make me very happy if you took that route! However, I won't hold it against this patch; we could do it as a separate issue.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381804/3370-1.patch
          against trunk revision 654973.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2444/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2444/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2444/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2444/console

          This message is automatically generated.

          Zheng Shao added a comment -

          Simple fix. I also included code to show the content of runningTasks on the HTTP interface.

          Zheng Shao added a comment -

          Details about a potential solution:
          1. On a failed task, remove the task from runningJobs, but do not delete the job's entry in runningJobs even if it is the only task of the job (which means we should NOT call TaskTracker.removeTaskFromJob).

          2. The JobTracker should keep another data structure, jobsToTracker, recording all the TaskTrackers on which a job has started a task.

          3. When the job finishes, the JobTracker sends a "KILL" job command to those TaskTrackers, based on the jobsToTracker data structure.

          An alternative:
          On a failed task, remove the task from runningJobs, AND if it is the only task of the job, remove the job directory (which means we should call TaskTracker.removeTaskFromJob, PLUS delete the job directory).
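          Steps 2 and 3 of the first proposal could look roughly like the sketch below. Only the name jobsToTracker comes from the comment above; the class, the method names, and the use of String ids are assumptions for illustration and do not reflect the committed patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed jobsToTracker bookkeeping.
final class JobToTrackersModel {
    // jobId -> every TaskTracker on which the job has started a task
    final Map<String, Set<String>> jobsToTracker = new HashMap<>();

    // Record that a task of this job was started on the given tracker.
    void taskStarted(String jobId, String tracker) {
        jobsToTracker.computeIfAbsent(jobId, k -> new LinkedHashSet<>()).add(tracker);
    }

    // Returns the trackers that should receive a KILL job command
    // and forgets the job afterwards.
    Set<String> jobFinished(String jobId) {
        Set<String> trackers = jobsToTracker.remove(jobId);
        return trackers == null ? Collections.<String>emptySet() : trackers;
    }
}
```

          Because the set is keyed by job rather than by task, a tracker is remembered even after its only task for that job has failed and been removed elsewhere.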


            People

            • Assignee: Zheng Shao
            • Reporter: Zheng Shao
            • Votes: 0
            • Watchers: 4
