Hadoop Common / HADOOP-3370

failed tasks may stay forever in TaskTracker.runningJobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.2
    • Component/s: None
    • Labels: None

      Description

      The net effect is that, on a long-running TaskTracker, ReduceTasks on that TaskTracker take a very long time to fetch map outputs: the TaskTracker does this for every reduce task in TaskTracker.runningJobs, including the stale ReduceTasks. There is a 5-second delay between two requests, so when there are tens of stale ReduceTasks, a running ReduceTask waits a long time to get its map output locations. This also wastes memory, but at that rate it is not a big problem.
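To put a rough number on that delay, here is a back-of-the-envelope sketch. It assumes the requests are issued one reduce task at a time with the full 5-second gap between them; the class and method names are hypothetical and not part of Hadoop.

```java
// Rough estimate only, not Hadoop code. Assumes one request per reduce task
// in runningJobs, with the full 5-second gap between consecutive requests.
final class FetchDelayEstimate {
    static final int DELAY_SECONDS = 5; // the delay quoted in the description

    // Approximate seconds a live reduce task waits while the stale entries
    // ahead of it are serviced.
    static int approxWaitSeconds(int staleReduceTasks) {
        return staleReduceTasks * DELAY_SECONDS;
    }

    public static void main(String[] args) {
        // Example: 30 stale reduce tasks left behind on one TaskTracker.
        System.out.println(approxWaitSeconds(30) + " seconds");
    }
}
```

Under these assumptions, 30 stale entries cost a live reducer about 150 seconds per round of requests.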

      I've verified the bug by adding an HTML table for TaskTracker.runningJobs to the TaskTracker HTTP interface, on a 2-node machine, running a job with a single mapper and a single reducer in which the mapper succeeds and the reducer fails. The ReduceTask is still visible in TaskTracker.runningJobs, although it is no longer in the first 2 tables (TaskTracker.tasks and TaskTracker.runningTasks).

      Details:

      TaskRunner.run() will call TaskTracker.reportTaskFinished() when the task fails,
      which calls TaskTracker.TaskInProgress.taskFinished,
      which calls TaskTracker.TaskInProgress.cleanup(),
      which calls TaskTracker.tasks.remove(taskId).

      In short, this removes a failed task from TaskTracker.tasks, but not from TaskTracker.runningJobs.
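The bookkeeping mismatch described above can be illustrated with a toy model. This is a minimal sketch, not the real TaskTracker: only the field names `tasks` and `runningJobs` come from the report, while the class, the method names, and the map layouts are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model, not Hadoop code. Only the field names mirror the report.
final class ToyTaskTracker {
    final Map<String, String> tasks = new HashMap<>();            // taskId -> jobId
    final Map<String, Set<String>> runningJobs = new HashMap<>(); // jobId -> taskIds

    void launch(String jobId, String taskId) {
        tasks.put(taskId, jobId);
        runningJobs.computeIfAbsent(jobId, k -> new HashSet<>()).add(taskId);
    }

    // Mirrors the reported behavior of cleanup(): only `tasks` is touched.
    void cleanupAfterFailure(String taskId) {
        tasks.remove(taskId);
    }

    // A task is stale when it is gone from `tasks` but still in `runningJobs`.
    boolean isStale(String jobId, String taskId) {
        Set<String> ids = runningJobs.get(jobId);
        return !tasks.containsKey(taskId) && ids != null && ids.contains(taskId);
    }
}
```

In this model, calling `launch("job_1", "reduce_0")` followed by `cleanupAfterFailure("reduce_0")` leaves `isStale("job_1", "reduce_0")` returning true, which is the leftover entry the report describes.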

      Then the failure is reported to JobTracker.

      JobTracker.heartbeat will call processHeartbeat,
      which calls updateTaskStatuses,
      which calls tip.getJob().updateTaskStatus,
      which calls JobInProgress.failedTask,
      which calls JobTracker.markCompletedTaskAttempt,
      which puts the task into trackerToMarkedTasksMap,

      and then JobTracker.heartbeat will call removeMarkedTasks,
      which calls removeTaskEntry,
      which removes it from trackerToTaskMap.

      JobTracker.heartbeat will also call JobTracker.getTasksToKill,
      which reads from trackerToTaskMap for <tracker, task> pairs,
      and asks the tracker to KILL the task or the job of the task.

      When there is only one task for a specific job on a specific tracker
      and that task fails (NOTE: provided that task is not the last failed try of
      the job; otherwise JobTracker.getTasksToKill will pick it up before
      removeMarkedTasks removes it from trackerToTaskMap), the TaskTracker
      will not receive the KILL task or KILL job message from the JobTracker.
      As a result, the task will remain in TaskTracker.runningJobs forever.
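The JobTracker side of the sequence can be sketched the same way. This is a heavily simplified toy model, assuming kill messages are derived only from entries still present in trackerToTaskMap. The map names come from the report; `taskToJob`, `finishedJobs`, and the method bodies are hypothetical, and the real getTasksToKill logic is more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model, not Hadoop code. Map names mirror the report; the rest is assumed.
final class ToyJobTracker {
    final Map<String, Set<String>> trackerToTaskMap = new HashMap<>();
    final Map<String, Set<String>> trackerToMarkedTasksMap = new HashMap<>();
    final Map<String, String> taskToJob = new HashMap<>(); // hypothetical helper
    final Set<String> finishedJobs = new HashSet<>();      // hypothetical helper

    void assign(String tracker, String jobId, String taskId) {
        trackerToTaskMap.computeIfAbsent(tracker, k -> new HashSet<>()).add(taskId);
        taskToJob.put(taskId, jobId);
    }

    void markCompletedTaskAttempt(String tracker, String taskId) {
        trackerToMarkedTasksMap.computeIfAbsent(tracker, k -> new HashSet<>()).add(taskId);
    }

    // Drops every marked task of this tracker from trackerToTaskMap.
    void removeMarkedTasks(String tracker) {
        Set<String> marked = trackerToMarkedTasksMap.remove(tracker);
        Set<String> known = trackerToTaskMap.get(tracker);
        if (marked == null || known == null) return;
        known.removeAll(marked);
    }

    // Only tasks still listed in trackerToTaskMap can be named in a kill message.
    List<String> getTasksToKill(String tracker) {
        List<String> toKill = new ArrayList<>();
        Set<String> known = trackerToTaskMap.get(tracker);
        if (known == null) return toKill;
        for (String taskId : known) {
            if (finishedJobs.contains(taskToJob.get(taskId))) toKill.add(taskId);
        }
        return toKill;
    }
}
```

In this model, once a failed task has been marked and removed, a later `getTasksToKill` for that tracker returns nothing for the job, even after the job finishes, so the tracker is never told to clean up.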

      Solution:
      Remove the task from TaskTracker.runningJobs at the same time as we remove it from TaskTracker.tasks.
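A minimal sketch of the proposed solution, using a self-contained toy model rather than the real TaskTracker. The attached patches may implement this differently; the class name, the map layouts, and the choice to drop a job entry once its last task is gone are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed fix, not the actual patch.
final class FixedToyTaskTracker {
    final Map<String, String> tasks = new HashMap<>();            // taskId -> jobId
    final Map<String, Set<String>> runningJobs = new HashMap<>(); // jobId -> taskIds

    void launch(String jobId, String taskId) {
        tasks.put(taskId, jobId);
        runningJobs.computeIfAbsent(jobId, k -> new HashSet<>()).add(taskId);
    }

    // Removes the task from both structures in one step.
    void cleanup(String taskId) {
        String jobId = tasks.remove(taskId);
        if (jobId == null) return; // unknown or already cleaned up
        Set<String> ids = runningJobs.get(jobId);
        if (ids == null) return;
        ids.remove(taskId);
        // Assumption: a job with no remaining tasks is dropped entirely.
        if (ids.isEmpty()) runningJobs.remove(jobId);
    }
}
```

With this change, a failed task no longer depends on a later KILL message from the JobTracker to disappear from runningJobs.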

      1. 3370-1.patch
        3 kB
        Zheng Shao
      2. 3370-2.patch
        2 kB
        Zheng Shao
      3. 3370-3.patch
        2 kB
        Zheng Shao
      4. 3370-4.patch
        2 kB
        Zheng Shao
      5. patch-3370-0.17.txt
        2 kB
        Amareshwari Sriramadasu

          Activity

          Zheng Shao created issue
          Zheng Shao made changes -
          Field Original Value New Value
          Attachment 3370-1.patch [ 12381804 ]
          Zheng Shao made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Zheng Shao made changes -
          Assignee Zheng Shao [ zshao ]
          Arun C Murthy made changes -
          Affects Version/s 0.17.0 [ 12312913 ]
          Fix Version/s 0.18.0 [ 12312972 ]
          Status Patch Available [ 10002 ] Open [ 1 ]
          Zheng Shao made changes -
          Attachment 3370-2.patch [ 12382012 ]
          Zheng Shao made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Zheng Shao made changes -
          Link This issue relates to HADOOP-3386 [ HADOOP-3386 ]
          Zheng Shao made changes -
          Attachment 3370-3.patch [ 12382013 ]
          Zheng Shao made changes -
          Attachment 3370-3.patch [ 12382013 ]
          Zheng Shao made changes -
          Attachment 3370-3.patch [ 12382014 ]
          Zheng Shao made changes -
          Status Patch Available [ 10002 ] In Progress [ 3 ]
          Zheng Shao made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Zheng Shao made changes -
          Attachment 3370-4.patch [ 12382024 ]
          Zheng Shao made changes -
          Status Patch Available [ 10002 ] In Progress [ 3 ]
          Zheng Shao made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Arun C Murthy made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Amareshwari Sriramadasu made changes -
          Attachment patch-3370-0.17.txt [ 12385574 ]
          Amareshwari Sriramadasu made changes -
          Link This issue incorporates HADOOP-3713 [ HADOOP-3713 ]
          Arun C Murthy made changes -
          Fix Version/s 0.18.0 [ 12312972 ]
          Fix Version/s 0.17.2 [ 12313296 ]
          Owen O'Malley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]

            People

            • Assignee: Zheng Shao
            • Reporter: Zheng Shao
            • Votes: 0
            • Watchers: 4