Hadoop Map/Reduce / MAPREDUCE-1682

Tasks should not be scheduled after tip is killed/failed.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: jobtracker
    • Labels: None

      Description

      We have seen the following scenario in our cluster:
      A job was marked failed because four attempts of a TIP failed. This kills all of the job's map and reduce TIPs, after which a job-cleanup attempt is launched.
      The job-cleanup attempt failed because it could not report status for 10 minutes. There were three such job-cleanup attempts, so the job was killed only after half an hour.
      While waiting for the job cleanup to finish, the JobTracker scheduled many tasks of the job on TaskTrackers and then sent a KillTaskAction for them in the next heartbeat.

      This wastes a lot of resources. We should avoid scheduling tasks of a TIP once the TIP is killed/failed.

        Activity

        Amareshwari Sriramadasu added a comment -

        A quick look at the JobInProgress code shows that JobInProgress.findSpeculativeTask() does not perform the tip.isRunnable() check, whereas JobInProgress.findTaskFromList() does perform it in order to skip failed/killed tips.

        The bug is not present in 0.21 or trunk (it was fixed in HADOOP-2141); it exists only in branch 0.20.
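
        The missing guard described above can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not the actual Hadoop source: the classes SpeculationSketch and Tip, the fields failed, killed and slow, and the method bodies are all assumptions made for illustration. Only the names findSpeculativeTask() and isRunnable() come from the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the check; NOT the real Hadoop classes.
class SpeculationSketch {

    /** Stand-in for a task-in-progress (TIP). All fields are illustrative. */
    static class Tip {
        final String id;
        boolean failed;
        boolean killed;
        final boolean slow; // pretend the scheduler judged this TIP slow enough to speculate

        Tip(String id, boolean slow) {
            this.id = id;
            this.slow = slow;
        }

        // Assumed semantics: a TIP is runnable only while it is neither failed nor killed.
        boolean isRunnable() {
            return !failed && !killed;
        }
    }

    /** Returns the first runnable TIP that qualifies for speculation, or null if there is none. */
    static Tip findSpeculativeTask(List<Tip> tips) {
        for (Tip tip : tips) {
            if (!tip.isRunnable()) {
                continue; // the guard the comment reports as missing in branch 0.20
            }
            if (tip.slow) {
                return tip;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Tip> tips = new ArrayList<>();
        Tip killedTip = new Tip("tip_killed", true);
        killedTip.killed = true;
        Tip liveTip = new Tip("tip_live", true);
        tips.add(killedTip);
        tips.add(liveTip);

        // The killed TIP is skipped even though it looks slow.
        System.out.println(findSpeculativeTask(tips).id);
    }
}
```

        Without the isRunnable() guard, the killed TIP in this model would be selected first, which corresponds to the launch-then-kill pattern reported in this issue.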

        Vinod Kumar Vavilapalli added a comment -

        The above code bug is also responsible for some corner cases in which a job never finishes. We saw scenarios in which speculative attempts were launched and then killed within seconds. This repeats forever and the job never ends.

        Todd Lipcon added a comment -

        Attaching YDH patch by Arun (patch does what comment suggests above)

        Allen Wittenauer added a comment -

        Likely fixed.


          People

          • Assignee: Arun C Murthy
          • Reporter: Amareshwari Sriramadasu
          • Votes: 0
          • Watchers: 5