Hadoop Map/Reduce / MAPREDUCE-873

Simplify Job Recovery

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: jobtracker
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note: Simplifies job recovery. On jobtracker restart, incomplete jobs are resubmitted and all tasks reexecute. This JIRA removes a public constructor in JobInProgress.

    Description

      On a couple of occasions we have seen the JobTracker fail to handle job recovery well, leading to cluster downtime after a restart. The current design for job recovery is complex and prone to corner cases that are not handled well enough. In retrospect, the transaction-log-based approach proposed in HADOOP-3245 (http://tinyurl.com/luh9hb) would have been a better and simpler model. However, that is a big project, and for the medium term, simply resubmitting jobs after a restart seems like a good tradeoff. That is, after a restart the JobTracker will resubmit all jobs that were running in its past life. They will all start from the beginning (the downside is that completed tasks will re-execute). In the long term, the transaction log model or some variant of it should be pursued.
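
To make the proposed behaviour concrete, here is a minimal Java sketch of the resubmit-on-restart idea. It is not taken from the attached patches and does not use Hadoop's classes: `RecoverySketch`, `JobRecord`, `State`, `isIncomplete`, and `jobsToResubmit` are hypothetical names invented for illustration. The sketch also assumes a job keeps its ID when resubmitted, purely for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: these types are hypothetical, not Hadoop classes.
public class RecoverySketch {

    enum State { PREP, RUNNING, SUCCEEDED, FAILED, KILLED }

    // Hypothetical record of a job the tracker knew about before it went down.
    static final class JobRecord {
        final String jobId;
        final State state;
        final int completedTasks;

        JobRecord(String jobId, State state, int completedTasks) {
            this.jobId = jobId;
            this.state = state;
            this.completedTasks = completedTasks;
        }
    }

    // A job counts as incomplete if it had not reached a terminal state.
    static boolean isIncomplete(JobRecord job) {
        return job.state == State.PREP || job.state == State.RUNNING;
    }

    // Builds the list of jobs to resubmit after a restart.
    // Prior progress is deliberately discarded: every resubmitted job starts
    // over, so tasks that had already completed will run again.
    static List<JobRecord> jobsToResubmit(List<JobRecord> known) {
        List<JobRecord> resubmitted = new ArrayList<>();
        for (JobRecord job : known) {
            if (isIncomplete(job)) {
                // Assumption: the job ID is kept; only the progress is reset.
                resubmitted.add(new JobRecord(job.jobId, State.PREP, 0));
            }
        }
        return resubmitted;
    }

    public static void main(String[] args) {
        List<JobRecord> known = Arrays.asList(
            new JobRecord("job_1", State.RUNNING, 40),
            new JobRecord("job_2", State.SUCCEEDED, 100),
            new JobRecord("job_3", State.PREP, 0));
        for (JobRecord job : jobsToResubmit(known)) {
            System.out.println(job.jobId + " " + job.state
                + " completedTasks=" + job.completedTasks);
        }
    }
}
```

The point of the sketch is the tradeoff described above: no per-task state has to be reconstructed after a restart, at the cost of repeating work that had already finished.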

      Thoughts/comments welcome.

      Attachments

        1. 873_v1.patch
          69 kB
          Sharad Agarwal
        2. 873_v2.patch
          101 kB
          Sharad Agarwal
        3. 873_v3.patch
          100 kB
          Sharad Agarwal

          People

            Assignee: Sharad Agarwal (sharadag)
            Reporter: Devaraj Das (ddas)
            Votes: 0
            Watchers: 10
