[MAPREDUCE-873] Simplify Job Recovery - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.20.1
Fix Version/s: 0.21.0
Component/s: jobtracker
Labels: None
Hadoop Flags: Incompatible change, Reviewed
Release Note:
Simplifies job recovery. On jobtracker restart, incomplete jobs are resubmitted and all tasks reexecute.
This JIRA removes a public constructor in JobInProgress.

Description

On a couple of occasions we have seen the JobTracker fail to handle job recovery well, leading to cluster downtime after a restart. The current design for job recovery is complex and prone to corner cases that are not handled well. In retrospect, the transaction-log-based approach proposed on ~~HADOOP-3245~~ (http://tinyurl.com/luh9hb) would have been a better and simpler model. However, that is a big project, and for the medium term simply resubmitting jobs after a restart seems like a good tradeoff. That is, after a restart the JobTracker will resubmit all jobs that were running in its past life. They will all start from the beginning (the downside is that completed tasks will reexecute). In the long term, the transaction log model or some variant of it should be pursued.
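
To make the resubmission idea concrete, here is a minimal sketch in plain Java. It is not the patch and not JobTracker code: RestartResubmitSketch, JobRecord, State and resubmitIncomplete are hypothetical names invented for illustration, and reusing the old job ID for the resubmitted job is an assumption made only to keep the output readable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types exist in Hadoop.
final class RestartResubmitSketch {

    enum State { RUNNING, SUCCEEDED, FAILED }

    // What the restarted tracker is assumed to know about a job from its past life.
    static final class JobRecord {
        final String jobId;
        final State state;
        final int completedTasks;

        JobRecord(String jobId, State state, int completedTasks) {
            this.jobId = jobId;
            this.state = state;
            this.completedTasks = completedTasks;
        }
    }

    // Resubmits every job that was still running before the restart.
    // Finished jobs are skipped; per-task progress is discarded, so a
    // resubmitted job starts from the beginning with zero completed tasks.
    static List<JobRecord> resubmitIncomplete(List<JobRecord> beforeRestart) {
        List<JobRecord> resubmitted = new ArrayList<>();
        for (JobRecord job : beforeRestart) {
            if (job.state == State.RUNNING) {
                resubmitted.add(new JobRecord(job.jobId, State.RUNNING, 0));
            }
        }
        return resubmitted;
    }

    public static void main(String[] args) {
        List<JobRecord> beforeRestart = Arrays.asList(
            new JobRecord("job_1", State.RUNNING, 40),
            new JobRecord("job_2", State.SUCCEEDED, 100),
            new JobRecord("job_3", State.RUNNING, 7));

        for (JobRecord job : resubmitIncomplete(beforeRestart)) {
            System.out.println(job.jobId + " resubmitted, completed tasks reset to "
                + job.completedTasks);
        }
    }
}
```

The point of the sketch is the tradeoff described above: no per-task state has to be replayed or reconciled after a restart, at the cost of redoing work that had already finished.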

Thoughts/comments welcome.

Attachments

873_v3.patch, 26/Aug/09 09:24, 100 kB, Sharad Agarwal
873_v2.patch, 25/Aug/09 07:35, 101 kB, Sharad Agarwal
873_v1.patch, 20/Aug/09 16:15, 69 kB, Sharad Agarwal

Sub-Tasks

TestNodeRefresh timesout occasionally, Closed, Amar Kamat

People

Assignee: Sharad Agarwal
Reporter: Devaraj Das
Votes: 0
Watchers: 10

Dates

Created: 14/Aug/09 12:47
Updated: 24/Aug/10 21:15
Resolved: 01/Sep/09 08:45