The job recovery mechanism is intended to solve three kinds of problems:
- If a long-running job fails, it has to be resubmitted as an entirely new job, and all tasks, including the succeeded ones, have to be re-executed.
- If we upgrade a cluster to a new Hadoop version, all running jobs have to be re-run.
- If we restart a TaskTracker, all running tasks and succeeded maps have to be re-executed.
The RecoveryManager of the JobTracker solves part of problem 2. However, it only re-runs all running jobs automatically; all succeeded tasks still have to be re-executed.
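
To make the gap concrete, here is a minimal sketch contrasting a full re-run with a recovery that keeps succeeded tasks. It is an illustration only: `RecoverySketch`, `TaskState`, `resubmitAll`, and `recoverSkippingSucceeded` are hypothetical names, not part of Hadoop or of the RecoveryManager, and the task IDs are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Hadoop APIs.
public class RecoverySketch {
    enum TaskState { SUCCEEDED, RUNNING, PENDING }

    // Full re-run: every task is executed again, whatever its previous state.
    static List<String> resubmitAll(Map<String, TaskState> tasks) {
        return new ArrayList<>(tasks.keySet());
    }

    // Recovery that keeps finished work: only non-succeeded tasks run again.
    static List<String> recoverSkippingSucceeded(Map<String, TaskState> tasks) {
        List<String> toRun = new ArrayList<>();
        for (Map.Entry<String, TaskState> e : tasks.entrySet()) {
            if (e.getValue() != TaskState.SUCCEEDED) {
                toRun.add(e.getKey());
            }
        }
        return toRun;
    }

    public static void main(String[] args) {
        // Made-up task IDs and states recorded before the failure or restart.
        Map<String, TaskState> tasks = new LinkedHashMap<>();
        tasks.put("map_0", TaskState.SUCCEEDED);
        tasks.put("map_1", TaskState.RUNNING);
        tasks.put("reduce_0", TaskState.PENDING);

        System.out.println(resubmitAll(tasks));              // [map_0, map_1, reduce_0]
        System.out.println(recoverSkippingSucceeded(tasks)); // [map_1, reduce_0]
    }
}
```

The first method corresponds to the behavior described above, where succeeded tasks are re-executed; the second shows the behavior a recovery mechanism would need in order to avoid that.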