Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: jobtracker, tasktracker
    • Labels:
      None

      Description

      A job recovery mechanism that lets a job re-execute only its failed tasks after a job failure or a jobtracker/tasktracker restart.

        Activity

        Kang Xiao added a comment -

        The job recovery mechanism is intended to solve three kinds of problems:

        1. If a long-running job fails, it has to be resubmitted as a completely new job, and all tasks, including those that already succeeded, have to be re-executed.
        2. If we upgrade a cluster to a new Hadoop version, all running jobs need to be re-run.
        3. If we restart a tasktracker, all running tasks and succeeded maps need to be re-executed.

        RecoveryManager in the JobTracker solves part of problem 2. However, it only re-runs all running jobs automatically; all succeeded tasks still need to be re-executed.
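
The task-level recovery idea in this comment can be illustrated with a minimal, language-neutral sketch. It is not Hadoop code and does not use any JobTracker or RecoveryManager API: `TaskState`, `tasks_to_rerun`, and the task IDs are hypothetical names chosen for illustration. It also assumes the outputs of succeeded tasks are still available after the failure or restart, which may not hold in the tasktracker-restart case from problem 3.

```python
from enum import Enum


class TaskState(Enum):
    """Hypothetical per-task states recorded for the previous job attempt."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    RUNNING = "running"
    PENDING = "pending"


def tasks_to_rerun(previous_states):
    """Return the IDs of tasks that must run again, in sorted order.

    Only tasks that did not succeed in the previous attempt are selected;
    succeeded tasks are skipped, on the assumption that their outputs
    can still be read.
    """
    return sorted(
        task_id
        for task_id, state in previous_states.items()
        if state is not TaskState.SUCCEEDED
    )


# Hypothetical snapshot of a job at the moment it failed or was interrupted.
previous_attempt = {
    "map_0": TaskState.SUCCEEDED,
    "map_1": TaskState.FAILED,
    "map_2": TaskState.RUNNING,
    "reduce_0": TaskState.PENDING,
}

rerun = tasks_to_rerun(previous_attempt)
print(rerun)
```

In this sketch a whole-job re-run would schedule all four tasks, whereas task-level recovery schedules only the three that had not succeeded.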

        Scott Chen added a comment -

        Kang: Are you guys currently using RecoveryManager on your cluster?

        Kang Xiao added a comment -

        Hi Scott, we are not using RecoveryManager on our cluster. We tested the functionality of RecoveryManager and found that it could not satisfy our need for task-level recovery.


          People

          • Assignee:
            Unassigned
            Reporter:
            Kang Xiao
          • Votes:
            0
            Watchers:
            5