Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Simplifies job recovery. On jobtracker restart, incomplete jobs are resubmitted and all tasks reexecute.
      This JIRA removes a public constructor in JobInProgress.

      Description

      On a couple of occasions we have seen the JobTracker fail to handle job recovery well, leading to cluster downtime after a restart. The current design for job recovery is complex and prone to unhandled corner cases. In retrospect, the transaction-log-based approach proposed in HADOOP-3245 (http://tinyurl.com/luh9hb) would have been a better and simpler model. However, that is a big project, and for the medium term, simply handling job resubmission after a restart seems a good tradeoff. That is, after a restart the JobTracker will resubmit all jobs that were running in its previous life. They will all start from the beginning (the downside is that completed tasks will reexecute). In the long term, the transaction log model or some variant of it should be pursued.

      Thoughts/comments welcome.

      1. 873_v1.patch
        69 kB
        Sharad Agarwal
      2. 873_v2.patch
        101 kB
        Sharad Agarwal
      3. 873_v3.patch
        100 kB
        Sharad Agarwal

        Activity

        Transition                   Time In Source Status  Execution Times  Last Executer   Last Execution Date
        Patch Available -> Open      1d 1h 47m              1                Sharad Agarwal  26/Aug/09 10:25
        Open -> Patch Available      10d 18h 50m            2                Sharad Agarwal  26/Aug/09 10:26
        Patch Available -> Resolved  5d 23h 18m             1                Sharad Agarwal  01/Sep/09 09:45
        Resolved -> Closed           357d 12h 30m           1                Tom White       24/Aug/10 22:15
        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #70 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/70/)

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #9 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/9/)
        . Moving the CHANGES.txt comment to Incompatible section.
        . Simplify job recovery. InComplete jobs are resubmitted on jobtracker restart. Contributed by Sharad Agarwal.

        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Incompatible change] [Incompatible change, Reviewed]
        Release Note Simplifies job recovery. On jobtracker restart, incomplete jobs are resubmitted and all tasks reexecute.
        This JIRA removes a public constructor in JobInProgress.
        Resolution Fixed [ 1 ]
        Sharad Agarwal added a comment -

        I just committed this.

        Devaraj Das added a comment -

        +1

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12417718/873_v3.patch
        against trunk revision 808082.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 21 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/522/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/522/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/522/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/522/console

        This message is automatically generated.

        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal made changes -
        Attachment 873_v3.patch [ 12417718 ]
        Sharad Agarwal added a comment -

        Fixed the TestJobHistory failure. Incorporated Devaraj's offline comments to keep the retry loop for recoveryManager.updateRestartCount() in JobTracker#offerService().
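
        A retry loop of the kind mentioned here might be sketched as follows. This is only an illustration under stated assumptions: the bounded attempt count, the fixed sleep, and the names RetrySketch, IoAction and retry are all hypothetical, and the actual loop in the JobTracker may behave differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; none of these names come from the patch or from Hadoop.
final class RetrySketch {

    /** An action that may fail with a transient IOException. */
    interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the action until it succeeds or maxAttempts is exhausted.
     * Returns the number of attempts used on success.
     */
    static int retry(IoAction action, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        // Every attempt failed; surface the last failure as an unchecked exception.
        throw new UncheckedIOException("all " + maxAttempts + " attempts failed", last);
    }
}
```

        A caller would wrap the fallible step in a lambda, for example an update of a persisted restart counter, and choose the attempt limit and sleep interval to suit the storage being written.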

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12417579/873_v2.patch
        against trunk revision 807165.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 21 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/514/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/514/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/514/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/514/console

        This message is automatically generated.

        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop Flags [Incompatible change]
        Sharad Agarwal made changes -
        Attachment 873_v2.patch [ 12417579 ]
        Sharad Agarwal added a comment -

        Patch for review. It does the following:

        • Recovery no longer depends on job history. The logic to replay history events is removed.
        • Jobs are recovered based on the job files present in the mapred system dir.
        • The job info file containing the jobtracker restart count is retained, as it is required to avoid task attempt ID clashes for recovered jobs.
        • When the jobtracker comes up, the job history files from the last run are moved to "mapred.job.tracker.history.completed.location" with the suffix "." + jtIdentifier + ".old" added. This avoids overwriting the history files of recovered jobs.
        • TestJobTrackerSafeMode, TestJobTrackerRestart and TestJobTrackerRestartWithLostTracker are removed.
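
        The history file renaming described in this comment can be illustrated with a small sketch. It is not the patch's code: the class and method names are hypothetical, and only the suffix rule ("." + jtIdentifier + ".old") is taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the patch or of Hadoop's API.
final class OldHistoryNames {

    /** Builds the archived name: original name + "." + jtIdentifier + ".old". */
    static String archivedName(String historyFileName, String jtIdentifier) {
        return historyFileName + "." + jtIdentifier + ".old";
    }

    /** Applies the same rule to every history file name from the previous run. */
    static List<String> archivedNames(List<String> historyFileNames, String jtIdentifier) {
        List<String> renamed = new ArrayList<>();
        for (String name : historyFileNames) {
            renamed.add(archivedName(name, jtIdentifier));
        }
        return renamed;
    }
}
```

        Because the suffix carries the identifier of the jobtracker run, a history file written for a recovered job under its original name does not collide with the archived copy.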

        Sharad Agarwal made changes -
        Field Original Value New Value
        Attachment 873_v1.patch [ 12417145 ]
        Sharad Agarwal added a comment -

        Early patch. Testing in progress. It:

        • removes the old recovery logic.
        • performs recovery by submitting the jobIds from the mapred.system dir to JobTracker#submitJob.
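
        The resubmission approach can be sketched roughly as below. This is an illustration under assumptions, not the patch itself: a local java.io.File directory stands in for the mapred system dir, each job is assumed to have a subdirectory whose name is its job ID (starting with "job_"), and the submitter callback is a hypothetical stand-in for JobTracker#submitJob.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; class and method names are not from the patch.
final class ResubmitSketch {

    /** Lists the job IDs found as "job_*" subdirectories of the system dir, sorted. */
    static List<String> findJobIds(File systemDir) {
        List<String> ids = new ArrayList<>();
        File[] entries = systemDir.listFiles();
        if (entries == null) {
            return ids; // not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory() && entry.getName().startsWith("job_")) {
                ids.add(entry.getName());
            }
        }
        Collections.sort(ids); // deterministic resubmission order
        return ids;
    }

    /** Hands every job ID found to the submitter and returns how many were resubmitted. */
    static int resubmitAll(File systemDir, Consumer<String> submitter) {
        List<String> ids = findJobIds(systemDir);
        for (String id : ids) {
            submitter.accept(id);
        }
        return ids.size();
    }
}
```

        No per-task state is consulted in this sketch, which matches the tradeoff stated in the description: resubmitted jobs start from the beginning.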
        Devaraj Das created issue -

          People

          • Assignee:
            Sharad Agarwal
            Reporter:
            Devaraj Das
          • Votes:
            0
            Watchers:
            10
