Hadoop Map/Reduce
MAPREDUCE-3837

Job tracker is not able to recover jobs after a crash, and afterwards no user can submit jobs.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0, 1.1.1
    • Fix Version/s: 1.1.0, 0.22.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      If the JobTracker crashes while jobs are running, and the JobTracker property mapreduce.jobtracker.restart.recover is set to true, then it should recover those jobs on restart.

      However, the current behavior is as follows: the JobTracker tries to restore the jobs but cannot. It then closes its handle to HDFS, and nobody else can submit jobs.
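
      A minimal sketch of the intended startup decision (the property is named mapred.jobtracker.restart.recover on branch-1 and mapreduce.jobtracker.restart.recover on newer branches; recoverJobs and cleanUpSystemDir are illustrative placeholders, not the actual JobTracker methods):

      // Sketch only: what the JobTracker is expected to do at startup,
      // where 'conf' is the JobTracker's JobConf.
      boolean shouldRecover =
          conf.getBoolean("mapred.jobtracker.restart.recover", false);
      if (shouldRecover) {
        recoverJobs();        // resubmit the jobs found in mapred.system.dir
      } else {
        cleanUpSystemDir();   // otherwise stale job files are deleted
      }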

      Thanks,
      Mayank

      1. PATCH-MAPREDUCE-3837.patch
        1 kB
        Mayank Bansal
      2. PATCH-TRUNK-MAPREDUCE-3837.patch
        1 kB
        Mayank Bansal
      3. PATCH-HADOOP-1-MAPREDUCE-3837.patch
        18 kB
        Mayank Bansal
      4. PATCH-HADOOP-1-MAPREDUCE-3837-1.patch
        18 kB
        Mayank Bansal
      5. PATCH-HADOOP-1-MAPREDUCE-3837-2.patch
        18 kB
        Mayank Bansal
      6. PATCH-HADOOP-1-MAPREDUCE-3837-3.patch
        31 kB
        Mayank Bansal
      7. PATCH-HADOOP-1-MAPREDUCE-3837-4.patch
        40 kB
        Mayank Bansal
      8. MAPREDUCE-3837_addendum.patch
        0.6 kB
        Arun C Murthy

        Activity

        Mayank Bansal added a comment -

        PATCH-MAPREDUCE-3837.patch

        This one is for the 0.22 branch. Please review it. Shortly I will be putting up the same for trunk as well.

        Thanks,
        Mayank

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12514029/PATCH-TRUNK-MAPREDUCE-3837.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.util.TestLinuxResourceCalculatorPlugin

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//console

        This message is automatically generated.

        Konstantin Shvachko added a comment -

        +1 The patch looks good. It enables an important feature of automatic job recovery on JT startup.

        Konstantin Shvachko added a comment -

        I just committed this. Thank you Mayank.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1797 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1797/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Common-0.23-Commit #546 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/546/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Commit #534 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/534/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Mahadev konar added a comment -

        @Mayank,
        You should Grant license to Apache when uploading patches.

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #1723 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1723/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Commit #550 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/550/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

        Result = ABORTED
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #1734 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1734/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

        Result = ABORTED
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #195 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/195/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

        Result = FAILURE
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-22-branch #100 (See https://builds.apache.org/job/Hadoop-Mapreduce-22-branch/100/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243700)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243700
        Files :

        • /hadoop/common/branches/branch-0.22/mapreduce/CHANGES.txt
        • /hadoop/common/branches/branch-0.22/mapreduce/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #955 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

        Result = FAILURE
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #168 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

        Result = FAILURE
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #990 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/)
        MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

        Result = SUCCESS
        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
        Mayank Bansal added a comment -

        For the Hadoop-1 patch.

        Mayank Bansal added a comment -

        Attached the patch for Hadoop-1; please review it.

        Thanks,
        Mayank

        Alejandro Abdelnur added a comment -

        Mayank,

        • Built branch-1 with your patch
        • Configured the cluster and ran a test job; it worked OK
        • Configured mapred-site.xml with 'mapred.jobtracker.restart.recover=true'
        • Restarted the JT
        • Created an IN data file in my HDFS home dir
        • Submitted 5 wordcount jobs
        bin/hadoop jar hadoop-*examples*jar wordcount IN OUT0 &
        bin/hadoop jar hadoop-*examples*jar wordcount IN OUT1 &
        bin/hadoop jar hadoop-*examples*jar wordcount IN OUT2 &
        bin/hadoop jar hadoop-*examples*jar wordcount IN OUT3 &
        bin/hadoop jar hadoop-*examples*jar wordcount IN OUT4 &
        
        • Waited until they were all running
        • Killed the JT
        • Restarted the JT

        The jobs are not recovered, and what I see in the logs is:

        2012-03-02 08:55:22,164 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0001. Deleting it!!
        2012-03-02 08:55:22,194 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0002. Deleting it!!
        2012-03-02 08:55:22,204 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0003. Deleting it!!
        2012-03-02 08:55:22,224 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0004. Deleting it!!
        2012-03-02 08:55:22,236 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0005. Deleting it!!
        

        Am I missing some additional configuration?

        Arun C Murthy added a comment -

        -1 on committing to branch-1. We've had innumerable issues with this before, not a good idea for a stable branch.

        Mayank Bansal added a comment -

        Hi Alejandro,

        Thanks for your help testing this patch. I am really sorry about the confusion; I missed one function in the patch. I have attached a new patch, tested it, and it is working fine in my local environment. I am not sure how I missed that before.

        Please let me know if you find any more issues with it.

        Arun,

        I believe the earlier issues were with recovering jobs from the point at which they crashed. What I am doing here is a very simplistic approach: I read the job token file and resubmit the jobs on crash and recovery. I am not trying to recover from the point where the last run left off.

        In this scenario it is a new run of the job, and it works well. The downside is that the whole job will re-run; the upside is that users don't need to resubmit their jobs.
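
        In rough form, the resubmission path amounts to the following sketch (listJobDirs, readJobTokens and resubmitJob are illustrative placeholders, not the actual RecoveryManager code):

        // Sketch of resubmit-on-restart: recovered jobs re-run from scratch.
        if (conf.getBoolean("mapred.jobtracker.restart.recover", false)) {
          for (Path jobDir : listJobDirs(systemDir)) {      // job dirs left by the crashed JT
            JobID jobId = JobID.forName(jobDir.getName());
            Credentials ts = readJobTokens(jobDir);         // job token file in the system dir
            resubmitJob(jobId, ts);                         // submit as a fresh run
          }
        }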

        Please let me know your thoughts.

        Thanks,
        Mayank

        Alejandro Abdelnur added a comment -

        I've tested the latest patch and it works as expected. I agree with Mayank that this approach (rerun the full job) seems much less risky than the previous one (rerun from where it left off). Thus I'm good with the patch, as it is much better than what is currently in.

        Arun, would you reconsider based on the explanation of what Mayank's patch does?

        Konstantin Shvachko added a comment -

        I've been reviewing this patch and have a couple of cosmetic comments below.
        I agree with Alejandro. This is not introducing a new feature; it just enables an already existing one. The risk is low, since the feature is enabled in a restricted context, that is, restarting failed jobs from scratch rather than trying to continue from the point at which they were terminated.
        The patch seems larger than it actually is because it removes the [troubled] logic responsible for resurrecting a job from its history. Beyond that it is simple. Take a look, Arun.

        Cosmetic comments

        • Several lines are too long
        • There are several tabs that should be spaces
        • Indentation is wrong in a couple of places
          recoveryManager.addJobForRecovery(JobID.forName(fileName));
          shouldRecover = true; // enable actual recovery if num-files > 1
        • Add spaces after commas in method calls and parameters
          Otherwise it looks good.
        Mayank Bansal added a comment -

        Incorporating review comments

        Arun C Murthy added a comment -

        Apologies for the late response, I missed this.

        Thanks for the clarification Mayank, Tucu & Konst. I agree it's much more palatable without all the complexities of trying to recover jobs from point-of-crash.

        Couple of questions:
        a) How does it work in a secure setting?
        b) We should at least add some docs on this feature.

        Makes sense?

        Mayank Bansal added a comment -

        Thanks Arun for your reply.

        a) It reads the user ID from the job token stored in the system directory and submits the job as that user, so the actual job runs as that user (sketched below).
        b) You are right; I will add the documentation and append it to the patch.
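
        A sketch of (a), using the standard UserGroupInformation API (readUserFromJobInfo and resubmitJob are illustrative placeholders, not the actual JobTracker methods):

        // Sketch: resubmit the recovered job as its original submitter.
        void recoverAsOwner(final JobID jobId) throws IOException, InterruptedException {
          String user = readUserFromJobInfo(jobId);   // user id stored with the job token
          UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
          ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
              resubmitJob(jobId);                     // the job now runs as 'user'
              return null;
            }
          });
        }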

        Thanks,
        Mayank

        Tom White added a comment -

        TestRecoveryManager and TestJobTrackerRestartWithLostTracker failed for me with this patch. Mayank - can you update them for this JIRA please?

        Mayank Bansal added a comment -

        When I put up this patch it did not have this issue. Let me update the patch.
        Thanks for finding this out.

        Thanks,
        Mayank

        Todd Lipcon added a comment -

        Arun: I noticed this is listed as one of the patches in HDP. Does that imply that you're removing your -1? Or do you have a new patch that you're shipping in your product that you haven't open-sourced yet?

        Mayank Bansal added a comment -

        Hi Todd,

        Arun gave a -1 because he was under the impression that I was trying to restore the state; when I explained that it is not a restore but a resubmit, he was OK with it.

        From what Arun told me, the patch in HDP is more or less the same, plus one bug fix which he did.

        I will update the patch based on Tom's comment.

        Arun, can you also post the bug fix which you did?

        Thanks,
        Mayank

        Mayank Bansal added a comment -

        Hi Tom,

        I just took the latest 1.1 code base and ran the two test cases which you mentioned above, without my patch, and they are still failing.

        Thanks,
        Mayank

        Tom White added a comment -

        Mayank - thanks for pointing that out. I just tried and they fail for me on the latest branch-1 code too. We do need tests for job tracker recovery though, so they should be fixed to ensure that the code in this patch is tested and doesn't regress, don't you think?

        Mayank Bansal added a comment -

        Agreed; I am working on it and will update soon.

        Thanks,
        Mayank

        Arun C Murthy added a comment -

        Mayank, as we briefly discussed, you'll need to fix the re-submit to read the job tokens from HDFS and pass them along (i.e. a Credentials object) to the submitJob API. Sorry, I've been traveling a lot and missed commenting here, my bad.

        Other nits:

        1. You've removed the call to JobClient.isJobDirValid, which is dangerous. Since the contents have changed in hadoop-1 post security, please add a private isJobDirValid method to the JT and use it. This method should check for the jobInfo file on HDFS (JobTracker.JOB_INFO_FILE) and the jobTokens file (TokenCache.JOB_TOKEN_HDFS_FILE); see the sketch after this list.
        2. Also, since we only care about jobIds now for JT recovery, it's better to add a Set<JobId> jobIdsToRecover rather than rely on Set<JobInfo> jobsToRecover. This way we can avoid all the unnecessary translations b/w o.a.h.mapred.JobId and o.a.h.mapreduce.JobId.
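
        A sketch of the check described in (1), using only the two constants named above (the surrounding signature is illustrative, not taken from the patch):

        // Sketch: a job dir is valid if both the job-info and job-token files exist.
        private boolean isJobDirValid(Path jobDir, FileSystem fs) throws IOException {
          return fs.exists(new Path(jobDir, JobTracker.JOB_INFO_FILE))
              && fs.exists(new Path(jobDir, TokenCache.JOB_TOKEN_HDFS_FILE));
        }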
        Mayank Bansal added a comment -

        Hi Arun,

        As suggested by you:

        1) I added the credentials to the resubmit API.
        2) I added the isJobDirValid API as well.
        3) My patch already uses jobId instead of jobInfo, so no change is required.

        Hi Tom,

        I added the new test case and fixed the RecoveryManager test case as well in the latest patch.

        I fixed one more recovery issue which I found here in production.

        Please review the patch.

        Thanks,
        Mayank

        Tom White added a comment -

        Mayank - thanks for the changes. Here's my feedback:

        • If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?
        • In JobTracker you changed "shouldRecover = false;" to "shouldRecover = true;" without updating the comment on the line before. (This might be related to the previous point about not having restart counts.)
        • Remove the @Ignore annotation from TestRecoveryManager and the comment about MAPREDUCE-873.
        • The new test testJobresubmission (should be testJobResubmission) should test that the job succeeded after the restart. Also, there's no reason to run it as a high-priority job.
        • There's a comment saying it is a "faulty job" - which it isn't.
        • Have setUp and tearDown methods to start and stop the cluster (see the sketch after this list). At the moment there is code duplication, and clusters won't be shut down cleanly on failure.
        • testJobTracker would be better named testJobTrackerRestartsWithMissingJobFile
        • testRecoveryManager would be better named testJobTrackerRestartWithBadJobs
        • There are multiple typos and formatting errors (including indentation, which should be 2 spaces) in the new code. See Konstantin's comment above.
        • TestJobTrackerRestartWithLostTracker still fails, as does TestJobTrackerSafeMode. These should be fixed as a part of this work.
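
        For the setUp/tearDown point, a minimal shape along these lines would avoid the duplication (cluster sizes and field names are illustrative, not taken from the patch):

        private MiniDFSCluster dfs;
        private MiniMRCluster mr;

        @Before
        public void setUp() throws Exception {
          JobConf conf = new JobConf();
          dfs = new MiniDFSCluster(conf, 1, true, null);
          mr = new MiniMRCluster(1, dfs.getFileSystem().getUri().toString(), 1);
        }

        @After
        public void tearDown() throws Exception {
          if (mr != null) mr.shutdown();    // always stop the JT/TTs
          if (dfs != null) dfs.shutdown();  // and the mini DFS, even on test failure
        }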
        Mayank Bansal added a comment -

        Thanks, Tom, for your comments. I incorporated everything except the point below:

        If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?

        You are right, and we should clean up the restart count. However, it looks to me like it needs to be examined more closely and needs more testing. Do you mind if I open a separate JIRA and work on that separately from this one?

        The rest of the comments are incorporated in my latest patch.

        Thanks,
        Mayank

        Mayank Bansal added a comment -

        Attaching latest patch after incorporating Tom's comments.

        Thanks,
        Mayank

        Tom White added a comment -

        +1 to the latest patch - thanks for addressing my feedback, Mayank. Can you run test-patch and the unit tests if you haven't already, please?

        Cleaning up the restart count code in a separate JIRA is fine by me.

        Mayank Bansal added a comment -

        Test Patch Results are as follows:

        [exec] BUILD SUCCESSFUL
        [exec] Total time: 4 minutes 7 seconds
        [exec]
        [exec]
        [exec]
        [exec]
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 9 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]
        [exec]
        [exec]
        [exec]
        [exec] ======================================================================
        [exec] ======================================================================
        [exec] Finished build.
        [exec] ======================================================================
        [exec] ======================================================================

        Mayank Bansal added a comment -

        I just now completed the commit tests successfully.
        I ran all unit tests before attaching the patch, and those also completed successfully.

        Thanks,
        Mayank

        Tom White added a comment -

        I just committed this to branch-1. Thanks Mayank!

        Arun C Murthy added a comment -

        Looks like this needs a minor update to get it to work on Mac OSX...

        Arun C Murthy added a comment -

        I see this on a single node cluster.

        Without this patch, tasks which are re-run fail with:

        2012-07-11 05:43:18,299 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201207110542_0001_m_000000_0: java.lang.Throwable: Child Error
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
        Caused by: java.io.IOException: Creation of /tmp/hadoop-acmurthy/mapred/local/userlogs/job_201207110542_0001/attempt_201207110542_0001_m_000000_0 failed.
        	at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
        	at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
        	at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
        

        The problem is that mkdirs (at least on Mac OS X) returns false if the directory exists and wasn't created during the call.

        A straightforward patch to check for existence fixes it.
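
        The addendum patch itself is not reproduced here; in essence the fix is an existence check of the following shape (the variable name is illustrative):

        // Sketch: treat an already-existing directory as success.
        if (!attemptLogDir.mkdirs() && !attemptLogDir.exists()) {
          throw new IOException("Creation of " + attemptLogDir + " failed.");
        }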

        Arun C Murthy added a comment -

        Looks like this needs a minor update to get it to work on Mac OSX...

        Could be any single-node cluster too...

        Tom White added a comment -

        +1 to the fix. FWIW I didn't see this when testing on a single-node cluster (on Mac OS X).

        Mayank Bansal added a comment -

        I did not see this either when testing on my single-node cluster on Mac OS X; however, the fix looks good to me.

        +1 Thanks Arun.

        Thanks,
        Mayank

        Arun C Murthy added a comment -

        Thanks for the reviews Tom & Mayank. I've just committed the small patch.

        Arun C Murthy added a comment -

        I just merged this to branch-1.1 after Matt's go ahead.

        Matt Foley added a comment -

        Closed upon release of Hadoop-1.1.0.

        Matt Foley added a comment (edited) -

        It seems that the merge to branch-1.1 on 25/Sep/12, which went into 1.1.0, only included the base fix.
        The addendum from Arun was merged to branch-1.1 on 06/Dec/12 and will be part of release 1.1.2.

        Manish Malhotra added a comment -

        Hi,

        Thanks for this patch; it's very important. In the future, I think ZK would be great for JT failover.
        I need some help applying this patch. I'm using Hadoop 1.0.2 and want to apply it.
        I believe 1.0.2 is a descendant of 0.20.2.

        So, please let me know whether any of these patches will work for 1.0.2 or not.

        Regards,
        Manish


          People

          • Assignee: Mayank Bansal
          • Reporter: Mayank Bansal
          • Votes: 0
          • Watchers: 13

            Dates

            • Created:
              Updated:
              Resolved:
