Hadoop Map/Reduce · MAPREDUCE-3837

Job tracker is not able to recover jobs in case of a crash, and after that no user can submit jobs.

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0, 1.1.1
    • Fix Version/s: 1.1.0, 0.22.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      If the JobTracker crashes while jobs are running and the JobTracker property mapreduce.jobtracker.restart.recover is set to true, it should recover those jobs on restart.

      However, the current behavior is as follows:
      the JobTracker tries to restore the jobs but cannot. After that, the JobTracker closes its handle to HDFS and nobody else can submit jobs.
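
      For reference, a minimal sketch (an illustration only, not part of the reported behavior) of turning on the recovery switch described above via the Configuration API; the branch-1 equivalent key mapred.jobtracker.restart.recover is the one used later in this thread:

      import org.apache.hadoop.conf.Configuration;

      public class EnableJobRecoverySketch {
        public static void main(String[] args) {
          // Sketch only: turn on job recovery after a JobTracker restart.
          // The property name is taken from this issue's description (0.22 naming);
          // on branch-1 the equivalent key is mapred.jobtracker.restart.recover.
          Configuration conf = new Configuration();
          conf.setBoolean("mapreduce.jobtracker.restart.recover", true);
          System.out.println("restart.recover = "
              + conf.getBoolean("mapreduce.jobtracker.restart.recover", false));
        }
      }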

      Thanks,
      Mayank

      Attachments

        1. PATCH-TRUNK-MAPREDUCE-3837.patch
          1 kB
          Mayank Bansal
        2. PATCH-MAPREDUCE-3837.patch
          1 kB
          Mayank Bansal
        3. PATCH-HADOOP-1-MAPREDUCE-3837-4.patch
          40 kB
          Mayank Bansal
        4. PATCH-HADOOP-1-MAPREDUCE-3837-3.patch
          31 kB
          Mayank Bansal
        5. PATCH-HADOOP-1-MAPREDUCE-3837-2.patch
          18 kB
          Mayank Bansal
        6. PATCH-HADOOP-1-MAPREDUCE-3837-1.patch
          18 kB
          Mayank Bansal
        7. PATCH-HADOOP-1-MAPREDUCE-3837.patch
          18 kB
          Mayank Bansal
        8. MAPREDUCE-3837_addendum.patch
          0.6 kB
          Arun Murthy

        Activity

          mayank_bansal Mayank Bansal added a comment -

          PATCH-MAPREDUCE-3837.patch

          this one is for 22 branch. Please review that. Shortly I will be putting the same for trunk as well.

          Thanks,
          Mayank

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12514029/PATCH-TRUNK-MAPREDUCE-3837.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.yarn.util.TestLinuxResourceCalculatorPlugin

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//console

          This message is automatically generated.


          shv Konstantin Shvachko added a comment -

          +1 The patch looks good. It enables an important feature of automatic job recovery on JT startup.

          shv Konstantin Shvachko added a comment -

          I just committed this. Thank you Mayank.
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1797 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1797/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Common-0.23-Commit #546 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/546/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Commit #534 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/534/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          mahadev Mahadev Konar added a comment -

          @Mayank,
          You should Grant license to Apache when uploading patches.

          hudson Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1723 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1723/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Commit #550 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/550/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

          Result = ABORTED
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1734 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1734/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

          Result = ABORTED
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #195 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/195/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

          Result = FAILURE
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-22-branch #100 (See https://builds.apache.org/job/Hadoop-Mapreduce-22-branch/100/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243700)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243700
          Files :

          • /hadoop/common/branches/branch-0.22/mapreduce/CHANGES.txt
          • /hadoop/common/branches/branch-0.22/mapreduce/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #955 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

          Result = FAILURE
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #168 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

          Result = FAILURE
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #990 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/)
          MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
          mayank_bansal Mayank Bansal added a comment -

          For Hadoop-1 patch.

          mayank_bansal Mayank Bansal added a comment -

          Attached the patch for Hadoop -1, please review that.

          Thanks,
          Mayank


          tucu00 Alejandro Abdelnur added a comment -

          Mayank,

          • Built branch-1 with your patch
          • Configured the cluster and ran a test job, which was OK
          • Configured the mapred-site.xml with 'mapred.jobtracker.restart.recover=true'
          • Restarted the JT
          • Created an IN data file in my HDFS home dir
          • Submitted 5 wordcount jobs
          bin/hadoop jar hadoop-*examples*jar wordcount IN OUT0 &
          bin/hadoop jar hadoop-*examples*jar wordcount IN OUT1 &
          bin/hadoop jar hadoop-*examples*jar wordcount IN OUT2 &
          bin/hadoop jar hadoop-*examples*jar wordcount IN OUT3 &
          bin/hadoop jar hadoop-*examples*jar wordcount IN OUT4 &
          
          • Waited until they were all running
          • Killed the JT
          • Restarted the JT

          The jobs are not recovered, and what I see in the logs is:

          2012-03-02 08:55:22,164 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0001. Deleting it!!
          2012-03-02 08:55:22,194 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0002. Deleting it!!
          2012-03-02 08:55:22,204 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0003. Deleting it!!
          2012-03-02 08:55:22,224 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0004. Deleting it!!
          2012-03-02 08:55:22,236 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0005. Deleting it!!
          

          Am I missing some additional configuration?

          acmurthy Arun Murthy added a comment -

          -1 on committing to branch-1. We've had innumerable issues with this before, not a good idea for a stable branch.

          mayank_bansal Mayank Bansal added a comment -

          Hi Alejandro

          Thanks for your help testing this patch. I am really sorry about the confusion: I missed one function in the patch. I have attached the new patch and tested it, and it is working fine in my local environment. I am not sure how I missed that before.

          Please let me know if you find any more issues with that.

          Arun,

          I believe the earlier issues were with recovering jobs from the point at which they crashed. What I am doing here is a very simple approach: I am reading the job token file and resubmitting the jobs after a crash and recovery. I am not trying to resume from the point where the last run left off.

          In this scenario it is a fresh run of the job, and it works well. The downside is that the whole job reruns; the upside is that users don't need to resubmit their jobs.

          Please let me know your thoughts.

          Thanks,
          Mayank
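
          As a rough sketch (an assumption about the shape of the approach, not the actual patch), the "resubmit from scratch" idea described above amounts to listing the leftover job directories under the system dir and queueing their IDs for resubmission; the helper names below are placeholders:

          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.JobID;

          // Sketch of the scan-and-resubmit idea, not the committed patch: list the
          // job directories left under the system dir and queue their IDs so the
          // whole jobs can be rerun from the beginning.
          public class RecoveryScanSketch {
            static void queueJobsForResubmission(FileSystem fs, Path systemDir) throws Exception {
              for (FileStatus stat : fs.listStatus(systemDir)) {
                String name = stat.getPath().getName();    // e.g. job_201203020852_0001
                if (name.startsWith("job_")) {
                  JobID jobId = JobID.forName(name);       // parse the job id from the dir name
                  resubmitLater(jobId);                    // rerun the whole job from scratch
                }
              }
            }
            static void resubmitLater(JobID jobId) { /* placeholder for the actual resubmission */ }
          }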


          tucu00 Alejandro Abdelnur added a comment -

          I've tested the last patch and it works as expected. I agree with Mayank that this approach (rerun the full job) seems much less risky than the previous approach (resume from where it left off). So I'm good with the patch; it is much better than what is currently in.

          Arun, would you reconsider based on the explanation of what Mayank's patch does?


          shv Konstantin Shvachko added a comment -

          I've been reviewing this patch, and have a couple of cosmetic comments below.
          I agree with Alejandro. This is not introducing a new feature; it is just enabling an already existing one. There is low risk, since the feature is enabled in a restricted context, that is, restarting failed jobs from scratch rather than trying to continue from the point where they were terminated.
          The patch seems larger than it actually is, because it removes the [troubled] logic responsible for resurrecting a job from its history. Besides that it is simple. Take a look, Arun.

          Cosmetic comments

          • Several lines are too long
          • I see several tabs; these should be spaces
          • Indentation is wrong in a couple of places:
            recoveryManager.addJobForRecovery(JobID.forName(fileName));
            shouldRecover = true; // enable actual recovery if num-files > 1
          • Add spaces after commas in method calls and parameters
            Otherwise it looks good.
          mayank_bansal Mayank Bansal added a comment -

          Incorporating review comments

          acmurthy Arun Murthy added a comment -

          Apologies for the late response, I missed this.

          Thanks for the clarification Mayank, Tucu & Konst. I agree it's much more palatable without all the complexities of trying to recover jobs from point-of-crash.

          Couple of questions:
          a) How does it work in a secure setting?
          b) We should at least add some docs on this feature.

          Makes sense?

          mayank_bansal Mayank Bansal added a comment -

          Thanks Arun for your reply.

          a) It reads the user id from the job token stored in the system directory and submits the job as that user, so the actual job runs as that user.
          b) Yes, you are right; I will add the documentation and append it to the patch.

          Thanks,
          Mayank
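
          To illustrate the "submit the recovered job as the original user" part described above, a hedged sketch of the doAs pattern is shown below. How the user name and credentials are read from the job token file is elided, and the class and method names here are placeholders, not the committed code.

          import java.security.PrivilegedExceptionAction;
          import org.apache.hadoop.mapred.JobClient;
          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.security.UserGroupInformation;

          // Sketch only: resubmit a recovered job under the identity of the user
          // who originally submitted it.
          public class ResubmitAsUserSketch {
            static void resubmitAs(String user, final JobConf jobConf) throws Exception {
              UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
              ugi.doAs(new PrivilegedExceptionAction<Void>() {
                public Void run() throws Exception {
                  JobClient client = new JobClient(jobConf);
                  client.submitJob(jobConf);   // the job simply reruns from the beginning
                  return null;
                }
              });
            }
          }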

          tomwhite Thomas White added a comment -

          TestRecoveryManager and TestJobTrackerRestartWithLostTracker failed for me with this patch. Mayank - can you update them for this JIRA please?

          mayank_bansal Mayank Bansal added a comment -

          When I put up this patch it did not have this issue. Let me update the patch.
          Thanks for finding this out.

          Thanks,
          Mayank

          tlipcon Todd Lipcon added a comment -

          Arun: I noticed this is listed as one of the patches in HDP. Does that imply that you're removing your -1? Or do you have a new patch that you're shipping in your product that you haven't open-sourced yet?

          mayank_bansal Mayank Bansal added a comment -

          Hi Todd,

          Arun gave a -1 because he was under the impression that I was trying to restore the job state; when I explained that it is not a restore but a resubmit, he was OK with it.

          From what Arun told me, the patch in HDP is more or less the same, plus one bug fix which he made.

          I will update the patch based on Tom's comment.

          Arun, can you also post the bug fix you made?

          Thanks,
          Mayank

          mayank_bansal Mayank Bansal added a comment -

          Hi Tom,

          I just took the latest 1.1 code base and ran the two test cases you mentioned above, without my patch, and they are still failing.

          Thanks,
          Mayank

          tomwhite Thomas White added a comment -

          Mayank - thanks for pointing that out. I just tried and they fail for me on the latest branch-1 code too. We do need tests for job tracker recovery though, so they should be fixed to ensure that the code in this patch is tested and doesn't regress, don't you think?

          mayank_bansal Mayank Bansal added a comment -

          Agreed; I am working on it and will update soon.

          Thanks,
          Mayank

          acmurthy Arun Murthy added a comment -

          Mayank, as we briefly discussed, you'll need to fix the re-submit to read job tokens from HDFS and pass them along (i.e. a Credentials object) to the submitJob API. Sorry, I've been traveling a lot and missed commenting here, my bad.

          Other nits:

          1. You've removed the call to JobClient.isJobDirValid, which is dangerous. Since the contents have changed in hadoop-1 post security, please add a private isJobDirValid method to the JT and use it. This method should check for the jobInfo file on HDFS (JobTracker.JOB_INFO_FILE) and the jobTokens file (TokenCache.JOB_TOKEN_HDFS_FILE); a rough sketch follows after this list.
          2. Also, since we only care about jobIds now for JT recovery, it's better to add a Set<JobId> jobIdsToRecover rather than rely on Set<JobInfo> jobsToRecover. This way we can avoid all the unnecessary translations b/w o.a.h.mapred.JobId and o.a.h.mapreduce.JobId.
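
          A rough illustration of the private check suggested in point 1 is sketched below. The constant names JobTracker.JOB_INFO_FILE and TokenCache.JOB_TOKEN_HDFS_FILE come from the comment above; the literal file names used in their place are assumptions, and this is not the committed code.

          import java.io.IOException;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Sketch only: a job directory is considered valid if both the job-info
          // file and the job token file are present on HDFS.
          public class JobDirCheckSketch {
            static boolean isJobDirValid(FileSystem fs, Path jobDir) throws IOException {
              Path jobInfoFile = new Path(jobDir, "job-info");   // stands in for JobTracker.JOB_INFO_FILE
              Path jobTokenFile = new Path(jobDir, "jobToken");  // stands in for TokenCache.JOB_TOKEN_HDFS_FILE
              return fs.exists(jobInfoFile) && fs.exists(jobTokenFile);
            }
          }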
          mayank_bansal Mayank Bansal added a comment -

          Hi Arun,

          As you suggested:

          1) I added the credentials to the resubmit API.
          2) I added the isJobDirValid API as well.
          3) My patch already uses jobId instead of jobInfo, so no change is required.

          Hi Tom,

          I added the new test case and fixed the RecoveryManager test case as well in the latest patch.

          I also fixed one more recovery issue which I found here in production.

          Please review the patch.

          Thanks,
          Mayank

          tomwhite Thomas White added a comment -

          Mayank - thanks for the changes. Here's my feedback:

          • If there is no need for the restart count anymore - since jobs are re-run from the beginning each time - would it be cleaner to remove it entirely?
          • In JobTracker you changed "shouldRecover = false;" to "shouldRecover = true;" without updating the comment on the line before. (This might be related to the previous point about not having restart counts.)
          • Remove the @Ignore annotation from TestRecoveryManager and the comment about MAPREDUCE-873.
          • The new test testJobresubmission (should be testJobResubmission) should test that the job succeeded after the restart. Also, there's no reason to run it as a high-priority job.
          • There's a comment saying it is a "faulty job" - which it isn't.
          • Have setUp and tearDown methods to start and stop the cluster. At the moment there is code duplication, and clusters won't be shut down cleanly on failure.
          • testJobTracker would be better named testJobTrackerRestartsWithMissingJobFile
          • testRecoveryManager would be better named testJobTrackerRestartWithBadJobs
          • There are multiple typos and formatting errors (including indentation, which should be 2 spaces) in the new code. See Konstantin's comment above.
          • TestJobTrackerRestartWithLostTracker still fails, as does TestJobTrackerSafeMode. These should be fixed as a part of this work.
          mayank_bansal Mayank Bansal added a comment -

          Thanks, Tom, for your comments. I incorporated everything except the point below:

          If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?

          Yes, you are right that we should clean up the restart count. However, it looks to me like it needs to be examined more closely and requires more testing. Do you mind if I open a separate JIRA and work on that separately from this one?

          Rest of the comments are incorporated in my latest patch.

          Thanks,
          Mayank

          mayank_bansal Mayank Bansal added a comment -

          Attaching latest patch after incorporating Tom's comments.

          Thanks,
          Mayank

          tomwhite Thomas White added a comment -

          +1 to the latest patch - thanks for addressing my feedback, Mayank. Can you run test-patch and the unit tests if you haven't already, please?

          Cleaning up the restart count code in a separate JIRA is fine by me.

          mayank_bansal Mayank Bansal added a comment -

          Test Patch Results are as follows:

          [exec] BUILD SUCCESSFUL
          [exec] Total time: 4 minutes 7 seconds
          [exec]
          [exec]
          [exec]
          [exec]
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 9 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
          [exec]
          [exec]
          [exec]
          [exec]
          [exec] ======================================================================
          [exec] ======================================================================
          [exec] Finished build.
          [exec] ======================================================================
          [exec] ======================================================================

          mayank_bansal Mayank Bansal added a comment -

          I just completed the commit-tests successfully.
          I had already run all the unit tests before attaching the patch, and those also completed successfully.

          Thanks,
          Mayank

          tomwhite Thomas White added a comment -

          I just committed this to branch-1. Thanks Mayank!

          acmurthy Arun Murthy added a comment -

          Looks like this needs a minor update to get it to work on Mac OSX...

          acmurthy Arun Murthy added a comment -

          I see this on a single node cluster.

          Without this patch, tasks which are re-run fail with:

          2012-07-11 05:43:18,299 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201207110542_0001_m_000000_0: java.lang.Throwable: Child Error
          	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
          Caused by: java.io.IOException: Creation of /tmp/hadoop-acmurthy/mapred/local/userlogs/job_201207110542_0001/attempt_201207110542_0001_m_000000_0 failed.
          	at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
          	at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
          	at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
          	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
          

          The problem is that mkdirs (at least on mac-osx) returns false if the directory exists and wasn't created during the call.

          A straightforward patch to check for existence fixes it.
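
          A minimal sketch of the existence check described above (not the committed addendum itself): treat a false return from mkdirs() as acceptable when the directory is already there.

          import java.io.File;
          import java.io.IOException;

          // Sketch only: mkdirs() can return false if the directory already exists,
          // so only fail when the directory is genuinely missing afterwards.
          public class CreateLogDirSketch {
            static void createLogDir(File attemptLogDir) throws IOException {
              if (!attemptLogDir.mkdirs() && !attemptLogDir.exists()) {
                throw new IOException("Creation of " + attemptLogDir + " failed.");
              }
            }
          }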

          acmurthy Arun Murthy added a comment -

          Looks like this needs a minor update to get it to work on Mac OSX...

          Could be any single-node cluster too...

          tomwhite Thomas White added a comment -

          +1 to the fix. FWIW I didn't see this when testing on a single-node cluster (on Mac OS X).

          mayank_bansal Mayank Bansal added a comment -

          I also did not see this when testing on my single-node cluster on Mac OS X; however, the fix looks good to me.

          +1 Thanks Arun.

          Thanks,
          Mayank

          acmurthy Arun Murthy added a comment -

          Thanks for the reviews Tom & Mayank. I've just committed the small patch.

          acmurthy Arun Murthy added a comment -

          I just merged this to branch-1.1 after Matt's go ahead.

          mattf Matthew Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          mattf Matthew Foley added a comment - edited

          It seems that the merge to branch-1.1 on 25/Sep/12, which went into 1.1.0, only included the base fix.
          The addendum from Arun was merged to branch-1.1 on 06/Dec/12 and will be part of release 1.1.2.


          manish_malhotra Manish Malhotra added a comment -

          Hi,

          Thanks for this patch; it's very important. In the future, I think using ZK would be great for JT failover.
          I need some help with applying this. I'm using Hadoop 1.0.2 and want to apply this patch.
          I believe 1.0.2 is a descendant of 0.20.2.

          So, please let me know whether any of these patches will work for 1.0.2 or not.

          Regards,
          Manish


          People

            Assignee: mayank_bansal Mayank Bansal
            Reporter: mayank_bansal Mayank Bansal
            Votes: 0
            Watchers: 14

            Dates

              Created:
              Updated:
              Resolved: