Details
Description
If the JobTracker crashes while jobs are running, and its property mapreduce.jobtracker.restart.recover is set to true, it should recover those jobs on restart.
However, the current behavior is as follows:
The JobTracker tries to restore the jobs but cannot. It then closes its handle to HDFS, after which nobody can submit jobs.
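For context, recovery is controlled by a single flag in mapred-site.xml. This is a sketch using the branch-1 property name that appears later in this thread (the trunk name differs):

```xml
<!-- mapred-site.xml: ask the JobTracker to recover running jobs on restart -->
<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
</property>
```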
Thanks,
Mayank
Attachments
- PATCH-TRUNK-MAPREDUCE-3837.patch
- 1 kB
- Mayank Bansal
- PATCH-MAPREDUCE-3837.patch
- 1 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-4.patch
- 40 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-3.patch
- 31 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-2.patch
- 18 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-1.patch
- 18 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837.patch
- 18 kB
- Mayank Bansal
- MAPREDUCE-3837_addendum.patch
- 0.6 kB
- Arun Murthy
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12514029/PATCH-TRUNK-MAPREDUCE-3837.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.yarn.util.TestLinuxResourceCalculatorPlugin
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//console
This message is automatically generated.
+1 The patch looks good. It enables an important feature of automatic job recovery on JT startup.
Integrated in Hadoop-Hdfs-trunk-Commit #1797 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1797/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Common-0.23-Commit #546 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/546/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-0.23-Commit #534 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/534/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Common-trunk-Commit #1723 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1723/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-0.23-Commit #550 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/550/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = ABORTED
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-trunk-Commit #1734 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1734/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = ABORTED
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-0.23-Build #195 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/195/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-22-branch #100 (See https://builds.apache.org/job/Hadoop-Mapreduce-22-branch/100/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243700)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243700
Files :
- /hadoop/common/branches/branch-0.22/mapreduce/CHANGES.txt
- /hadoop/common/branches/branch-0.22/mapreduce/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-trunk #955 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-0.23-Build #168 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-trunk #990 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Attached the patch for branch-1; please review it.
Thanks,
Mayank
Mayank,
- Built branch-1 with your patch
- Configured the cluster and ran a test job: OK
- Configured the mapred-site.xml with 'mapred.jobtracker.restart.recover=true'
- Restarted the JT
- Created an IN data file in my HDFS home dir
- Submitted 5 wordcount jobs
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT0 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT1 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT2 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT3 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT4 &
- Waited till they are all running
- Killed the JT
- Restarted the JT
The jobs are not recovered, and what I see in the logs is:
2012-03-02 08:55:22,164 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0001. Deleting it!!
2012-03-02 08:55:22,194 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0002. Deleting it!!
2012-03-02 08:55:22,204 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0003. Deleting it!!
2012-03-02 08:55:22,224 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0004. Deleting it!!
2012-03-02 08:55:22,236 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0005. Deleting it!!
Am I missing some additional configuration?
-1 on committing to branch-1. We've had innumerable issues with this before, not a good idea for a stable branch.
Hi Alejandro
Thanks for your help testing this patch. I am really sorry about the confusion; I missed one function in the patch. I have attached a new patch, tested it, and it works fine in my local environment. I am not sure how I missed that before.
Please let me know if you find any more issues with that.
Arun,
I believe the earlier issues were with recovering jobs from the point at which they crashed. What I am doing here is a very simplistic approach: I read the job token file and resubmit the jobs on crash and recovery. I am not trying to resume from the point the last run left off.
In this scenario it is a fresh run of the job, and it works well. The downside is that the whole job reruns; the upside is that users don't need to resubmit their jobs.
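To make the resubmit-on-restart idea concrete, here is a hedged, self-contained Java sketch. This is not JobTracker code; the directory layout and the job_... naming are assumptions based on the log messages earlier in this thread:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: after a crash, scan the system directory for per-job
// subdirectories left by the previous run and collect their job IDs
// so the jobs can be resubmitted from scratch.
public class RecoveryScan {
    static List<String> jobsToResubmit(File systemDir) {
        List<String> ids = new ArrayList<>();
        File[] entries = systemDir.listFiles();
        if (entries == null) {
            return ids; // system dir missing or unreadable: nothing to recover
        }
        for (File e : entries) {
            // Assumption for this sketch: the directory name is the job ID,
            // e.g. job_201203020852_0001.
            if (e.isDirectory() && e.getName().startsWith("job_")) {
                ids.add(e.getName());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        File sys = new File(System.getProperty("java.io.tmpdir"), "jt-system-dir-demo");
        new File(sys, "job_201203020852_0001").mkdirs();
        new File(sys, "job_201203020852_0002").mkdirs();
        System.out.println(jobsToResubmit(sys).size());
    }
}
```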
Please let me know your thoughts.
Thanks,
Mayank
I've tested the last patch and it works as expected. I agree with Mayank that this approach (rerunning the full job) seems much less risky than the previous one (resuming from where it left off). So I'm good with the patch; it is much better than what is currently in.
Arun, would you reconsider based on the explanation of what Mayank's patch does?
I've been reviewing this patch, and have a couple of cosmetic comments below.
I agree with Alejandro. This is not introducing a new feature; it just enables an already existing one. The risk is low, since the feature is enabled in a restricted context: restarting failed jobs from scratch rather than trying to continue from the point they were terminated.
The patch seems larger than it actually is, because it removes the [troubled] logic responsible for resurrecting a job from its history. Besides that it is simple. Take a look, Arun.
Cosmetic comments
- Several lines are too long
- See several tabs - should be spaces
- indentation is wrong in couple of places
recoveryManager.addJobForRecovery(JobID.forName(fileName));
shouldRecover = true; // enable actual recovery if num-files > 1
- Add spaces after commas in method calls and parameters
Otherwise it looks good.
Apologies for the late response, I missed this.
Thanks for the clarification Mayank, Tucu & Konst. I agree it's much more palatable without all the complexities of trying to recover jobs from point-of-crash.
Couple of questions:
a) How does it work in a secure setting?
b) We should at least add some docs on this feature.
Makes sense?
Thanks Arun for your reply.
a) It reads the user id from the job token stored in the system directory and submits the job as that user, so the actual job runs as that user.
b) Yes, you are right; I will add documentation and include it in the patch.
Thanks,
Mayank
TestRecoveryManager and TestJobTrackerRestartWithLostTracker failed for me with this patch. Mayank - can you update them for this JIRA please?
When I created this patch it did not have this issue. Let me update the patch.
Thanks for finding this out.
Thanks,
Mayank
Arun: I noticed this is listed as one of the patches in HDP. Does that imply that you're removing your -1? Or do you have a new patch that you're shipping in your product that you haven't open-sourced yet?
Hi Todd,
Arun gave a -1 because he was under the impression that I was trying to restore job state; once I explained that it is a resubmit rather than a restore, he was OK with it.
From what Arun told me, the patch in HDP is more or less the same, plus one bug fix that he made.
I will update the patch based on Tom's comments.
Arun, can you also post the bug fix you made?
Thanks,
Mayank
Hi Tom,
I just took the latest 1.1 code base and ran the two test cases you mentioned above without my patch, and they still fail.
Thanks,
Mayank
Mayank - thanks for pointing that out. I just tried and they fail for me on the latest branch-1 code too. We do need tests for job tracker recovery though, so they should be fixed to ensure that the code in this patch is tested and doesn't regress, don't you think?
Mayank, as we briefly discussed you'll need to fix the re-submit to read jobtokens from HDFS and pass them along (i.e. Credentials object) to the submitJob api. Sorry, I've been traveling a lot and missed commenting here, my bad.
Other nits:
- You've removed the call to JobClient.isJobDirValid which is dangerous. Since the contents have changed in hadoop-1 post security, please add a private isJobDirValid method to the JT and use it. This method should check for jobInfo file on HDFS (JobTracker.JOB_INFO_FILE) and the jobTokens file (TokenCache.JOB_TOKEN_HDFS_FILE).
- Also, since we only care about jobIds now for JT recovery, it's better to add a Set<JobId> jobIdsToRecover rather than rely on Set<JobInfo> jobsToRecover. This way we can avoid all the unnecessary translations b/w o.a.h.mapred.JobId and o.a.h.mapreduce.JobId.
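A hedged sketch of the private validity check Arun describes. The file names "job-info" and "jobToken" are placeholders standing in for JobTracker.JOB_INFO_FILE and TokenCache.JOB_TOKEN_HDFS_FILE, and the real check would run against HDFS, not the local filesystem:

```java
import java.io.File;
import java.io.IOException;

// Sketch: a job directory is considered valid for recovery only if
// both the job-info file and the job-tokens file survived the crash.
public class JobDirCheck {
    static boolean isJobDirValid(File jobDir) {
        return new File(jobDir, "job-info").isFile()   // placeholder for JobTracker.JOB_INFO_FILE
            && new File(jobDir, "jobToken").isFile();  // placeholder for TokenCache.JOB_TOKEN_HDFS_FILE
    }

    public static void main(String[] args) throws IOException {
        File jobDir = new File(System.getProperty("java.io.tmpdir"), "job_000000000000_0001");
        jobDir.mkdirs();
        new File(jobDir, "jobToken").delete();         // ensure a clean starting state for the demo
        new File(jobDir, "job-info").createNewFile();
        System.out.println(isJobDirValid(jobDir));     // tokens file still missing
        new File(jobDir, "jobToken").createNewFile();
        System.out.println(isJobDirValid(jobDir));     // both files present now
    }
}
```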
Hi Arun,
As you suggested:
1) I added the credentials to the resubmit API.
2) I added the isJobDirValid API as well.
3) My patch already uses JobID instead of JobInfo, so no change was required.
Hi Tom,
I added the new test case and fixed the RecoveryManager test case in the latest patch.
I also fixed one more recovery issue that I found in production.
Please review the patch.
Thanks,
Mayank
Mayank - thanks for the changes. Here's my feedback:
- If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?
- In JobTracker you changed "shouldRecover = false;" to "shouldRecover = true;" without updating the comment on the line before. (This might be related to the previous point about not having restart counts.)
- Remove the @Ignore annotation from TestRecoveryManager and the comment about MAPREDUCE-873.
- The new test testJobresubmission (should be testJobResubmission) should test that the job succeeded after the restart. Also, there's no reason to run it as a high-priority job.
- There's a comment saying it is a "faulty job" - which it isn't.
- Have setUp and tearDown methods to start and stop the cluster. At the moment there is code duplication, and clusters won't be shut down cleanly on failure.
- testJobTracker would be better named testJobTrackerRestartsWithMissingJobFile
- testRecoveryManager would be better named testJobTrackerRestartWithBadJobs
- There are multiple typos and formatting errors (including indentation, which should be 2 spaces) in the new code. See Konstantin's comment above.
- TestJobTrackerRestartWithLostTracker still fails, as does TestJobTrackerSafeMode. These should be fixed as a part of this work.
Thanks Tom for your comments. I incorporated everything except the point below:
If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?
You are right that we should clean up the restart count. However, it looks to me like it needs closer inspection and more testing. Do you mind if I open a separate JIRA and work on it separately from this one?
Rest of the comments are incorporated in my latest patch.
Thanks,
Mayank
Attaching latest patch after incorporating Tom's comments.
Thanks,
Mayank
+1 to the latest patch - thanks for addressing my feedback Mayank. Can you run test-patch and the unit tests, if you haven't already, please?
Cleaning up the restart count code in a separate JIRA is fine by me.
Test Patch Results are as follows:
[exec] BUILD SUCCESSFUL
[exec] Total time: 4 minutes 7 seconds
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
I just now completed the commit tests successfully.
I had run all unit tests before attaching the patch, and those also completed successfully.
Thanks,
Mayank
I see this on a single node cluster.
Without this patch, tasks which are re-run fail with:
2012-07-11 05:43:18,299 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201207110542_0001_m_000000_0: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Creation of /tmp/hadoop-acmurthy/mapred/local/userlogs/job_201207110542_0001/attempt_201207110542_0001_m_000000_0 failed.
    at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
    at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
The problem is that mkdirs (at least on Mac OS X) returns false if the directory already exists and wasn't created during the call.
A straightforward patch that checks for existence fixes it.
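An illustrative, self-contained sketch of the existence check (not the committed patch; class and method names are invented for the demo):

```java
import java.io.File;
import java.io.IOException;

// Sketch of the fix: treat mkdirs() returning false as an error only
// when the directory still does not exist, because mkdirs() also
// returns false for a directory that already existed.
public class MkdirsCheck {
    static void createLogDir(File dir) throws IOException {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Creation of " + dir + " failed.");
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"), "mkdirs-check-demo");
        createLogDir(dir); // creates the directory (or it already exists)
        createLogDir(dir); // mkdirs() returns false here, but no exception is thrown
        System.out.println(dir.isDirectory());
    }
}
```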
Looks like this needs a minor update to get it to work on Mac OSX...
Could be any single-node cluster too...
+1 to the fix. FWIW I didn't see this when testing on a single-node cluster (on Mac OS X).
I did not see this either when testing on my single-node cluster on Mac OS X; however, the fix looks good to me.
+1 Thanks Arun.
Thanks,
Mayank
Thanks for the reviews Tom & Mayank. I've just committed the small patch.
It seems that the merge to branch-1.1 on 25/Sep/12, which went into 1.1.0, only included the base fix.
The addendum from Arun was merged to branch-1.1 on 06/Dec/12 and will be part of release 1.1.2.
Hi,
Thanks for this patch; it's very important. In the future, I think using ZK for JT failover would be great.
I need some help applying this patch. I'm using Hadoop 1.0.2 and want to apply it.
I believe 1.0.2 is a descendant of 0.20.2.
So, please let me know whether any of these patches will work for 1.0.2.
Regards,
Manish
PATCH-MAPREDUCE-3837.patch: this one is for the 22 branch. Please review it. Shortly I will post the same for trunk as well.
Thanks,
Mayank