Details
Description
If the JobTracker crashes while jobs are running, and its property mapreduce.jobtracker.restart.recover is set to true, it should recover those jobs on restart.
However, the current behavior is as follows:
The JobTracker tries to restore the jobs but cannot. It then closes its handle to HDFS, after which nobody can submit jobs.
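For context, recovery is controlled by a single flag in mapred-site.xml. This is a sketch using the branch-1 property name that appears later in this thread (the trunk name differs):

```xml
<!-- mapred-site.xml: ask the JobTracker to recover running jobs on restart -->
<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
</property>
```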
Thanks,
Mayank
Attachments
- PATCH-TRUNK-MAPREDUCE-3837.patch
- 1 kB
- Mayank Bansal
- PATCH-MAPREDUCE-3837.patch
- 1 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-4.patch
- 40 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-3.patch
- 31 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-2.patch
- 18 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837-1.patch
- 18 kB
- Mayank Bansal
- PATCH-HADOOP-1-MAPREDUCE-3837.patch
- 18 kB
- Mayank Bansal
- MAPREDUCE-3837_addendum.patch
- 0.6 kB
- Arun Murthy
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12514029/PATCH-TRUNK-MAPREDUCE-3837.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.yarn.util.TestLinuxResourceCalculatorPlugin
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1832//console
This message is automatically generated.
+1 The patch looks good. It enables an important feature of automatic job recovery on JT startup.
Integrated in Hadoop-Hdfs-trunk-Commit #1797 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1797/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Common-0.23-Commit #546 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/546/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-0.23-Commit #534 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/534/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Common-trunk-Commit #1723 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1723/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-0.23-Commit #550 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/550/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = ABORTED
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-trunk-Commit #1734 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1734/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = ABORTED
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-0.23-Build #195 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/195/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-22-branch #100 (See https://builds.apache.org/job/Hadoop-Mapreduce-22-branch/100/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243700)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243700
Files :
- /hadoop/common/branches/branch-0.22/mapreduce/CHANGES.txt
- /hadoop/common/branches/branch-0.22/mapreduce/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-trunk #955 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Hdfs-0.23-Build #168 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)
Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Integrated in Hadoop-Mapreduce-trunk #990 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/)
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)
Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
- /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
- /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java
Attached the patch for branch-1; please review it.
Thanks,
Mayank
Mayank,
- Built branch-1 with your patch
- Configured the cluster and ran a test job: OK
- Configured the mapred-site.xml with 'mapred.jobtracker.restart.recover=true'
- Restarted the JT
- Created an IN data file in my HDFS home dir
- Submitted 5 wordcount jobs
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT0 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT1 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT2 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT3 &
bin/hadoop jar hadoop-*examples*jar wordcount IN OUT4 &
- Waited till they are all running
- Killed the JT
- Restarted the JT
The jobs are not recovered, and what I see in the logs is:
2012-03-02 08:55:22,164 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0001. Deleting it!!
2012-03-02 08:55:22,194 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0002. Deleting it!!
2012-03-02 08:55:22,204 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0003. Deleting it!!
2012-03-02 08:55:22,224 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0004. Deleting it!!
2012-03-02 08:55:22,236 INFO org.apache.hadoop.mapred.JobTracker: Found an incomplete job directory job_201203020852_0005. Deleting it!!
Am I missing some additional configuration?
-1 on committing to branch-1. We've had innumerable issues with this before, not a good idea for a stable branch.
Hi Alejandro
Thanks for your help testing this patch. I am really sorry about the confusion; I missed one function in the patch. I have attached a new patch, tested it, and it works fine in my local environment. I am not sure how I missed that before.
Please let me know if you find any more issues with that.
Arun,
I believe the earlier issues were with recovering jobs from the point at which they crashed. What I am doing here is a very simplistic approach: I read the job token file and resubmit the jobs on crash and recovery. I am not trying to resume from the point the last run left off.
In this scenario it is a fresh run of the job, and it works well. The downside is that the whole job reruns; the upside is that users don't need to resubmit their jobs.
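To make the resubmit-on-restart idea concrete, here is a hedged, self-contained Java sketch. This is not JobTracker code; the directory layout and the job_... naming are assumptions based on the log messages earlier in this thread:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: after a crash, scan the system directory for per-job
// subdirectories left by the previous run and collect their job IDs
// so the jobs can be resubmitted from scratch.
public class RecoveryScan {
    static List<String> jobsToResubmit(File systemDir) {
        List<String> ids = new ArrayList<>();
        File[] entries = systemDir.listFiles();
        if (entries == null) {
            return ids; // system dir missing or unreadable: nothing to recover
        }
        for (File e : entries) {
            // Assumption for this sketch: the directory name is the job ID,
            // e.g. job_201203020852_0001.
            if (e.isDirectory() && e.getName().startsWith("job_")) {
                ids.add(e.getName());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        File sys = new File(System.getProperty("java.io.tmpdir"), "jt-system-dir-demo");
        new File(sys, "job_201203020852_0001").mkdirs();
        new File(sys, "job_201203020852_0002").mkdirs();
        System.out.println(jobsToResubmit(sys).size());
    }
}
```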
Please let me know your thoughts.
Thanks,
Mayank
I've tested the last patch and it works as expected. I agree with Mayank that this approach (rerunning the full job) seems much less risky than the previous one (resuming from where it left off). So I'm good with the patch; it is much better than what is currently in.
Arun, would you reconsider based on the explanation of what Mayank's patch does?
I've been reviewing this patch, and have a couple of cosmetic comments below.
I agree with Alejandro. This is not introducing a new feature; it just enables an already existing one. The risk is low, since the feature is enabled in a restricted context: restarting failed jobs from scratch rather than trying to continue from the point they were terminated.
The patch seems larger than it actually is, because it removes the [troubled] logic responsible for resurrecting a job from its history. Besides that it is simple. Take a look, Arun.
Cosmetic comments
- Several lines are too long
- See several tabs - should be spaces
- indentation is wrong in couple of places
recoveryManager.addJobForRecovery(JobID.forName(fileName));
shouldRecover = true; // enable actual recovery if num-files > 1
- Add spaces after commas in method calls and parameters
Otherwise it looks good.
Apologies for the late response, I missed this.
Thanks for the clarification Mayank, Tucu & Konst. I agree it's much more palatable without all the complexities of trying to recover jobs from point-of-crash.
Couple of questions:
a) How does it work in a secure setting?
b) We should at least add some docs on this feature.
Makes sense?
Thanks Arun for your reply.
a) It reads the user id from the job token stored in the system directory and submits the job as that user, so the actual job runs as that user.
b) Yes, you are right; I will add documentation and include it in the patch.
Thanks,
Mayank
TestRecoveryManager and TestJobTrackerRestartWithLostTracker failed for me with this patch. Mayank - can you update them for this JIRA please?
When I created this patch it did not have this issue. Let me update the patch.
Thanks for finding this out.
Thanks,
Mayank
Arun: I noticed this is listed as one of the patches in HDP. Does that imply that you're removing your -1? Or do you have a new patch that you're shipping in your product that you haven't open-sourced yet?
Hi Todd,
Arun gave a -1 because he was under the impression that I was trying to restore job state; once I explained that it is a resubmit rather than a restore, he was OK with it.
From what Arun told me, the patch in HDP is more or less the same, plus one bug fix that he made.
I will update the patch based on Tom's comments.
Arun, can you also post the bug fix you made?
Thanks,
Mayank
Hi Tom,
I just took the latest 1.1 code base and ran the two test cases you mentioned above without my patch, and they still fail.
Thanks,
Mayank
Mayank - thanks for pointing that out. I just tried and they fail for me on the latest branch-1 code too. We do need tests for job tracker recovery though, so they should be fixed to ensure that the code in this patch is tested and doesn't regress, don't you think?
Mayank, as we briefly discussed you'll need to fix the re-submit to read jobtokens from HDFS and pass them along (i.e. Credentials object) to the submitJob api. Sorry, I've been traveling a lot and missed commenting here, my bad.
Other nits:
- You've removed the call to JobClient.isJobDirValid which is dangerous. Since the contents have changed in hadoop-1 post security, please add a private isJobDirValid method to the JT and use it. This method should check for jobInfo file on HDFS (JobTracker.JOB_INFO_FILE) and the jobTokens file (TokenCache.JOB_TOKEN_HDFS_FILE).
- Also, since we only care about jobIds now for JT recovery, it's better to add a Set<JobId> jobIdsToRecover rather than rely on Set<JobInfo> jobsToRecover. This way we can avoid all the unnecessary translations b/w o.a.h.mapred.JobId and o.a.h.mapreduce.JobId.
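A hedged sketch of the private validity check Arun describes. The file names "job-info" and "jobToken" are placeholders standing in for JobTracker.JOB_INFO_FILE and TokenCache.JOB_TOKEN_HDFS_FILE, and the real check would run against HDFS, not the local filesystem:

```java
import java.io.File;
import java.io.IOException;

// Sketch: a job directory is considered valid for recovery only if
// both the job-info file and the job-tokens file survived the crash.
public class JobDirCheck {
    static boolean isJobDirValid(File jobDir) {
        return new File(jobDir, "job-info").isFile()   // placeholder for JobTracker.JOB_INFO_FILE
            && new File(jobDir, "jobToken").isFile();  // placeholder for TokenCache.JOB_TOKEN_HDFS_FILE
    }

    public static void main(String[] args) throws IOException {
        File jobDir = new File(System.getProperty("java.io.tmpdir"), "job_000000000000_0001");
        jobDir.mkdirs();
        new File(jobDir, "jobToken").delete();         // ensure a clean starting state for the demo
        new File(jobDir, "job-info").createNewFile();
        System.out.println(isJobDirValid(jobDir));     // tokens file still missing
        new File(jobDir, "jobToken").createNewFile();
        System.out.println(isJobDirValid(jobDir));     // both files present now
    }
}
```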
Hi Arun,
As you suggested:
1) I added the credentials to the resubmit API.
2) I added the isJobDirValid API as well.
3) My patch already uses JobID instead of JobInfo, so no change was required.
Hi Tom,
I added the new test case and fixed the RecoveryManager test case in the latest patch.
I also fixed one more recovery issue that I found in production.
Please review the patch.
Thanks,
Mayank
Mayank - thanks for the changes. Here's my feedback:
- If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?
- In JobTracker you changed "shouldRecover = false;" to "shouldRecover = true;" without updating the comment on the line before. (This might be related to the previous point about not having restart counts.)
- Remove the @Ignore annotation from TestRecoveryManager and the comment about MAPREDUCE-873.
- The new test testJobresubmission (should be testJobResubmission) should test that the job succeeded after the restart. Also, there's no reason to run it as a high-priority job.
- There's a comment saying it is a "faulty job" - which it isn't.
- Have setUp and tearDown methods to start and stop the cluster. At the moment there is code duplication, and clusters won't be shut down cleanly on failure.
- testJobTracker would be better named testJobTrackerRestartsWithMissingJobFile
- testRecoveryManager would be better named testJobTrackerRestartWithBadJobs
- There are multiple typos and formatting errors (including indentation, which should be 2 spaces) in the new code. See Konstantin's comment above.
- TestJobTrackerRestartWithLostTracker still fails, as does TestJobTrackerSafeMode. These should be fixed as a part of this work.
Thanks Tom for your comments. I incorporated everything except the point below:
If there is no need for restart count anymore - since jobs are re-run from the beginning each time - then would it be cleaner to remove it entirely?
You are right that we should clean up the restart count. However, it looks to me like it needs closer inspection and more testing. Do you mind if I open a separate JIRA and work on it separately from this one?
Rest of the comments are incorporated in my latest patch.
Thanks,
Mayank
Attaching latest patch after incorporating Tom's comments.
Thanks,
Mayank
+1 to the latest patch - thanks for addressing my feedback Mayank. Can you run test-patch and the unit tests, if you haven't already, please?
Cleaning up the restart count code in a separate JIRA is fine by me.
Test Patch Results are as follows:
[exec] BUILD SUCCESSFUL
[exec] Total time: 4 minutes 7 seconds
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
I just now completed the commit tests successfully.
I had run all unit tests before attaching the patch, and those also completed successfully.
Thanks,
Mayank
I see this on a single node cluster.
Without this patch, tasks which are re-run fail with:
2012-07-11 05:43:18,299 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201207110542_0001_m_000000_0: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Creation of /tmp/hadoop-acmurthy/mapred/local/userlogs/job_201207110542_0001/attempt_201207110542_0001_m_000000_0 failed.
    at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
    at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
The problem is that mkdirs (at least on Mac OS X) returns false if the directory already exists and wasn't created during the call.
A straightforward patch that checks for existence fixes it.
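An illustrative, self-contained sketch of the existence check (not the committed patch; class and method names are invented for the demo):

```java
import java.io.File;
import java.io.IOException;

// Sketch of the fix: treat mkdirs() returning false as an error only
// when the directory still does not exist, because mkdirs() also
// returns false for a directory that already existed.
public class MkdirsCheck {
    static void createLogDir(File dir) throws IOException {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Creation of " + dir + " failed.");
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"), "mkdirs-check-demo");
        createLogDir(dir); // creates the directory (or it already exists)
        createLogDir(dir); // mkdirs() returns false here, but no exception is thrown
        System.out.println(dir.isDirectory());
    }
}
```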
Looks like this needs a minor update to get it to work on Mac OSX...
Could be any single-node cluster too...
+1 to the fix. FWIW I didn't see this when testing on a single-node cluster (on Mac OS X).
I did not see this either when testing on my single-node cluster on Mac OS X; however, the fix looks good to me.
+1 Thanks Arun.
Thanks,
Mayank
Thanks for the reviews Tom & Mayank. I've just committed the small patch.
It seems that the merge to branch-1.1 on 25/Sep/12, which went into 1.1.0, only included the base fix.
The addendum from Arun was merged to branch-1.1 on 06/Dec/12 and will be part of release 1.1.2.
Hi,
Thanks for this patch; it's very important. In the future, I think using ZK for JT failover would be great.
I need some help applying this patch. I'm using Hadoop 1.0.2 and want to apply it.
I believe 1.0.2 is a descendant of 0.20.2.
So, please let me know whether any of these patches will work for 1.0.2.
Regards,
Manish
PATCH-MAPREDUCE-3837.patch: this one is for the 22 branch. Please review it. Shortly I will post the same for trunk as well.
Thanks,
Mayank