|
I'd vote for #1, assuming that you mean KILLED_UNCLEAN and FAILED_UNCLEAN. I think that is a little cleaner.
Patch implemented with proposed design.
In summary patch does,
Attaching patch for review.
Fixed a couple of bugs in earlier patch. This patch has been tested. cancelling patch to incorporate offline comments from Devaraj and Sharad.
Comments include : 1. move umbilical.statusUpdate() with re-tries to an api in Task.java , and make sure it is called after every setPhase to reflect the change in Phase immediately. 2. taskTrackerPort should not be -1 for lost tasktracker in JobHistory 3. distinguish explicitly JobCleanup and TaskCleanup in all the methods. 4. Add more comments 5. rename TaskTracker.getLocalTaskOutputDir to TaskTracker.getIntermediateOutputDir 6. rename TaskStatus.isCleaningup to TaskStatus.inTaskCleanupPhase() 7. Remove parameter state from taskInProgress constructor 8. remove unreachable code in statusupdate Attaching patch with review comments incorporated.
test-patch has no warnings. All core and contrib tests passed on my machine.
Tested the patch well on big cluster. Patch for branch 0.19
All core and contrib tests, except TestFileOutputFormat and TestHarFileSystem, pass on branch 0.19. The failures were due to patch-4759-2.txt applies to both trunk and 0.20
All core and contrib tests pass on branch 0.20 as well. Comments:
1. Change the comment to JIP.updateTaskStatus with regard to job completion 2. Add a comment in TT to say that tasks reports its state as FAILED_UNCLEAN / KILLED_UNCLEAN / COMMIT_PENDING, even though it is running user code. 3. Don't re-use Task objects for the special tasks in the TIP. Instead create new tasks and have a mapping from the taskId to the type of the special task. For example, isCleanupAttempt(taskid) could just look at the mapping in the TIP and return true or false. Similarly for other such queries like isSetupTask(taskid). 4. swap the ordering of jobcleanup tasks and taskCleanup tasks in JT.getSetupAndCleanupTasks() 5. getSetupAndCleanupTasks should return multiple tasks in a heartbeat. 6. Change all method names like runCleanupJob and runCleanupTask to runJobCleanupTask and runTaskCleanupTask. 7. The new code in TaskTracker.taskFinished looks confusing. Please take a pass at it to make it more readable. Also, in your patch, the state of the special tasks don't change when the tasks go from their initial state to the running state. For example, a FAILED_UNCLEAN task remains at this state until it is done. It would have been good to have a new state FAILED_UNCLEAN_RUNNING for a FAILED_UNCLEAN task when it is running (at the cost of additional state management and code). But we can defer this for now .. until we do HADOOP-4421.
Patch incorporating the review comments.
It would be really helpful to get this one into 0.19 and trunk.
Attached patches for 0.19, 0.20 and trunk.
All unit tets passed on branches 0.19 and 0.20 and also trunk. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399448/patch-4759-3.1.txt against trunk revision 740532. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/testReport/ This message is automatically generated.
Warnings: contrib-test failure org.apache.hadoop.chukwa.datacollection.agent.TestAgentConfig.testInitAdaptors_vs_Checkpoint is not related to the patch. Attaching patch after fixing findbugs warnings, also added some comments
test-patch result:
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399563/patch-4759-3.2.txt against trunk revision 741009. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/testReport/ This message is automatically generated. I just committed this. Thanks, Amareshwari!
Integrated in Hadoop-trunk #756 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/756/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Two approches here:
1. We can use the same attempt for launching the cleanup. Here, the same attempt will launched with starting state as *_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. If it fails, it will be left in *_UNCLEAN state.
We would need additional logic for scheduler to handle retries, if needed.
2. Have a separate tip for doing the cleanup. Associate the cleanup tip with failed/killed attempt, by passing the attempt_id through configuration.
Once the tip succeeds ( after four retry attempts, by default), it will move the corresponding attempt to FAILED or KILLED. If the tip fails, it will leave the attempt in *_UNCLEAN state.
Thoughts?