Issue Details (XML | Word | Printable)

Key: HADOOP-4759
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Amareshwari Sriramadasu
Reporter: Amareshwari Sriramadasu
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

HADOOP-4654 to be fixed for branches >= 0.19

Created: 03/Dec/08 07:03 AM   Updated: 08/Jul/09 04:53 PM
Return to search
Component/s: None
Affects Version/s: 0.19.0
Fix Version/s: 0.19.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works patch-4759-0.19.txt 2009-01-15 07:52 AM Amareshwari Sriramadasu 87 kB
Text File Licensed for inclusion in ASF works patch-4759-0.20.txt 2009-02-04 11:19 AM Amareshwari Sriramadasu 113 kB
Text File Licensed for inclusion in ASF works patch-4759-1-0.19.txt 2009-02-04 11:18 AM Amareshwari Sriramadasu 112 kB
Text File Licensed for inclusion in ASF works patch-4759-1-0.20.txt 2009-02-05 12:35 PM Amareshwari Sriramadasu 116 kB
Text File Licensed for inclusion in ASF works patch-4759-1.txt 2008-12-31 11:45 AM Amareshwari Sriramadasu 80 kB
Text File Licensed for inclusion in ASF works patch-4759-2-0.19.txt 2009-02-05 12:34 PM Amareshwari Sriramadasu 115 kB
Text File Licensed for inclusion in ASF works patch-4759-2.txt 2009-01-13 10:13 AM Amareshwari Sriramadasu 88 kB
Text File Licensed for inclusion in ASF works patch-4759-3.1.txt 2009-02-04 11:27 AM Amareshwari Sriramadasu 116 kB
Text File Licensed for inclusion in ASF works patch-4759-3.2.txt 2009-02-05 12:36 PM Amareshwari Sriramadasu 119 kB
Text File Licensed for inclusion in ASF works patch-4759-3.txt 2009-02-03 02:02 PM Amareshwari Sriramadasu 113 kB
Text File Licensed for inclusion in ASF works patch-4759.txt 2008-12-24 11:55 AM Amareshwari Sriramadasu 58 kB
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 05/Feb/09 05:44 PM


 Description  « Hide
Since HADOOP-4654 is fixed only for branch 18.3. This jira looks at the issue reported for 0.19 and above branches

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Amareshwari Sriramadasu added a comment - 10/Dec/08 11:58 AM
After discussion with Devaraj and Owen, I summarize the approach here:
  • Child.java can have cleanup code in finally block. This will make sure that the cleanup will happen if the failure is because of Exception/Error, this will cover a majority of cases.
  • Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN. JobTracker will launch a separate cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED
  • JT stops launching cleanup tasks for attempts once job succeeds/fails. As Devaraj told, this also means that the job level cleanup task (OutputCommitted.cleanupJob) has run, with the assumption that the job level cleanup has cleaned all garbage up.

Two approches here:
1. We can use the same attempt for launching the cleanup. Here, the same attempt will launched with starting state as *_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. If it fails, it will be left in *_UNCLEAN state.
We would need additional logic for scheduler to handle retries, if needed.

2. Have a separate tip for doing the cleanup. Associate the cleanup tip with failed/killed attempt, by passing the attempt_id through configuration.
Once the tip succeeds ( after four retry attempts, by default), it will move the corresponding attempt to FAILED or KILLED. If the tip fails, it will leave the attempt in *_UNCLEAN state.

Thoughts?


Owen O'Malley added a comment - 11/Dec/08 03:33 AM
I'd vote for #1, assuming that you mean KILLED_UNCLEAN and FAILED_UNCLEAN. I think that is a little cleaner.

Amareshwari Sriramadasu added a comment - 24/Dec/08 11:55 AM
Patch for trunk

Amareshwari Sriramadasu added a comment - 27/Dec/08 04:58 AM
Patch implemented with proposed design.

In summary patch does,

  • Added cleanup code in finally block of Child.java.
  • Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN.
  • JobTracker will launch the attempt as cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED. The attempt will launched with starting state as FAILED_UNCLEAN/KILLED_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED.
  • There are no retries of cleanup if it fails
  • JT stops launching cleanup tasks for attempts once job succeeds/fails.

Amareshwari Sriramadasu added a comment - 31/Dec/08 11:44 AM
Attaching patch for review.

Fixed a couple of bugs in earlier patch. This patch has been tested.


Amareshwari Sriramadasu added a comment - 09/Jan/09 03:57 AM
cancelling patch to incorporate offline comments from Devaraj and Sharad.
Comments include :
1. move umbilical.statusUpdate() with re-tries to an api in Task.java , and make sure it is called after every setPhase to reflect the change in Phase immediately.
2. taskTrackerPort should not be -1 for lost tasktracker in JobHistory
3. distinguish explicitly JobCleanup and TaskCleanup in all the methods.
4. Add more comments
5. rename TaskTracker.getLocalTaskOutputDir to TaskTracker.getIntermediateOutputDir
6. rename TaskStatus.isCleaningup to TaskStatus.inTaskCleanupPhase()
7. Remove parameter state from taskInProgress constructor
8. remove unreachable code in statusupdate

Amareshwari Sriramadasu added a comment - 13/Jan/09 10:13 AM
Attaching patch with review comments incorporated.

Amareshwari Sriramadasu added a comment - 13/Jan/09 10:15 AM
test-patch has no warnings. All core and contrib tests passed on my machine.
Tested the patch well on big cluster.

Amareshwari Sriramadasu added a comment - 15/Jan/09 07:52 AM
Patch for branch 0.19

All core and contrib tests, except TestFileOutputFormat and TestHarFileSystem, pass on branch 0.19. The failures were due to HADOOP-5002


Amareshwari Sriramadasu added a comment - 15/Jan/09 07:53 AM
patch-4759-2.txt applies to both trunk and 0.20
All core and contrib tests pass on branch 0.20 as well.

Devaraj Das added a comment - 31/Jan/09 04:21 AM
Comments:
1. Change the comment to JIP.updateTaskStatus with regard to job completion
2. Add a comment in TT to say that tasks reports its state as FAILED_UNCLEAN / KILLED_UNCLEAN / COMMIT_PENDING, even though it is running
user code.
3. Don't re-use Task objects for the special tasks in the TIP. Instead create new tasks and have a mapping from the taskId to the type of the special task. For example, isCleanupAttempt(taskid) could just look at the mapping in the TIP and return true or false. Similarly for other such queries like isSetupTask(taskid).
4. swap the ordering of jobcleanup tasks and taskCleanup tasks in JT.getSetupAndCleanupTasks()
5. getSetupAndCleanupTasks should return multiple tasks in a heartbeat.
6. Change all method names like runCleanupJob and runCleanupTask to runJobCleanupTask and runTaskCleanupTask.
7. The new code in TaskTracker.taskFinished looks confusing. Please take a pass at it to make it more readable.

Devaraj Das added a comment - 31/Jan/09 04:33 AM
Also, in your patch, the state of the special tasks don't change when the tasks go from their initial state to the running state. For example, a FAILED_UNCLEAN task remains at this state until it is done. It would have been good to have a new state FAILED_UNCLEAN_RUNNING for a FAILED_UNCLEAN task when it is running (at the cost of additional state management and code). But we can defer this for now .. until we do HADOOP-4421.

Amareshwari Sriramadasu added a comment - 03/Feb/09 02:02 PM
Patch incorporating the review comments.

dhruba borthakur added a comment - 03/Feb/09 07:24 PM
It would be really helpful to get this one into 0.19 and trunk.

Amareshwari Sriramadasu added a comment - 04/Feb/09 11:27 AM
Attached patches for 0.19, 0.20 and trunk.

All unit tets passed on branches 0.19 and 0.20 and also trunk.


Hadoop QA added a comment - 04/Feb/09 08:02 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399448/patch-4759-3.1.txt
against trunk revision 740532.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/console

This message is automatically generated.


Amareshwari Sriramadasu added a comment - 05/Feb/09 03:45 AM

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

Warnings:
1. JvmRunner.runChild(JvmManager$JvmEnv) ignores exceptional return value of java.io.File.delete()
I would like to ignore the return value, it doesn't do anything with the return value.
2. Unread field: org.apache.hadoop.mapred.JvmManager$JvmManagerForType.tracker :This field is never read. Consider removing it from the class.
Though tracker reference is not used anywhere in JvmManager, I don't want to delete it from JvmManager, since it could be used later.

contrib-test failure org.apache.hadoop.chukwa.datacollection.agent.TestAgentConfig.testInitAdaptors_vs_Checkpoint is not related to the patch.


Amareshwari Sriramadasu added a comment - 05/Feb/09 12:36 PM
Attaching patch after fixing findbugs warnings, also added some comments

Amareshwari Sriramadasu added a comment - 05/Feb/09 12:37 PM
test-patch result:
     [exec]
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
     [exec]

Hadoop QA added a comment - 05/Feb/09 04:36 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399563/patch-4759-3.2.txt
against trunk revision 741009.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/console

This message is automatically generated.


Devaraj Das added a comment - 05/Feb/09 05:44 PM
I just committed this. Thanks, Amareshwari!

Hudson added a comment - 16/Feb/09 05:00 PM