Hadoop Common
  1. Hadoop Common
  2. HADOOP-4759

HADOOP-4654 to be fixed for branches >= 0.19

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Since HADOOP-4654 is fixed only for branch 18.3. This jira looks at the issue reported for 0.19 and above branches

      1. patch-4759-3.2.txt
        119 kB
        Amareshwari Sriramadasu
      2. patch-4759-1-0.20.txt
        116 kB
        Amareshwari Sriramadasu
      3. patch-4759-2-0.19.txt
        115 kB
        Amareshwari Sriramadasu
      4. patch-4759-3.1.txt
        116 kB
        Amareshwari Sriramadasu
      5. patch-4759-0.20.txt
        113 kB
        Amareshwari Sriramadasu
      6. patch-4759-1-0.19.txt
        112 kB
        Amareshwari Sriramadasu
      7. patch-4759-3.txt
        113 kB
        Amareshwari Sriramadasu
      8. patch-4759-0.19.txt
        87 kB
        Amareshwari Sriramadasu
      9. patch-4759-2.txt
        88 kB
        Amareshwari Sriramadasu
      10. patch-4759-1.txt
        80 kB
        Amareshwari Sriramadasu
      11. patch-4759.txt
        58 kB
        Amareshwari Sriramadasu

        Issue Links

          Activity

          Hide
          Amareshwari Sriramadasu added a comment -

          After discussion with Devaraj and Owen, I summarize the approach here:

          • Child.java can have cleanup code in finally block. This will make sure that the cleanup will happen if the failure is because of Exception/Error, this will cover a majority of cases.
          • Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN. JobTracker will launch a separate cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED
          • JT stops launching cleanup tasks for attempts once job succeeds/fails. As Devaraj told, this also means that the job level cleanup task (OutputCommitted.cleanupJob) has run, with the assumption that the job level cleanup has cleaned all garbage up.

          Two approches here:
          1. We can use the same attempt for launching the cleanup. Here, the same attempt will launched with starting state as *_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. If it fails, it will be left in *_UNCLEAN state.
          We would need additional logic for scheduler to handle retries, if needed.

          2. Have a separate tip for doing the cleanup. Associate the cleanup tip with failed/killed attempt, by passing the attempt_id through configuration.
          Once the tip succeeds ( after four retry attempts, by default), it will move the corresponding attempt to FAILED or KILLED. If the tip fails, it will leave the attempt in *_UNCLEAN state.

          Thoughts?

          Show
          Amareshwari Sriramadasu added a comment - After discussion with Devaraj and Owen, I summarize the approach here: Child.java can have cleanup code in finally block. This will make sure that the cleanup will happen if the failure is because of Exception/Error, this will cover a majority of cases. Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN. JobTracker will launch a separate cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED JT stops launching cleanup tasks for attempts once job succeeds/fails. As Devaraj told, this also means that the job level cleanup task (OutputCommitted.cleanupJob) has run, with the assumption that the job level cleanup has cleaned all garbage up. Two approches here: 1. We can use the same attempt for launching the cleanup. Here, the same attempt will launched with starting state as *_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. If it fails, it will be left in *_UNCLEAN state. We would need additional logic for scheduler to handle retries, if needed. 2. Have a separate tip for doing the cleanup. Associate the cleanup tip with failed/killed attempt, by passing the attempt_id through configuration. Once the tip succeeds ( after four retry attempts, by default), it will move the corresponding attempt to FAILED or KILLED. If the tip fails, it will leave the attempt in *_UNCLEAN state. Thoughts?
          Hide
          Owen O'Malley added a comment -

          I'd vote for #1, assuming that you mean KILLED_UNCLEAN and FAILED_UNCLEAN. I think that is a little cleaner.

          Show
          Owen O'Malley added a comment - I'd vote for #1, assuming that you mean KILLED_UNCLEAN and FAILED_UNCLEAN. I think that is a little cleaner.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for trunk

          Show
          Amareshwari Sriramadasu added a comment - Patch for trunk
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch implemented with proposed design.

          In summary patch does,

          • Added cleanup code in finally block of Child.java.
          • Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN.
          • JobTracker will launch the attempt as cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED. The attempt will launched with starting state as FAILED_UNCLEAN/KILLED_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED.
          • There are no retries of cleanup if it fails
          • JT stops launching cleanup tasks for attempts once job succeeds/fails.
          Show
          Amareshwari Sriramadasu added a comment - Patch implemented with proposed design. In summary patch does, Added cleanup code in finally block of Child.java. Any other type of fail or kill of the attempt makes it FAILED_UNCLEAN or KILLED_UNCLEAN. JobTracker will launch the attempt as cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED. The attempt will launched with starting state as FAILED_UNCLEAN/KILLED_UNCLEAN, instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. There are no retries of cleanup if it fails JT stops launching cleanup tasks for attempts once job succeeds/fails.
          Hide
          Amareshwari Sriramadasu added a comment -

          Attaching patch for review.

          Fixed a couple of bugs in earlier patch. This patch has been tested.

          Show
          Amareshwari Sriramadasu added a comment - Attaching patch for review. Fixed a couple of bugs in earlier patch. This patch has been tested.
          Hide
          Amareshwari Sriramadasu added a comment -

          cancelling patch to incorporate offline comments from Devaraj and Sharad.
          Comments include :
          1. move umbilical.statusUpdate() with re-tries to an api in Task.java , and make sure it is called after every setPhase to reflect the change in Phase immediately.
          2. taskTrackerPort should not be -1 for lost tasktracker in JobHistory
          3. distinguish explicitly JobCleanup and TaskCleanup in all the methods.
          4. Add more comments
          5. rename TaskTracker.getLocalTaskOutputDir to TaskTracker.getIntermediateOutputDir
          6. rename TaskStatus.isCleaningup to TaskStatus.inTaskCleanupPhase()
          7. Remove parameter state from taskInProgress constructor
          8. remove unreachable code in statusupdate

          Show
          Amareshwari Sriramadasu added a comment - cancelling patch to incorporate offline comments from Devaraj and Sharad. Comments include : 1. move umbilical.statusUpdate() with re-tries to an api in Task.java , and make sure it is called after every setPhase to reflect the change in Phase immediately. 2. taskTrackerPort should not be -1 for lost tasktracker in JobHistory 3. distinguish explicitly JobCleanup and TaskCleanup in all the methods. 4. Add more comments 5. rename TaskTracker.getLocalTaskOutputDir to TaskTracker.getIntermediateOutputDir 6. rename TaskStatus.isCleaningup to TaskStatus.inTaskCleanupPhase() 7. Remove parameter state from taskInProgress constructor 8. remove unreachable code in statusupdate
          Hide
          Amareshwari Sriramadasu added a comment -

          Attaching patch with review comments incorporated.

          Show
          Amareshwari Sriramadasu added a comment - Attaching patch with review comments incorporated.
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch has no warnings. All core and contrib tests passed on my machine.
          Tested the patch well on big cluster.

          Show
          Amareshwari Sriramadasu added a comment - test-patch has no warnings. All core and contrib tests passed on my machine. Tested the patch well on big cluster.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for branch 0.19

          All core and contrib tests, except TestFileOutputFormat and TestHarFileSystem, pass on branch 0.19. The failures were due to HADOOP-5002

          Show
          Amareshwari Sriramadasu added a comment - Patch for branch 0.19 All core and contrib tests, except TestFileOutputFormat and TestHarFileSystem, pass on branch 0.19. The failures were due to HADOOP-5002
          Hide
          Amareshwari Sriramadasu added a comment -

          patch-4759-2.txt applies to both trunk and 0.20
          All core and contrib tests pass on branch 0.20 as well.

          Show
          Amareshwari Sriramadasu added a comment - patch-4759-2.txt applies to both trunk and 0.20 All core and contrib tests pass on branch 0.20 as well.
          Hide
          Devaraj Das added a comment -

          Comments:
          1. Change the comment to JIP.updateTaskStatus with regard to job completion
          2. Add a comment in TT to say that tasks reports its state as FAILED_UNCLEAN / KILLED_UNCLEAN / COMMIT_PENDING, even though it is running
          user code.
          3. Don't re-use Task objects for the special tasks in the TIP. Instead create new tasks and have a mapping from the taskId to the type of the special task. For example, isCleanupAttempt(taskid) could just look at the mapping in the TIP and return true or false. Similarly for other such queries like isSetupTask(taskid).
          4. swap the ordering of jobcleanup tasks and taskCleanup tasks in JT.getSetupAndCleanupTasks()
          5. getSetupAndCleanupTasks should return multiple tasks in a heartbeat.
          6. Change all method names like runCleanupJob and runCleanupTask to runJobCleanupTask and runTaskCleanupTask.
          7. The new code in TaskTracker.taskFinished looks confusing. Please take a pass at it to make it more readable.

          Show
          Devaraj Das added a comment - Comments: 1. Change the comment to JIP.updateTaskStatus with regard to job completion 2. Add a comment in TT to say that tasks reports its state as FAILED_UNCLEAN / KILLED_UNCLEAN / COMMIT_PENDING, even though it is running user code. 3. Don't re-use Task objects for the special tasks in the TIP. Instead create new tasks and have a mapping from the taskId to the type of the special task. For example, isCleanupAttempt(taskid) could just look at the mapping in the TIP and return true or false. Similarly for other such queries like isSetupTask(taskid). 4. swap the ordering of jobcleanup tasks and taskCleanup tasks in JT.getSetupAndCleanupTasks() 5. getSetupAndCleanupTasks should return multiple tasks in a heartbeat. 6. Change all method names like runCleanupJob and runCleanupTask to runJobCleanupTask and runTaskCleanupTask. 7. The new code in TaskTracker.taskFinished looks confusing. Please take a pass at it to make it more readable.
          Hide
          Devaraj Das added a comment -

          Also, in your patch, the state of the special tasks don't change when the tasks go from their initial state to the running state. For example, a FAILED_UNCLEAN task remains at this state until it is done. It would have been good to have a new state FAILED_UNCLEAN_RUNNING for a FAILED_UNCLEAN task when it is running (at the cost of additional state management and code). But we can defer this for now .. until we do HADOOP-4421.

          Show
          Devaraj Das added a comment - Also, in your patch, the state of the special tasks don't change when the tasks go from their initial state to the running state. For example, a FAILED_UNCLEAN task remains at this state until it is done. It would have been good to have a new state FAILED_UNCLEAN_RUNNING for a FAILED_UNCLEAN task when it is running (at the cost of additional state management and code). But we can defer this for now .. until we do HADOOP-4421 .
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch incorporating the review comments.

          Show
          Amareshwari Sriramadasu added a comment - Patch incorporating the review comments.
          Hide
          dhruba borthakur added a comment -

          It would be really helpful to get this one into 0.19 and trunk.

          Show
          dhruba borthakur added a comment - It would be really helpful to get this one into 0.19 and trunk.
          Hide
          Amareshwari Sriramadasu added a comment -

          Attached patches for 0.19, 0.20 and trunk.

          All unit tets passed on branches 0.19 and 0.20 and also trunk.

          Show
          Amareshwari Sriramadasu added a comment - Attached patches for 0.19, 0.20 and trunk. All unit tets passed on branches 0.19 and 0.20 and also trunk.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12399448/patch-4759-3.1.txt
          against trunk revision 740532.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12399448/patch-4759-3.1.txt against trunk revision 740532. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3797/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          Warnings:
          1. JvmRunner.runChild(JvmManager$JvmEnv) ignores exceptional return value of java.io.File.delete()
          I would like to ignore the return value, it doesn't do anything with the return value.
          2. Unread field: org.apache.hadoop.mapred.JvmManager$JvmManagerForType.tracker :This field is never read. Consider removing it from the class.
          Though tracker reference is not used anywhere in JvmManager, I don't want to delete it from JvmManager, since it could be used later.

          contrib-test failure org.apache.hadoop.chukwa.datacollection.agent.TestAgentConfig.testInitAdaptors_vs_Checkpoint is not related to the patch.

          Show
          Amareshwari Sriramadasu added a comment - -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. Warnings: 1. JvmRunner.runChild(JvmManager$JvmEnv) ignores exceptional return value of java.io.File.delete() I would like to ignore the return value, it doesn't do anything with the return value. 2. Unread field: org.apache.hadoop.mapred.JvmManager$JvmManagerForType.tracker :This field is never read. Consider removing it from the class. Though tracker reference is not used anywhere in JvmManager, I don't want to delete it from JvmManager, since it could be used later. contrib-test failure org.apache.hadoop.chukwa.datacollection.agent.TestAgentConfig.testInitAdaptors_vs_Checkpoint is not related to the patch.
          Hide
          Amareshwari Sriramadasu added a comment -

          Attaching patch after fixing findbugs warnings, also added some comments

          Show
          Amareshwari Sriramadasu added a comment - Attaching patch after fixing findbugs warnings, also added some comments
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch result:

               [exec]
               [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]
          
          Show
          Amareshwari Sriramadasu added a comment - test-patch result: [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12399563/patch-4759-3.2.txt
          against trunk revision 741009.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12399563/patch-4759-3.2.txt against trunk revision 741009. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3801/console This message is automatically generated.
          Hide
          Devaraj Das added a comment -

          I just committed this. Thanks, Amareshwari!

          Show
          Devaraj Das added a comment - I just committed this. Thanks, Amareshwari!
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #756 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/756/ )

            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Amareshwari Sriramadasu
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development