Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-2207

Task-cleanup task should not be scheduled on the node that the task just failed

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: jobtracker
    • Labels:
      None
    • Release Note:
      Task-cleanup task should not be scheduled on the node that the task just failed

      Description

      Currently the task-cleanup task always go to the same node that the task just failed.
      There is a higher chance that it hits a bad node. This should be changed.

      1. ant-test.txt
        51 kB
        Liyin Liang
      2. 2207-3.diff
        4 kB
        Liyin Liang
      3. 2207-3.diff
        4 kB
        Liyin Liang
      4. 2207-2.diff
        4 kB
        Liyin Liang
      5. 2207-1.diff
        2 kB
        Liyin Liang
      6. 0.19.1.diff
        1 kB
        Liyin Liang

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/ )
          Hide
          Todd Lipcon added a comment -

          Hey guys. I noted this issue is marked in CHANGES.txt for 0.22, but it was only committed to trunk. Did you mean to commit it to the branch as a bug fix?

          Show
          Todd Lipcon added a comment - Hey guys. I noted this issue is marked in CHANGES.txt for 0.22, but it was only committed to trunk. Did you mean to commit it to the branch as a bug fix?
          Hide
          Nigel Daley added a comment -

          When test-patch is run locally (by a developer) it doesn't actually run the tests (as you can see from the output). You still need to run the test target.

          Show
          Nigel Daley added a comment - When test-patch is run locally (by a developer) it doesn't actually run the tests (as you can see from the output). You still need to run the test target.
          Hide
          Liyin Liang added a comment -

          Hi Todd,
          The patch committed is absolutely the same one i ran tests. Maybe I made some mistakes when ran "ant test". I'll work on MAPREDUCE-2271 to fix TestSetupTaskScheduling. By the way, I don't understand why the result of "ant test-patch" is +1.

          Show
          Liyin Liang added a comment - Hi Todd, The patch committed is absolutely the same one i ran tests. Maybe I made some mistakes when ran "ant test". I'll work on MAPREDUCE-2271 to fix TestSetupTaskScheduling. By the way, I don't understand why the result of "ant test-patch" is +1.
          Hide
          Todd Lipcon added a comment -

          It seems that TestSetupTaskScheduling is failing in trunk builds since this was committed. It fails for me on trunk and passes again if I revert this change locally.

          Are you sure the patch committed is the same one you ran tests on when you uploaded ant-test.txt? (I verified that the one that got committed is indeed the newest attached to this JIRA)

          Show
          Todd Lipcon added a comment - It seems that TestSetupTaskScheduling is failing in trunk builds since this was committed. It fails for me on trunk and passes again if I revert this change locally. Are you sure the patch committed is the same one you ran tests on when you uploaded ant-test.txt? (I verified that the one that got committed is indeed the newest attached to this JIRA)
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #575 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/575/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #575 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/575/ )
          Hide
          Scott Chen added a comment -

          I just committed this. Thanks Liyin.

          Show
          Scott Chen added a comment - I just committed this. Thanks Liyin.
          Hide
          Liyin Liang added a comment -

          Hi Scott,
          The ant-test.txt file is the result of "ant test". The result of "ant test-patch" is as follows:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.

          Show
          Liyin Liang added a comment - Hi Scott, The ant-test.txt file is the result of "ant test". The result of "ant test-patch" is as follows: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile.
          Hide
          Scott Chen added a comment -

          Hey Liyin,
          I guess Hudson is broken as always.
          Is it possible that you can run "ant test" and "ant test-patch" and attach the test result here?

          Show
          Scott Chen added a comment - Hey Liyin, I guess Hudson is broken as always. Is it possible that you can run "ant test" and "ant test-patch" and attach the test result here?
          Hide
          Scott Chen added a comment -

          +1 Thanks for the change.
          I will commit this once the Hudson result is OK.

          Show
          Scott Chen added a comment - +1 Thanks for the change. I will commit this once the Hudson result is OK.
          Hide
          Liyin Liang added a comment -

          Hi Scott,
          I agree with you. According to MAPREDUCE-2118, maybe getJobSetupAndCleanupTasks will not hold JT lock in the future.

          I have changed the name of the method hasFailedAndNeedCleanupTask() to hasFailedUncleanTask(). Thanks for your advice.

          Show
          Liyin Liang added a comment - Hi Scott, I agree with you. According to MAPREDUCE-2118 , maybe getJobSetupAndCleanupTasks will not hold JT lock in the future. I have changed the name of the method hasFailedAndNeedCleanupTask() to hasFailedUncleanTask(). Thanks for your advice.
          Hide
          Scott Chen added a comment -

          Hi Liyin,
          Sorry for the late reply. I think it is OK to call hasFailedAndNeedCleanupTask(). Compared to other things we do in a heartbeat, this one should be OK.
          What do you think?

          The last patch looks good to me. I just have a nitpick.
          Can we change the name of the method hasFailedAndNeedCleanupTask() to hasFailedUncleanTask() ?

          Show
          Scott Chen added a comment - Hi Liyin, Sorry for the late reply. I think it is OK to call hasFailedAndNeedCleanupTask(). Compared to other things we do in a heartbeat, this one should be OK. What do you think? The last patch looks good to me. I just have a nitpick. Can we change the name of the method hasFailedAndNeedCleanupTask() to hasFailedUncleanTask() ?
          Hide
          Liyin Liang added a comment -

          move the logic to server side according to Scott's comment.

          Show
          Liyin Liang added a comment - move the logic to server side according to Scott's comment.
          Hide
          Liyin Liang added a comment -

          Hi Scott,
          If we move this logic to server side, every heartbeat has to call hasFailedAndNeedCleanupTaskToReport() inside the lock of JobTracker. Is there be a performance loss of jt?

          Show
          Liyin Liang added a comment - Hi Scott, If we move this logic to server side, every heartbeat has to call hasFailedAndNeedCleanupTaskToReport() inside the lock of JobTracker. Is there be a performance loss of jt?
          Hide
          Scott Chen added a comment -

          Hi Liyin,
          Thanks for the work. Taking a second look at this one, I think it may be better to move this logic to server side.
          We can call hasFailedAndNeedCleanupTaskToReport() in JobTracker.getSetupAndCleanupTasks().
          This way the TT can still get other tasks from taskScheduler.assignTasks() and immediately run tasks without waiting the next heartbeat.
          What do you think?

          Show
          Scott Chen added a comment - Hi Liyin, Thanks for the work. Taking a second look at this one, I think it may be better to move this logic to server side. We can call hasFailedAndNeedCleanupTaskToReport() in JobTracker.getSetupAndCleanupTasks(). This way the TT can still get other tasks from taskScheduler.assignTasks() and immediately run tasks without waiting the next heartbeat. What do you think?
          Hide
          Liyin Liang added a comment -

          Patch with unit test for trunk. The patch just added a assert on TestTaskFail.java to test the feature.

          Show
          Liyin Liang added a comment - Patch with unit test for trunk. The patch just added a assert on TestTaskFail.java to test the feature.
          Hide
          Scott Chen added a comment -

          Liyin: That's great. I will review your patch when you post it here.

          Show
          Scott Chen added a comment - Liyin: That's great. I will review your patch when you post it here.
          Hide
          Liyin Liang added a comment -

          Hi Scott,
          I'm happy to work on this JIRA and provide a patch with unit test for trunk.

          Show
          Liyin Liang added a comment - Hi Scott, I'm happy to work on this JIRA and provide a patch with unit test for trunk.
          Hide
          Scott Chen added a comment -

          Liyin: Cool. The approach looks good. We can also do the same thing for Task-Cleanup task.
          Would you mind working on this JIRA? Maybe provide a patch for trunk and add a unit test?

          Show
          Scott Chen added a comment - Liyin: Cool. The approach looks good. We can also do the same thing for Task-Cleanup task. Would you mind working on this JIRA? Maybe provide a patch for trunk and add a unit test?
          Hide
          Liyin Liang added a comment -

          For hadoop 0.19.1

          Show
          Liyin Liang added a comment - For hadoop 0.19.1
          Hide
          Liyin Liang added a comment -

          Hi Scott,
          Our product cluster met a similar problem about job setup-task. We let TT don't ask for new task when report a failed job setup/cleanup task in heartbeat to fix this issue. I'll attach our patch based on 0.19.1.

          Show
          Liyin Liang added a comment - Hi Scott, Our product cluster met a similar problem about job setup-task. We let TT don't ask for new task when report a failed job setup/cleanup task in heartbeat to fix this issue. I'll attach our patch based on 0.19.1.

            People

            • Assignee:
              Liyin Liang
              Reporter:
              Scott Chen
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development