Hadoop Map/Reduce / MAPREDUCE-1028

Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Makes taskCleanup tasks use 1 slot even for high memory jobs.

      Description

      A cleanup task is launched for a failed task of a job. This task is created based on the TIP of the failed task, and so is marked as requiring as many slots to run as the original task itself. For instance, if a high RAM job requires 2 slots per task, a cleanup task of that high RAM job requires 2 slots as well.

      Further, a cleanup task is scheduled to a tasktracker by the jobtracker itself and not the scheduler. While doing so, the JT doesn't check if the TT has enough slots free to run a high RAM cleanup task; it always assumes 1 slot is enough. Thus, the TT is oversubscribed.

      However, on the TT, before launch, we check that the task can actually run, and wait for that many slots to become available. If the slots don't get freed quickly, we will have tasks stuck in an unassigned state.
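
      The mismatch described above can be illustrated with a small standalone model. This is only a sketch, not Hadoop code: the class CleanupSlotMismatchSketch and its methods jobTrackerWouldAssign and taskTrackerCanLaunch are hypothetical names, and the slot numbers are made up for illustration.

```java
// Hypothetical model of the mismatch; none of these names exist in Hadoop.
final class CleanupSlotMismatchSketch {

    /** JobTracker-side check as described: one free slot is assumed to be enough. */
    static boolean jobTrackerWouldAssign(int freeSlotsOnTracker) {
        return freeSlotsOnTracker >= 1;
    }

    /** TaskTracker-side check as described: wait for every slot the task claims to need. */
    static boolean taskTrackerCanLaunch(int freeSlotsOnTracker, int slotsRequiredByTask) {
        return freeSlotsOnTracker >= slotsRequiredByTask;
    }

    public static void main(String[] args) {
        int freeSlots = 1;         // assumed: a single free slot on the tracker
        int cleanupTaskSlots = 2;  // cleanup task inherits 2 slots from a high RAM job

        boolean assigned = jobTrackerWouldAssign(freeSlots);
        boolean launchable = taskTrackerCanLaunch(freeSlots, cleanupTaskSlots);

        // Assigned but not launchable: the task waits in the unassigned state.
        System.out.println("assigned=" + assigned + ", launchable=" + launchable);
    }
}
```

      With one free slot, the model assigns the cleanup task but cannot launch it until a second slot is freed, which is the stuck state the description refers to.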

      1. MR-1028.patch
        1 kB
        Ravi Gummadi
      2. MR-1028.v1.patch
        12 kB
        Ravi Gummadi
      3. MR-1028.v1.1.patch
        13 kB
        Ravi Gummadi
      4. yhadoop-0.20-MR1028.patch
        1 kB
        Jothi Padmanabhan

        Activity

        Hemanth Yamijala created issue -
        Arun C Murthy added a comment -

        So, I assume the fix is to ensure that the cleanup task uses only 1 slot?

        Devaraj Das added a comment -

        I think that should be it. That would also lead to non-wastage of slots.

        Arun C Murthy added a comment -

        On second thoughts we need to be a little careful here... if we are re-using the existing JVM we should not bother with manipulating the #slots occupied by the JVM, OTOH if we need a new JVM we should be using just 1. Right?

        Arun C Murthy added a comment -

        On second thoughts [...]

        To clarify, my comment was pointing out scenarios where jvm-reuse is on and we use the JVM of a successful task to run the cleanup task of an unrelated failed task.

        Devaraj Das added a comment -

        Hmm.. It should be okay to have a JVM occupying a higher number of slots run a task that requires fewer slots. However, we need to fix TaskTracker.TaskInProgress.releaseSlot. I am thinking that it might make sense to keep track of the slot count per JVM (long term, we should anyway be monitoring the resources being used by the JVM per se). Today, in releaseSlot, we release #slots equal to the number of slots that the task took to run. Instead it could just decrement the slot count by the number of slots the JVM took to run the task. Also, when the task is assigned to the TT, the JobInProgress.{obtainTaskCleanupTask,obtainJobCleanupTask,obtainJobSetupTask} methods should specifically set the #slots required to 1 (today that's the only way to let the TT know that the task would require 1 slot).

        The other option is to have the JobTracker be aware of slot counts for the special tasks. Since the special tasks are scheduled directly by the JobTracker, that would be required to be done.
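
        One possible reading of the per-JVM slot count idea in this comment is sketched below. It is an illustration only and was not part of any attached patch: PerJvmSlotLedgerSketch, launchJvm, canRun, releaseJvm and freeSlots are hypothetical names, and the real TaskTracker bookkeeping is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ledger that tracks slots per JVM rather than per task.
final class PerJvmSlotLedgerSketch {
    private final int totalSlots;
    private int usedSlots;
    private final Map<String, Integer> slotsByJvm = new HashMap<>();

    PerJvmSlotLedgerSketch(int totalSlots) {
        this.totalSlots = totalSlots;
    }

    /** Records a new JVM holding the given number of slots, if they are free. */
    boolean launchJvm(String jvmId, int slots) {
        if (slots < 1 || slotsByJvm.containsKey(jvmId) || usedSlots + slots > totalSlots) {
            return false;
        }
        slotsByJvm.put(jvmId, slots);
        usedSlots += slots;
        return true;
    }

    /** A JVM may run any task that needs no more slots than the JVM already holds. */
    boolean canRun(String jvmId, int slotsRequiredByTask) {
        Integer held = slotsByJvm.get(jvmId);
        return held != null && slotsRequiredByTask <= held;
    }

    /** Releases what the JVM holds, whatever the last task's own requirement was. */
    void releaseJvm(String jvmId) {
        Integer held = slotsByJvm.remove(jvmId);
        if (held != null) {
            usedSlots -= held;
        }
    }

    int freeSlots() {
        return totalSlots - usedSlots;
    }

    public static void main(String[] args) {
        PerJvmSlotLedgerSketch ledger = new PerJvmSlotLedgerSketch(2);
        ledger.launchJvm("jvm_1", 2);  // JVM started for a 2-slot task
        System.out.println("can run 1-slot cleanup: " + ledger.canRun("jvm_1", 1));
        ledger.releaseJvm("jvm_1");
        System.out.println("free slots after release: " + ledger.freeSlots());
    }
}
```

        Releasing by the JVM's slot count keeps the ledger balanced even when a reused JVM last ran a task with a smaller requirement.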

        Devaraj Das added a comment -

        After some thought, it seems like decrementing the slot count on a per task-used-slot count basis is harmless. So, for now, let's just ensure that all special tasks (job-setup, task-cleanup and job-cleanup) take exactly one slot. I couldn't come up with a counter-example where this would lead to inconsistencies in the slot counts on the TT, or would lead to fewer/more tasks being launched than there should be as per the slot count and the #slots required by tasks scheduled on that TT.
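
        The approach settled on here (special tasks take one slot, and slots are acquired and released per task) can be sketched as follows. This is a minimal illustration under assumed names, not the code from the patch: TaskKind, slotsRequired and SlotCounter are hypothetical.

```java
// Illustrative only: TaskKind, slotsRequired and SlotCounter are hypothetical names.
final class SpecialTaskSlotsSketch {

    enum TaskKind { MAP, REDUCE, JOB_SETUP, JOB_CLEANUP, TASK_CLEANUP }

    /** Special tasks always need one slot; regular tasks use the job's slot size. */
    static int slotsRequired(TaskKind kind, int slotsPerTaskForJob) {
        switch (kind) {
            case JOB_SETUP:
            case JOB_CLEANUP:
            case TASK_CLEANUP:
                return 1;
            default:
                return slotsPerTaskForJob;
        }
    }

    /** Per-task accounting: each task takes and returns exactly its own slot count. */
    static final class SlotCounter {
        private int freeSlots;

        SlotCounter(int totalSlots) {
            this.freeSlots = totalSlots;
        }

        boolean acquire(int slots) {
            if (slots > freeSlots) {
                return false;
            }
            freeSlots -= slots;
            return true;
        }

        void release(int slots) {
            freeSlots += slots;
        }

        int free() {
            return freeSlots;
        }
    }

    public static void main(String[] args) {
        SlotCounter tracker = new SlotCounter(2);
        int highRam = slotsRequired(TaskKind.MAP, 2);
        int cleanup = slotsRequired(TaskKind.TASK_CLEANUP, 2);

        tracker.acquire(highRam);  // high RAM map task takes both slots
        tracker.release(highRam);  // the task fails and returns its slots
        tracker.acquire(cleanup);  // its cleanup task takes a single slot
        System.out.println("free while cleanup runs: " + tracker.free());
        tracker.release(cleanup);
        System.out.println("free after cleanup: " + tracker.free());
    }
}
```

        Because every task releases exactly what it acquired, the free-slot count returns to its starting value regardless of the order in which regular and special tasks run.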

        Ravi Gummadi added a comment -

        Attaching patch with the fix. Writing testcase is in progress.

        Ravi Gummadi made changes -
        Field Original Value New Value
        Attachment MR-1028.patch [ 12420466 ]
        Devaraj Das added a comment -

        The changes in the patch look fine. This should work with JVM reuse set to ON too.

        Ravi Gummadi added a comment -

        Attaching patch with the testcase.
        Please review and provide your comments.

        Ravi Gummadi made changes -
        Attachment MR-1028.v1.patch [ 12420542 ]
        Ravi Gummadi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Makes taskCleanup tasks to use 1 slot even for high memory jobs.
        Jothi Padmanabhan added a comment -

        Looks good. Some minor nits.

        1. Change the name of isMap argument in TestSetupTaskScheduling.addNewTaskStatus
        2. Add the error strings to asserts
        3. Minor typo in one of the comments (should read reduce instead of map) in testNumSlotsUsedForTaskCleanup
        Jothi Padmanabhan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Ravi Gummadi added a comment -

        Attaching new patch with the suggested changes.

        Ravi Gummadi made changes -
        Attachment MR-1028.v1.1.patch [ 12420550 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420542/MR-1028.v1.patch
        against trunk revision 818674.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h1.grid.sp2.yahoo.net/2/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h1.grid.sp2.yahoo.net/2/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h1.grid.sp2.yahoo.net/2/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h1.grid.sp2.yahoo.net/2/console

        This message is automatically generated.

        Ravi Gummadi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Ravi Gummadi added a comment -

        Unit tests passed on my local machine with the latest patch.

        Jothi Padmanabhan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Jothi Padmanabhan added a comment -

        For some reason, Hudson did not pick this up. Retrying.

        Jothi Padmanabhan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420550/MR-1028.v1.1.patch
        against trunk revision 818830.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/133/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/133/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/133/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/133/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        +1. Patch looks fine to me.
        Also, the failed test, TestCopyFiles, is a known issue.

        Nigel Daley added a comment -

        Why is taskStatuses now protected? You've now made it part of the public API.

        + protected TreeMap<TaskAttemptID,TaskStatus> taskStatuses =

        It should be package-private, no? That should still enable the unit test to inspect it.

        Nigel Daley added a comment -

        Ok, this class is package-private so I guess this is moot. Still, package-private is always better than protected if at all possible.
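
        The visibility point can be shown with a small example. It is a sketch only: TrackerStateSketch and TrackerStateSketchCheck are hypothetical stand-ins, and the String keys and values are placeholders for the TaskAttemptID and TaskStatus types in the real field.

```java
import java.util.TreeMap;

// Hypothetical stand-in for a package-private class holding task statuses.
class TrackerStateSketch {
    // No modifier: visible to every class in the same package, including tests,
    // but not to subclasses or other callers outside the package.
    TreeMap<String, String> taskStatuses = new TreeMap<>();
}

// Hypothetical test helper living in the same package.
final class TrackerStateSketchCheck {
    static int statusCount(TrackerStateSketch state) {
        return state.taskStatuses.size();
    }

    public static void main(String[] args) {
        TrackerStateSketch state = new TrackerStateSketch();
        state.taskStatuses.put("attempt_0", "RUNNING");  // placeholder entry
        System.out.println(statusCount(state));
    }
}
```

        A test in the same package can inspect the field without widening it to protected.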

        Jothi Padmanabhan added a comment -

        Patch for the Y-Hadoop distribution

        Jothi Padmanabhan made changes -
        Attachment yhadoop-0.20-MR1028.patch [ 12420581 ]
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #66 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/66/).
        Fixed number of slots occupied by cleanup tasks to one irrespective of slot size for the job. Contributed by Ravi Gummadi.

        Hemanth Yamijala added a comment -

        Ok, this class is package-private so I guess this is moot. Still, package-private is always better than protected if at all possible.

        Filed MAPREDUCE-1041 to fix this.

        Hemanth Yamijala added a comment -

        I just committed this to trunk and the Hadoop 0.21 branch. Thanks, Ravi!

        Hemanth Yamijala made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420550/MR-1028.v1.1.patch
        against trunk revision 818830.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/134/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/134/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/134/console

        This message is automatically generated.

        Arun C Murthy added a comment -

        The yahoo-20 patch is missing test-cases...

        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                    Time In Source Status   Execution Times   Last Executer       Last Execution Date
        Patch Available -> Open       3h 37m                  2                 Jothi Padmanabhan   25/Sep/09 14:57
        Open -> Patch Available       1d 23h 36m              3                 Jothi Padmanabhan   25/Sep/09 14:58
        Patch Available -> Resolved   4h 28m                  1                 Hemanth Yamijala    25/Sep/09 19:26
        Resolved -> Closed            333d 2h 51m             1                 Tom White           24/Aug/10 22:18

          People

          • Assignee:
            Ravi Gummadi
            Reporter:
            Hemanth Yamijala
          • Votes:
            0
            Watchers:
            4
