Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-771

Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When a high RAM job's task is scheduled on a tasktracker, the number of slots it occupies will be more than one. If enough such tasks are scheduled, there can come a situation when there are no more slots free on the node but the number of tasks running is less than the number of slots. The jobtracker currently schedules a setup or cleanup task based on how many tasks are running on the system, rather than slots. As a result of this, it can schedule a setup or cleanup task on a tasktracker without any free slots. If the high RAM job's tasks are long running, this will significantly delay the running of the setup or cleanup task, and thus the entire job.

      1. MAPREDUCE-771-20.patch
        1 kB
        Hemanth Yamijala
      2. MAPREDUCE-771.patch
        10 kB
        Hemanth Yamijala
      3. MAPREDUCE-771.patch
        10 kB
        Hemanth Yamijala

        Activity

        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12413973/MAPREDUCE-771.patch
        against trunk revision 795489.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12413973/MAPREDUCE-771.patch against trunk revision 795489. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/414/console This message is automatically generated.
        Hide
        Hemanth Yamijala added a comment -

        Patch (MAPREDUCE-771-20.patch) against the Yahoo! Hadoop distribution - not for committing.

        Show
        Hemanth Yamijala added a comment - Patch ( MAPREDUCE-771 -20.patch) against the Yahoo! Hadoop distribution - not for committing.
        Hide
        Hemanth Yamijala added a comment -

        I just committed this to trunk.

        Show
        Hemanth Yamijala added a comment - I just committed this to trunk.
        Hide
        Hemanth Yamijala added a comment -

        Results of ant test and test-contrib:

        All tests except TestKillSubProcesses passed. Checking the logs I found it is the same as the one documented in MAPREDUCE-408.

        I've also run functional tests on the Yahoo! distribution of Hadoop (essentially same source patch), and verified that the fix is fine. Will run a couple of more functional tests and commit.

        Show
        Hemanth Yamijala added a comment - Results of ant test and test-contrib: All tests except TestKillSubProcesses passed. Checking the logs I found it is the same as the one documented in MAPREDUCE-408 . I've also run functional tests on the Yahoo! distribution of Hadoop (essentially same source patch), and verified that the fix is fine. Will run a couple of more functional tests and commit.
        Hide
        Arun C Murthy added a comment -

        +1

        Show
        Arun C Murthy added a comment - +1
        Hide
        Hemanth Yamijala added a comment -

        test-patch results:

             [exec] -1 overall.
             [exec]
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec]
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec]
             [exec]     -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.
             [exec]
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        

        The eclipse classpath problem can be ignored.

        Show
        Hemanth Yamijala added a comment - test-patch results: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. The eclipse classpath problem can be ignored.
        Hide
        Hemanth Yamijala added a comment -

        Attached patch incorporates Arun's review comment and checks that the returned task is a setup task.

        Show
        Hemanth Yamijala added a comment - Attached patch incorporates Arun's review comment and checks that the returned task is a setup task.
        Hide
        Arun C Murthy added a comment -

        +1, looks good.

        Minor improvement to the testcase would be to assert that that obtained task is indeed a setup/cleanup task with Task.isJobSetupTask.

        Show
        Arun C Murthy added a comment - +1, looks good. Minor improvement to the testcase would be to assert that that obtained task is indeed a setup/cleanup task with Task.isJobSetupTask .
        Hide
        Hemanth Yamijala added a comment -

        Attached patch fixes the issue in scheduling of setup and cleanup tasks. The fix is to use TaskTrackerStatus.countOccupied

        {Map|Reduce}

        Slots to get the right number of slots to account to.

        In order to test this in unit test style, I've marked the method JobTracker.getSetupAndCleanupTasks as package private. Introduced a new test class TestSetupTaskScheduling to test these conditions. Verified without the code fix the corresponding tests fail.

        Arun and I felt that changing the usage in other schedulers from using slots to tasks is probably best left to a different jira as the impact needs to be more carefully studied, and this is a simple bug that only affects setup and cleanup tasks in the case of capacity scheduler right now.

        I plan to run a real cluster test tomorrow manually to ensure things work fine still. But I would appreciate a review in the meantime.

        Show
        Hemanth Yamijala added a comment - Attached patch fixes the issue in scheduling of setup and cleanup tasks. The fix is to use TaskTrackerStatus.countOccupied {Map|Reduce} Slots to get the right number of slots to account to. In order to test this in unit test style, I've marked the method JobTracker.getSetupAndCleanupTasks as package private. Introduced a new test class TestSetupTaskScheduling to test these conditions. Verified without the code fix the corresponding tests fail. Arun and I felt that changing the usage in other schedulers from using slots to tasks is probably best left to a different jira as the impact needs to be more carefully studied, and this is a simple bug that only affects setup and cleanup tasks in the case of capacity scheduler right now. I plan to run a real cluster test tomorrow manually to ensure things work fine still. But I would appreciate a review in the meantime.
        Hide
        Hemanth Yamijala added a comment -

        Looking at usage of countMapTasks (and similar for reduces), all schedulers except capacity scheduler count tasks instead of slots. This is fine for now since no one else uses memory based scheduling to the best of my knowledge. Ideally they should count slots so that whenever they support memory based scheduling they get the right behavior.

        The other usage is that these values are set in ClusterStatus. As far as I can see, this is not a problem and looks fine to me because the usage is looking at global map tasks, rather than number of slots on a machine. I have not checked the usage in test cases.

        Show
        Hemanth Yamijala added a comment - Looking at usage of countMapTasks (and similar for reduces), all schedulers except capacity scheduler count tasks instead of slots. This is fine for now since no one else uses memory based scheduling to the best of my knowledge. Ideally they should count slots so that whenever they support memory based scheduling they get the right behavior. The other usage is that these values are set in ClusterStatus. As far as I can see, this is not a problem and looks fine to me because the usage is looking at global map tasks, rather than number of slots on a machine. I have not checked the usage in test cases.
        Hide
        Hemanth Yamijala added a comment -

        In JobTracker.getSetupAndCleanupTasks(), before a task is assigned to a map slot or a reduce slot, we count the number of running maps (TaskTrackerStatus.countMapTasks) / reduces on the tracker from its status instead of map / reduce slots. One simple solution could be to use TaskTrackerStatus.countOccupiedMapSlots. Ideally we should also check if this bug appears elsewhere in the code.

        Show
        Hemanth Yamijala added a comment - In JobTracker.getSetupAndCleanupTasks(), before a task is assigned to a map slot or a reduce slot, we count the number of running maps (TaskTrackerStatus.countMapTasks) / reduces on the tracker from its status instead of map / reduce slots. One simple solution could be to use TaskTrackerStatus.countOccupiedMapSlots. Ideally we should also check if this bug appears elsewhere in the code.

          People

          • Assignee:
            Hemanth Yamijala
            Reporter:
            Hemanth Yamijala
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development