  Hadoop Map/Reduce / MAPREDUCE-2214

TaskTracker should release slot if task is not launched

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      TaskTracker.TaskInProgress.launchTask() does not launch a task if the task is not in an expected state. However, when the task is not launched, its slot is not released either. We have observed this in production: the task was already in the SUCCEEDED state by the time launchTask() got to it, and the slot was never released. It is not clear how the task got into that state, but the TaskTracker should handle this case by releasing the slot.
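The fix the description asks for can be sketched against a simplified model. Everything below is hypothetical: RunState, SlotPool, and the set of launchable states are illustrative stand-ins, not the actual TaskTracker classes and not the contents of MAPREDUCE-2214.patch. The sketch only shows the invariant: every path out of launchTask() that skips the launch must hand the slot back.

```java
import java.util.EnumSet;
import java.util.Set;

public class SlotReleaseSketch {
    // Simplified run states (hypothetical; not Hadoop's TaskStatus.State).
    enum RunState { UNASSIGNED, RUNNING, SUCCEEDED, FAILED, KILLED, FAILED_UNCLEAN, KILLED_UNCLEAN }

    // States from which this sketch allows a launch (an assumption for illustration).
    static final Set<RunState> LAUNCHABLE =
        EnumSet.of(RunState.UNASSIGNED, RunState.FAILED_UNCLEAN, RunState.KILLED_UNCLEAN);

    // Hypothetical free-slot counter standing in for the TaskTracker's bookkeeping.
    static final class SlotPool {
        private int free;
        SlotPool(int free) { this.free = free; }
        synchronized void acquire() {
            if (free <= 0) throw new IllegalStateException("no free slots");
            free--;
        }
        synchronized void release() { free++; }
        synchronized int free() { return free; }
    }

    /** Returns true if the task was launched; otherwise returns its slot to the pool. */
    static boolean launchTask(RunState state, SlotPool slots) {
        if (LAUNCHABLE.contains(state)) {
            // A real launch would start the child JVM here; the slot stays held
            // until the task finishes.
            return true;
        }
        // The reported bug: returning here without release() leaks the slot.
        slots.release();
        return false;
    }

    public static void main(String[] args) {
        SlotPool slots = new SlotPool(2);
        slots.acquire(); // slot taken when the task was assigned
        boolean launched = launchTask(RunState.SUCCEEDED, slots);
        System.out.println("launched=" + launched + " free=" + slots.free());
    }
}
```

Running main prints launched=false free=2: the slot taken at assignment is returned because the task was already SUCCEEDED.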

      Attachments:
        1. MAPREDUCE-2214.patch (2 kB, Ramkumar Vadali)

        Activity

        Dick King added a comment:

        Speculative execution is a legitimate way a task can become SUCCEEDED while an attempt on that task is waiting to get launched.

        Joydeep Sen Sarma added a comment:

        I think what happened in our case was something like this:

        1. The task was requested to be killed.
        2. The TaskTracker (TT) performed the kill and reported back to the JobTracker (JT).
        3. But the task reported back as done, at which point the TT promptly moved it into the SUCCEEDED state.
        4. Meanwhile, the JT scheduled a cleanup attempt, which failed to launch and never returned its slot.

        I think the criss-crossing of #2 and #3 was the unexpected part (something the code doesn't anticipate).
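The four-step interleaving above can be replayed deterministically against a toy model to see where the slot goes. The Tracker class, its method names, and the assumption that a cleanup attempt is launchable only from KILLED_UNCLEAN are illustrative, not Hadoop's code. The model tracks only the slot taken by the cleanup attempt, and compares the reported behavior (skip without release) with the behavior the issue asks for.

```java
import java.util.ArrayList;
import java.util.List;

public class KillRaceReplay {
    enum RunState { RUNNING, KILLED_UNCLEAN, SUCCEEDED }

    // Hypothetical, heavily simplified TT-side view of one task attempt.
    static final class Tracker {
        int freeSlots;
        RunState state = RunState.RUNNING;
        final List<String> log = new ArrayList<>();
        final boolean releaseOnSkip; // false models the reported bug, true the fix

        Tracker(int freeSlots, boolean releaseOnSkip) {
            this.freeSlots = freeSlots;
            this.releaseOnSkip = releaseOnSkip;
        }

        // Steps 1-2: kill requested and performed; the task now needs cleanup.
        void kill() { state = RunState.KILLED_UNCLEAN; log.add("killed"); }

        // Step 3: a late "done" report moves the task to SUCCEEDED.
        void childReportsDone() { state = RunState.SUCCEEDED; log.add("done"); }

        // Step 4: the JT's cleanup attempt takes a slot before launch is attempted.
        void launchCleanup() {
            freeSlots--;
            if (state == RunState.KILLED_UNCLEAN) {
                log.add("cleanup launched"); // slot freed when cleanup ends (not modelled)
                return;
            }
            log.add("cleanup skipped (" + state + ")");
            if (releaseOnSkip) freeSlots++;
        }
    }

    static Tracker replay(boolean releaseOnSkip) {
        Tracker tt = new Tracker(1, releaseOnSkip);
        tt.kill();
        tt.childReportsDone(); // crosses the kill report
        tt.launchCleanup();
        return tt;
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            Tracker tt = replay(fix);
            System.out.println((fix ? "fixed" : "buggy") + ": "
                + String.join(" -> ", tt.log) + "; free slots = " + tt.freeSlots);
        }
    }
}
```

In the buggy replay the cleanup attempt is skipped and the tracker ends with 0 free slots; with release-on-skip it ends with 1.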

        We don't hit this problem with speculation because we never speculate a task that is about to complete: there is a check on the task's remaining time, and if the remaining time is less than N minutes, we don't speculate. (There's a JIRA for this; I don't remember which.)
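A guard of the kind described might look like the following. The shouldSpeculate name and the linear estimate of remaining time (elapsed time scaled by remaining progress) are assumptions for illustration; the comment does not say how the remaining time is estimated, and the threshold N is left as a parameter.

```java
public class SpeculationGuard {
    /**
     * Hypothetical check: speculate only if the attempt is expected to run
     * for at least thresholdMs more. Remaining time is estimated by linear
     * extrapolation from progress (an assumption, not Hadoop's estimator).
     */
    static boolean shouldSpeculate(double progress, long elapsedMs, long thresholdMs) {
        if (progress <= 0.0 || progress >= 1.0) {
            return false; // no estimate possible yet, or already finished
        }
        double estimatedRemainingMs = elapsedMs * (1.0 - progress) / progress;
        return estimatedRemainingMs >= thresholdMs;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        // Half done after 10 minutes: about 10 minutes left, so speculate.
        System.out.println(shouldSpeculate(0.50, 10 * 60 * 1000L, fiveMinutes));
        // 95% done after 19 minutes: about 1 minute left, so don't.
        System.out.println(shouldSpeculate(0.95, 19 * 60 * 1000L, fiveMinutes));
    }
}
```

Declining to speculate at zero progress is also a choice made for this sketch, since no remaining-time estimate exists yet.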

        Ramkumar Vadali added a comment:

        TEST RESULTS

        ant test-patch complains about the lack of new unit tests, but it's difficult to come up with a unit test for this.

             [exec] -1 overall.
             [exec]
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no new tests are needed for this patch.
             [exec]                         Also please list what manual steps were performed to verify this patch.
             [exec]
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
             [exec]
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
             [exec]
             [exec]     +1 system test framework.  The patch passed system test framework compile.
             [exec]
             [exec]
             [exec]
             [exec]
             [exec] ======================================================================
             [exec] ======================================================================
             [exec]     Finished build.
             [exec] ======================================================================
             [exec] ======================================================================
             [exec]
             [exec]
        

        ant test: there was only one test failure, and that test fails on a clean checkout too.

            [junit] Test org.apache.hadoop.mapred.TestControlledMapReduceJob FAILED (timeout)
        
        Hadoop QA added a comment:

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12466033/MAPREDUCE-2214.patch
        against trunk revision 1074251.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/48//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/48//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/48//console

        This message is automatically generated.

        Arun C Murthy added a comment:

        Sorry to come in late; the patch has gone stale. Can you please rebase? Thanks.

        Given this is not an issue with MRv2, should we still commit this? I'm happy to, but I'm not sure it's useful. Thanks.


          People

          • Assignee: Ramkumar Vadali
          • Reporter: Ramkumar Vadali
          • Votes: 0
          • Watchers: 5
