Hadoop Common / HADOOP-1281

Speculative map tasks aren't getting killed although the TIP completed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:
      None

      Description

      The speculative map tasks run to completion although the TIP succeeded since the other task completed elsewhere.
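      The expected behavior can be illustrated with a minimal sketch. This is not the actual JobTracker or TaskInProgress code and not the attached patch; the class SpeculativeTipSketch and all of its methods are hypothetical names chosen for illustration. It only shows the idea that, once one attempt of a task succeeds, every sibling attempt still running should be marked for an immediate kill rather than left to run until the job ends.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names do not correspond to real Hadoop classes.
class SpeculativeTipSketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    // Attempt id -> state, kept in launch order.
    private final Map<String, State> attempts = new LinkedHashMap<>();
    private boolean complete = false;

    void launchAttempt(String attemptId) {
        attempts.put(attemptId, State.RUNNING);
    }

    boolean isComplete() {
        return complete;
    }

    State stateOf(String attemptId) {
        return attempts.get(attemptId);
    }

    /**
     * Records a successful attempt and returns the ids of the sibling
     * attempts that were still running and should be killed right away.
     */
    List<String> attemptSucceeded(String attemptId) {
        List<String> toKill = new ArrayList<>();
        if (complete || attempts.get(attemptId) != State.RUNNING) {
            return toKill; // late, duplicate, or unknown report: nothing to do
        }
        attempts.put(attemptId, State.SUCCEEDED);
        complete = true;
        for (Map.Entry<String, State> e : attempts.entrySet()) {
            if (e.getValue() == State.RUNNING) {
                e.setValue(State.KILLED);
                toKill.add(e.getKey());
            }
        }
        return toKill;
    }

    public static void main(String[] args) {
        SpeculativeTipSketch tip = new SpeculativeTipSketch();
        tip.launchAttempt("attempt_0");
        tip.launchAttempt("attempt_1"); // speculative duplicate
        // The first attempt finishes; the speculative one should be killed.
        System.out.println("kill: " + tip.attemptSucceeded("attempt_0"));
    }
}
```

      In this sketch the kill list is computed at the moment the first success is reported, so a speculative attempt would not keep consuming a slot after the task itself is done.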

      1. HADOOP-1281_1_20071117.patch
        0.6 kB
        Arun C Murthy
      2. HADOOP-1281_2_20071123.patch
        1 kB
        Arun C Murthy
      3. HADOOP-1281_2_20080109.patch
        2 kB
        Arun C Murthy

        Activity

        Lohit Vijayarenu added a comment -

        We hit this bug today.
        Below is the log for the two attempts of the same task:
        <log>
        Task Attempts                      Status     Progress  Start Time            Finish Time                            Errors  Task Logs  Counters
        task_200711022153_0001_m_001548_0  SUCCEEDED  100.00%   2-Nov-2007 22:00:59   2-Nov-2007 22:05:50 (4mins, 51sec)
        task_200711022153_0001_m_001548_1  KILLED     84.44%    2-Nov-2007 22:02:17   2-Nov-2007 22:26:02 (23mins, 45sec)
        </log>

        Compare the time each attempt took: once the first attempt finished in ~4 minutes, the second attempt should have been killed. Instead it kept running for ~23 minutes. When we looked at the logs, we saw that the attempt was issued a kill signal only after the whole job had completed.
        The JobTracker did not send a kill signal to this task attempt before that (or maybe nothing was logged).

        Arun C Murthy added a comment -

        Straightforward patch. I've done some preliminary testing, but need to do more.

        Arun C Murthy added a comment -

        Submitting the patch for review; I'm done with testing.

        Milind Bhandarkar added a comment -

        Arun,

        Do you remember why the original code was explicitly not killing the speculative attempts of map tasks?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12369687/HADOOP-1281_1_20071117.patch
        against trunk revision r596418.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests -1. The patch failed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1119/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1119/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1119/console

        This message is automatically generated.

        Arun C Murthy added a comment -

        "Do you remember why the original code was explicitly not killing the speculative attempts of map tasks?"

        Actually I don't!

        It was a long while ago (almost a year) and probably had to do with the fact that we never ran into issues like this, since not many people used speculative execution back then.

        Arun C Murthy added a comment -

        I just committed this.

        Arun C Murthy added a comment -

        This seems to have introduced sporadic failures in some test cases, as noted in HADOOP-2252/HADOOP-2254. I'll investigate and fix those. In the meantime I'm reverting this patch.

        Hudson added a comment -

        Integrated in Hadoop-Nightly #311 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/311/ )

        Arun C Murthy added a comment -

        Updated patch to fix the test-case flakiness... I'll continue testing this.

        Arun C Murthy added a comment -

        I'm raising the priority to reflect that this is an important bug to fix for 0.16.0; we are losing lots of cycles due to this.

        Arun C Murthy added a comment -

        I finally got around to testing this patch thoroughly, hence marking it PA (Patch Available).

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12370124/HADOOP-1281_2_20071123.patch
        against trunk revision .

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1520/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1520/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1520/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1520/console

        This message is automatically generated.

        Devaraj Das added a comment -

        +1

        Arun C Murthy added a comment -

        Exact same patch as before, but with added comments explaining the rationale for the fix.

        Arun C Murthy added a comment -

        I just committed this.


          People

          • Assignee:
            Arun C Murthy
            Reporter:
            Arun C Murthy