Hadoop Map/Reduce / MAPREDUCE-4425

Speculation + Fetch failures can lead to a hung job

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.1
    • Fix Version/s: 3.0.0, 2.0.3-alpha, 0.23.5
    • Component/s: mrv2
    • Labels:
      None

      Description

      After a task goes to SUCCEEDED, FAILED/KILLED attempts are ignored.
      1. attempt 1 starts
      2. speculative attempt starts
      3. attempt 1 completes - Task moves to SUCCEEDED state
      4. speculative attempt is KILLED
      5. T_ATTEMPT_KILLED is ignored.
      6. attempt 1 fails with TOO_MANY_FETCH_FAILURES
      The job will effectively hang, since a new task attempt isn't started.
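
The sequence above can be replayed against a deliberately simplified model. This is only a sketch for illustration, not the real TaskImpl: the class BuggyTaskModel, its fields, and its method names are all hypothetical. It assumes the task tracks the attempts it believes are still running and only launches a replacement after a fetch failure when that set is empty.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the pre-fix bookkeeping; not the real TaskImpl. */
class BuggyTaskModel {
    enum State { RUNNING, SUCCEEDED }

    State state = State.RUNNING;
    // Attempts the task *believes* are still running.
    final Set<Integer> uncompleted = new HashSet<>();
    Integer successfulAttempt = null;
    int nextAttemptId = 0;
    int launches = 0;

    int launchAttempt() {
        int id = nextAttemptId++;
        uncompleted.add(id);
        launches++;
        return id;
    }

    void attemptSucceeded(int id) {
        uncompleted.remove(id);
        successfulAttempt = id;
        state = State.SUCCEEDED;
    }

    void attemptKilled(int id) {
        if (state == State.SUCCEEDED) {
            // The bug being modeled: the event is ignored, so 'id' is never
            // removed from 'uncompleted'.
            return;
        }
        uncompleted.remove(id);
    }

    void tooManyFetchFailures(int id) {
        if (state == State.SUCCEEDED && Integer.valueOf(id).equals(successfulAttempt)) {
            successfulAttempt = null;
            state = State.RUNNING;
            // A replacement is launched only if nothing else appears to be running.
            if (uncompleted.isEmpty()) {
                launchAttempt();
            }
        }
    }

    /** Replays steps 1-6 and returns how many attempts the fetch failure launched. */
    static int replayScenario() {
        BuggyTaskModel task = new BuggyTaskModel();
        int first = task.launchAttempt();        // 1. attempt 1 starts
        int speculative = task.launchAttempt();  // 2. speculative attempt starts
        task.attemptSucceeded(first);            // 3. task moves to SUCCEEDED
        task.attemptKilled(speculative);         // 4-5. kill arrives and is ignored
        int before = task.launches;
        task.tooManyFetchFailures(first);        // 6. successful attempt fails
        return task.launches - before;
    }
}
```

In this model replayScenario() launches no replacement attempt: the killed speculative attempt is still counted as running, so the task waits on an attempt that will never report back.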

      1. MAPREDUCE-4425-branch23.patch
        6 kB
        Jason Lowe
      2. MAPREDUCE-4425.patch
        6 kB
        Jason Lowe
      3. MAPREDUCE-4425.patch
        6 kB
        Jason Lowe

        Activity

        Jason Lowe added a comment -

        This is similar to the KILL_WAIT hang reported in MAPREDUCE-4751 because it's caused by incorrect bookkeeping in TaskImpl. Since it ignores attempts completing after the task succeeds, it incorrectly thinks it has uncompleted attempts running after a fetch failure and therefore doesn't launch a new attempt.

        Patch to fix the bookkeeping for attempts that complete while we're in the SUCCEEDED state so subsequent fetch failures will cause a new attempt to be launched.
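
As a rough illustration of the bookkeeping change described here, the following sketch records late completions even when the task is already SUCCEEDED. It is not the committed patch: FixedTaskModel and all of its members are hypothetical names, and rescheduling after a kill while the task is still running is deliberately left out of scope.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the corrected bookkeeping; not the committed patch. */
class FixedTaskModel {
    enum State { RUNNING, SUCCEEDED }

    State state = State.RUNNING;
    final Set<Integer> inProgress = new HashSet<>();
    final Set<Integer> finished = new HashSet<>();
    Integer successfulAttempt = null;
    int nextAttemptId = 0;

    int launchAttempt() {
        int id = nextAttemptId++;
        inProgress.add(id);
        return id;
    }

    // Every completion, in any task state, moves the attempt between the sets.
    private void markFinished(int id) {
        inProgress.remove(id);
        finished.add(id);
    }

    void attemptSucceeded(int id) {
        markFinished(id);
        if (state != State.SUCCEEDED) {
            successfulAttempt = id;
            state = State.SUCCEEDED;
        }
    }

    void attemptKilled(int id) {
        // Recorded even when the task is already SUCCEEDED.
        markFinished(id);
    }

    void tooManyFetchFailures(int id) {
        if (state == State.SUCCEEDED && Integer.valueOf(id).equals(successfulAttempt)) {
            successfulAttempt = null;
            state = State.RUNNING;
            if (inProgress.isEmpty()) {
                launchAttempt();
            }
        }
    }

    /** Replays the reported sequence and returns the number of attempts left in progress. */
    static int replayScenario() {
        FixedTaskModel task = new FixedTaskModel();
        int first = task.launchAttempt();
        int speculative = task.launchAttempt();
        task.attemptSucceeded(first);
        task.attemptKilled(speculative);   // now tracked while SUCCEEDED
        task.tooManyFetchFailures(first);  // no stale attempt, so a new one is launched
        return task.inProgress.size();
    }
}
```

With the late kill recorded, the in-progress set is empty when the fetch failure arrives, so the model launches a fresh attempt and replayScenario() ends with exactly one attempt in progress.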

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12552998/MAPREDUCE-4425.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3009//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3009//console

        This message is automatically generated.

        Jason Lowe added a comment -

        Upmerged patch to trunk since MAPREDUCE-4751 was integrated.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553125/MAPREDUCE-4425.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3012//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3012//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        The patch looks fine to me. When a task fails or is killed retroactively, we remove it from inProgress and add it to finished. Also, when a task succeeds after already succeeding, we add it to finished and remove it from inProgress.

        +1

        I'll check it in.

        Robert Joseph Evans added a comment -

        I put this into trunk, and branch-2, but it does not appear to be a clean merge to 0.23. Could you provide another patch for that?

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3001 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3001/)
        MAPREDUCE-4425. Speculation + Fetch failures can lead to a hung job (jlowe via bobby) (Revision 1408360)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1408360
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Jason Lowe added a comment -

        Thanks for looking at this, Bobby. Here's a patch for branch-0.23. The RetroactiveKilledAtSucceeded transition is missing in 0.23, so I preserved the existing 0.23 behavior if an attempt that already succeeded was killed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553144/MAPREDUCE-4425-branch23.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3015//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Thanks for the new patch. It looks good. +1.

        Robert Joseph Evans added a comment -

        Thanks again for fixing this.

        I put it into trunk, branch-2, and branch-0.23.

        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #35 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/35/)
        MAPREDUCE-4425. Speculation + Fetch failures can lead to a hung job (jlowe via bobby) (Revision 1408360)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1408360
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #434 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/434/)
        MAPREDUCE-4425. Speculation + Fetch failures can lead to a hung job (jlowe via bobby) (Revision 1408411)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1408411
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1225 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1225/)
        MAPREDUCE-4425. Speculation + Fetch failures can lead to a hung job (jlowe via bobby) (Revision 1408360)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1408360
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1256 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1256/)
        MAPREDUCE-4425. Speculation + Fetch failures can lead to a hung job (jlowe via bobby) (Revision 1408360)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1408360
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

          People

          • Assignee: Jason Lowe
          • Reporter: Siddharth Seth
          • Votes: 0
          • Watchers: 8
