Hadoop Map/Reduce: MAPREDUCE-4890

Invalid TaskImpl state transitions when task fails while speculating

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.2-alpha, 0.23.5
    • Fix Version/s: 2.0.3-alpha, 0.23.6
    • Component/s: mr-am
    • Labels:
      None

      Description

      There are a couple of issues when a task fails while speculating (i.e., while multiple attempts are active):

      1. The other active attempts are not killed.
      2. TaskImpl's FAILED state does not handle the T_ATTEMPT_* set of events which can be sent from the other active attempts. These all need to be handled since they can be sent asynchronously from the other active task attempts.

      Failure to handle this properly means that jobs which are configured to tolerate failures via mapreduce.map.failures.maxpercent or mapreduce.reduce.failures.maxpercent, and which also speculate, can easily end up failing due to invalid state transitions rather than completing successfully with a few explicitly allowed task failures.
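
The failure mode described above can be illustrated with a toy, table-driven state machine. This is only a sketch under stated assumptions: the class `StrictTaskStateMachineSketch` and its transition table are hypothetical and are not the Hadoop `StateMachineFactory` or `TaskImpl` code, and the model assumes a single attempt failure is enough to fail the task. It shows how a state with no registered handler for an event rejects that event when a second, still-active attempt reports in late.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model only (hypothetical); not the Hadoop StateMachineFactory or TaskImpl. */
final class StrictTaskStateMachineSketch {
    enum State { RUNNING, FAILED }
    enum Event { T_ATTEMPT_FAILED }

    // Registered transitions: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.RUNNING;

    StrictTaskStateMachineSketch() {
        // Assumption: one failed attempt fails the whole task.
        Map<Event, State> running = new EnumMap<>(Event.class);
        running.put(Event.T_ATTEMPT_FAILED, State.FAILED);
        table.put(State.RUNNING, running);
        // FAILED registers no transitions, so every event there is invalid.
        table.put(State.FAILED, new EnumMap<>(Event.class));
    }

    State handle(Event event) {
        State next = table.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException(
                "Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        StrictTaskStateMachineSketch task = new StrictTaskStateMachineSketch();
        task.handle(Event.T_ATTEMPT_FAILED);      // first attempt fails the task
        try {
            task.handle(Event.T_ATTEMPT_FAILED);  // speculative attempt fails later
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the second event is rejected because nothing was registered for it in FAILED; the issue reports that the real job then moves to ERROR instead of tolerating the task failure.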

        Activity

        Jason Lowe added a comment -

        Example exception trace when a speculative attempt fails after the task already failed:

        2012-12-18 01:06:35,885 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1354689281155_256490_m_000000_4 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
        2012-12-18 01:06:35,887 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Can't handle this event at current state for task_1354689281155_256490_m_000000
        org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_FAILED at FAILED
        	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
        	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handle(TaskImpl.java:642)
        	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handle(TaskImpl.java:95)
        	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskEventDispatcher.handle(MRAppMaster.java:984)
        	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskEventDispatcher.handle(MRAppMaster.java:978)
        	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
        	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
        	at java.lang.Thread.run(Thread.java:619)
        2012-12-18 01:06:35,888 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Invalid event T_ATTEMPT_FAILED on Task task_1354689281155_256490_m_000000
        2012-12-18 01:06:35,909 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1354689281155_256490Job Transitioned from RUNNING to ERROR
        
        Jason Lowe added a comment -

        Note that it appears the task KILLED state also needs to handle the various T_ATTEMPT_* events since they could arrive asynchronously and legitimately be received in that state.

        Jason Lowe added a comment -

        I was wrong about the KILLED state. KILL_WAIT should handle cleaning up any lingering attempts, and by the time the task transitions from KILL_WAIT to KILLED there should be no active task attempts and therefore no chance of receiving T_ATTEMPT_* events.
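
The reasoning above can be sketched as follows. This is an illustrative model only, not code from the patch: the class `KillWaitSketch` and its methods are hypothetical, and it assumes every active attempt eventually reports exactly one terminal event. The point it shows is that KILLED is entered only after the set of active attempts has drained, so no attempt is left that could send a further event.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the KILL_WAIT -> KILLED reasoning; not Hadoop code. */
final class KillWaitSketch {
    enum State { RUNNING, KILL_WAIT, KILLED }

    private State state = State.RUNNING;
    private final Set<String> activeAttempts = new HashSet<>();

    void launchAttempt(String attemptId) {
        activeAttempts.add(attemptId);
    }

    State getState() {
        return state;
    }

    /** Request a kill: wait in KILL_WAIT while any attempt is still active. */
    void kill() {
        if (state != State.RUNNING) {
            return;
        }
        state = activeAttempts.isEmpty() ? State.KILLED : State.KILL_WAIT;
    }

    /** Called once per attempt when it reaches a terminal state. */
    void onAttemptFinished(String attemptId) {
        if (state == State.KILLED) {
            // Unreachable if each attempt reports once: KILLED implies no active attempts.
            throw new IllegalStateException("Unexpected attempt event at KILLED: " + attemptId);
        }
        activeAttempts.remove(attemptId);
        if (state == State.KILL_WAIT && activeAttempts.isEmpty()) {
            state = State.KILLED;
        }
    }
}
```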

        Jason Lowe added a comment -

        Patch to kill active attempts when a task transitions to the FAILED state and also ignore all T_ATTEMPT_* events while in the FAILED state.
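
The two behaviours named in this comment can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the contents of MAPREDUCE-4890.patch: the class `FailedTaskSketch`, its `killRequests` list, and the attempt ids are hypothetical, and the model assumes a single attempt failure fails the task. It shows the other active attempts being asked to die on the transition to FAILED, and late attempt events being ignored once the task is in FAILED.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the described behaviour; not the actual patch. */
final class FailedTaskSketch {
    enum State { RUNNING, FAILED }
    enum Event { T_ATTEMPT_FAILED, T_ATTEMPT_KILLED, T_ATTEMPT_SUCCEEDED }

    private State state = State.RUNNING;
    private final Set<String> activeAttempts = new LinkedHashSet<>();
    /** Attempt ids that were asked to stop (stands in for sending kill events). */
    final List<String> killRequests = new ArrayList<>();

    void launchAttempt(String attemptId) {
        activeAttempts.add(attemptId);
    }

    State getState() {
        return state;
    }

    void handle(Event event, String attemptId) {
        if (state == State.FAILED) {
            // Attempts that were still running may report asynchronously; ignore them.
            return;
        }
        if (event != Event.T_ATTEMPT_FAILED) {
            return; // other RUNNING-state transitions are out of scope for this sketch
        }
        activeAttempts.remove(attemptId);
        // Assumption: one failed attempt fails the task. Kill whatever is still active.
        killRequests.addAll(activeAttempts);
        activeAttempts.clear();
        state = State.FAILED;
    }
}
```

With two attempts launched, failing the first records a kill request for the second, and any later event from the second attempt leaves the task in FAILED without raising an error.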

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561746/MAPREDUCE-4890.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3140//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3140//console

        This message is automatically generated.

        Thomas Graves added a comment -

        +1 looks good. Thanks Jason. Go ahead and commit.

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3154 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3154/)
        MAPREDUCE-4890. Invalid TaskImpl state transitions when task fails while speculating. Contributed by Jason Lowe (Revision 1425223)

        Result = SUCCESS
        jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425223
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Jason Lowe added a comment -

        Thanks for the review, Tom. I committed this to trunk, branch-2, and branch-0.23.

        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #73 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/73/)
        MAPREDUCE-4890. Invalid TaskImpl state transitions when task fails while speculating. Contributed by Jason Lowe (Revision 1425223)

        Result = SUCCESS
        jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425223
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #471 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/471/)
        svn merge -c 1425223 FIXES: MAPREDUCE-4890. Invalid TaskImpl state transitions when task fails while speculating. Contributed by Jason Lowe (Revision 1425227)

        Result = UNSTABLE
        jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425227
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1262 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1262/)
        MAPREDUCE-4890. Invalid TaskImpl state transitions when task fails while speculating. Contributed by Jason Lowe (Revision 1425223)

        Result = FAILURE
        jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425223
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1292 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1292/)
        MAPREDUCE-4890. Invalid TaskImpl state transitions when task fails while speculating. Contributed by Jason Lowe (Revision 1425223)

        Result = FAILURE
        jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425223
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

          People

          • Assignee:
            Jason Lowe
            Reporter:
            Jason Lowe
          • Votes:
            0
            Watchers:
            4
