Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5924

Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5.0
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The Sort job over 1GB data failed with below error

      2014-06-09 09:15:38,746 INFO [Socket Reader #1 for port 63415] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1402304714683_0002 (auth:SIMPLE)
      2014-06-09 09:15:38,750 INFO [IPC Server handler 13 on 63415] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state update from attempt_1402304714683_0002_r_000015_1000
      2014-06-09 09:15:38,751 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1402304714683_0002_r_000015_1000
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1058)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:145)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1271)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1263)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
              at java.lang.Thread.run(Thread.java:722)
      2014-06-09 09:15:38,753 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1402304714683_0002Job Transitioned from RUNNING to ERROR
      
      

      The JobHistory Url prints job state = ERROR

        Activity

        Hide
        Zhijie Shen added a comment -

        Some investigation based our logs:

        TA_COMMIT_PENDING at COMMIT_PENDING happens because there're two TA_COMMIT_PENDING that have been sent to TaskAttemptListenerImpl. The first one causes the state transition from RUNNING to COMMIT_PENDING, and the second one causes the invalid transition.

        Here's the code that notifies AM of committing the pending task.

        // say the task tracker that task is commit pending
              while (true) {
                try {
                  umbilical.commitPending(taskId, taskStatus);
                  break;
                } catch (InterruptedException ie) {
                  // ignore
                } catch (IOException ie) {
                  LOG.warn("Failure sending commit pending: " + 
                            StringUtils.stringifyException(ie));
                  if (--retries == 0) {
                    System.exit(67);
                  }
                }
              }
        

        The loop has been iterated twice, because the following exception happens:

        2014-06-09 09:15:38,539 WARN [main] org.apache.hadoop.mapred.Task: Failure sending commit pending: java.io.IOException: Failed on local exception: java.io.IOException: An existing connection was forcibly closed by the remote host; Host Details : local host is: "sdevin-clster53/10.215.16.72"; destination host is: "sdevin-clster54":63415; 
        	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
        	at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
        	at com.sun.proxy.$Proxy9.commitPending(Unknown Source)
        	at org.apache.hadoop.mapred.Task.done(Task.java:1006)
        	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:397)
        	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:415)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
        Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
        	at sun.nio.ch.SocketDispatcher.read0(Native Method)
        	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
        	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
        	at sun.nio.ch.IOUtil.read(IOUtil.java:198)
        	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
        	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        	at java.io.FilterInputStream.read(FilterInputStream.java:133)
        	at java.io.FilterInputStream.read(FilterInputStream.java:133)
        	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510)
        	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        	at java.io.DataInputStream.readInt(DataInputStream.java:387)
        	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
        	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)
        

        Throughout the log, I can see a lot of "IOException: An existing connection was forcibly closed by the remote host" (This issue seems to have been raised and solved in HADOOP-317, but it still occurs on Windows), and this particular exception instance results in re-sending committing message to AM.

        In addition, it seem to have a logic problem about re-sending the committing message. Given the service side has done everything, but IOException happens on the response. The second should not to be sent by the client, or should be ignored by the service. This is something we need to fix at the MR side. There're two options to fix the invalid transition:

        1. At the state machine, we ignore TA_COMMIT_PENDING when a task attempt is at COMMIT_PENDING. This is just a one-line fix.

        2. We can add RetryCache for TaskAttemptListenerImpl, when commitPending is called, we check whether we have cache for a particular attempt Id. If we have, it means the transition has been executed before, and we can skip the transition.

        Show
        Zhijie Shen added a comment - Some investigation based our logs: TA_COMMIT_PENDING at COMMIT_PENDING happens because there're two TA_COMMIT_PENDING that have been sent to TaskAttemptListenerImpl. The first one causes the state transition from RUNNING to COMMIT_PENDING, and the second one causes the invalid transition. Here's the code that notifies AM of committing the pending task. // say the task tracker that task is commit pending while ( true ) { try { umbilical.commitPending(taskId, taskStatus); break ; } catch (InterruptedException ie) { // ignore } catch (IOException ie) { LOG.warn( "Failure sending commit pending: " + StringUtils.stringifyException(ie)); if (--retries == 0) { System .exit(67); } } } The loop has been iterated twice, because the following exception happens: 2014-06-09 09:15:38,539 WARN [main] org.apache.hadoop.mapred.Task: Failure sending commit pending: java.io.IOException: Failed on local exception: java.io.IOException: An existing connection was forcibly closed by the remote host; Host Details : local host is: "sdevin-clster53/10.215.16.72" ; destination host is: "sdevin-clster54" :63415; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1414) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231) at com.sun.proxy.$Proxy9.commitPending(Unknown Source) at org.apache.hadoop.mapred.Task.done(Task.java:1006) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:397) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host at sun.nio.ch.SocketDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225) at sun.nio.ch.IOUtil.read(IOUtil.java:198) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949) Throughout the log, I can see a lot of "IOException: An existing connection was forcibly closed by the remote host" (This issue seems to have been raised and solved in HADOOP-317 , but it still occurs on Windows), and this particular exception instance results in re-sending committing message to AM. In addition, it seem to have a logic problem about re-sending the committing message. Given the service side has done everything, but IOException happens on the response. The second should not to be sent by the client, or should be ignored by the service. This is something we need to fix at the MR side. There're two options to fix the invalid transition: 1. At the state machine, we ignore TA_COMMIT_PENDING when a task attempt is at COMMIT_PENDING. This is just a one-line fix. 2. We can add RetryCache for TaskAttemptListenerImpl, when commitPending is called, we check whether we have cache for a particular attempt Id. If we have, it means the transition has been executed before, and we can skip the transition.
        Hide
        Zhijie Shen added a comment -

        Create a patch with option (1) to fix the problem quickly. And we should file a follow up ticket to go through the @AtMostOnce protocol APIs of MR, and make them use the retrycache. Once this is done, we can default this quick fix.

        Show
        Zhijie Shen added a comment - Create a patch with option (1) to fix the problem quickly. And we should file a follow up ticket to go through the @AtMostOnce protocol APIs of MR, and make them use the retrycache. Once this is done, we can default this quick fix.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12650231/MAPREDUCE-5924.1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4655//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4655//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650231/MAPREDUCE-5924.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4655//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4655//console This message is automatically generated.
        Hide
        Jian He added a comment -

        LGTM, +1

        Show
        Jian He added a comment - LGTM, +1
        Hide
        Jian He added a comment -

        committing..

        Show
        Jian He added a comment - committing..
        Hide
        Jian He added a comment -

        Committed to trunk and branch-2, Thanks Zhijie!

        Show
        Jian He added a comment - Committed to trunk and branch-2, Thanks Zhijie!
        Hide
        Jian He added a comment -

        Zhijie, can you open jira for the exception issue on Windows you mentioned? thx

        Show
        Jian He added a comment - Zhijie, can you open jira for the exception issue on Windows you mentioned? thx
        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #5720 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5720/)
        MAPREDUCE-5924. Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #5720 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5720/ ) MAPREDUCE-5924 . Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #587 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/587/)
        MAPREDUCE-5924. Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #587 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/587/ ) MAPREDUCE-5924 . Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #1778 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1778/)
        MAPREDUCE-5924. Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1778 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1778/ ) MAPREDUCE-5924 . Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #1805 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1805/)
        MAPREDUCE-5924. Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1805 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1805/ ) MAPREDUCE-5924 . Changed TaskAttemptImpl to ignore TA_COMMIT_PENDING event at COMMIT_PENDING state. Contributed by Zhijie Shen (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603338 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java

          People

          • Assignee:
            Zhijie Shen
            Reporter:
            Yesha Vora
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development