Hadoop Map/Reduce / MAPREDUCE-3031

Job Client goes into infinite loop when we kill AM

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Started a cluster and submitted a sleep job with around 10000 maps and 1000 reduces.
      Killed the AM with kill -9, by which time about 7000 maps had already completed.

      On the RM web UI, the application is stuck in the Application.RUNNING state, and JobClient goes into an infinite loop because the RM keeps telling the client that the application is running.

      1. MR3031_v1.patch
        8 kB
        Siddharth Seth

          Activity

          Vinod Kumar Vavilapalli added a comment -

          This is a bug in the NM: just about any container killed this way (by running kill $pid on the node) will be stuck in the RUNNING state on the RM. I found this on the corresponding NM:

          org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_KILLED_ON_REQUEST at RUNNING
                  at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:297)
                  at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:39)
                  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:439)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:685)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:69)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:356)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:349)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
                  at java.lang.Thread.run(Thread.java:619)
          

          This is because an exit code of 137/143 is treated as a kill request. In hindsight this turns out to be a bad idea; we should fix it.
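
For readers unfamiliar with the 137/143 values: they follow the common shell convention of reporting "terminated by signal N" as exit status 128 + N, so 137 corresponds to SIGKILL (9) and 143 to SIGTERM (15). The sketch below only illustrates that arithmetic and why the exit code alone cannot tell an NM-requested kill from an external one. The class and method names are hypothetical and are not part of the NodeManager code.

```java
// Illustrative sketch only; ExitCodeSketch and its methods are hypothetical
// names, not Hadoop APIs.
final class ExitCodeSketch {
    // Shells conventionally report "terminated by signal N" as 128 + N.
    static final int SIGNAL_BASE = 128;
    static final int SIGKILL = 9;
    static final int SIGTERM = 15;

    /** Returns the signal number implied by an exit code, or -1 if none. */
    static int signalOf(int exitCode) {
        return exitCode > SIGNAL_BASE ? exitCode - SIGNAL_BASE : -1;
    }

    /**
     * True when the exit code looks like SIGKILL or SIGTERM. Note that this
     * says nothing about who sent the signal: a kill requested through the
     * NM and a manual "kill -9" on the node produce the same code.
     */
    static boolean looksLikeKill(int exitCode) {
        int signal = signalOf(exitCode);
        return signal == SIGKILL || signal == SIGTERM;
    }

    public static void main(String[] args) {
        for (int code : new int[] {0, 1, 137, 143}) {
            System.out.println(code + " -> looksLikeKill=" + looksLikeKill(code));
        }
    }
}
```

Because both kinds of kill map to the same exit code, any logic that infers "this was a kill request" from the code alone can raise a kill-related event while the container is still in RUNNING.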

          Vinod Kumar Vavilapalli added a comment -

          Sid, can you please take this up? Thanks!

          Siddharth Seth added a comment -

          Failing the container in case of a kill -9 while in the RUNNING state.
          Also sending out events to ContainersLauncher so that ContainersLauncher.running doesn't keep growing for containers which exit on their own.
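
The first point can be pictured with a toy transition table. This is a minimal sketch under stated assumptions, not the actual patch: the states and the KILL_REQUESTED event are simplified and hypothetical, only CONTAINER_KILLED_ON_REQUEST and RUNNING are names taken from the stack trace above, and the real ContainerImpl and StateMachineFactory are considerably richer.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: not the NodeManager's ContainerImpl or StateMachineFactory.
final class ContainerStateSketch {
    enum State { RUNNING, KILLING, KILLED, FAILED }            // simplified, hypothetical
    enum Event { KILL_REQUESTED, CONTAINER_KILLED_ON_REQUEST } // first name is hypothetical

    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.RUNNING;

    ContainerStateSketch(boolean withFix) {
        addTransition(State.RUNNING, Event.KILL_REQUESTED, State.KILLING);
        addTransition(State.KILLING, Event.CONTAINER_KILLED_ON_REQUEST, State.KILLED);
        if (withFix) {
            // Nobody asked for the kill, so treat the exit as a failure
            // instead of rejecting the event.
            addTransition(State.RUNNING, Event.CONTAINER_KILLED_ON_REQUEST, State.FAILED);
        }
    }

    private void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    /** Applies the event, or throws if the table has no matching transition. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    State current() {
        return current;
    }

    public static void main(String[] args) {
        ContainerStateSketch fixed = new ContainerStateSketch(true);
        System.out.println("with fix: " + fixed.handle(Event.CONTAINER_KILLED_ON_REQUEST));

        ContainerStateSketch broken = new ContainerStateSketch(false);
        try {
            broken.handle(Event.CONTAINER_KILLED_ON_REQUEST);
        } catch (IllegalStateException e) {
            // The state never changes, so the container keeps being reported as RUNNING.
            System.out.println("without fix: " + e.getMessage() + ", still " + broken.current());
        }
    }
}
```

Without the extra table entry the event is rejected and the state never leaves RUNNING, which matches the stuck container described in this issue.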

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12496345/MR3031_v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/843//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/843//console

          This message is automatically generated.

          Vinod Kumar Vavilapalli added a comment -

          +1, lgtm.

          There are a couple of missing transitions on ContainerState.LOCALIZATION_FAILED, not really caused by/part of this patch. Will track them separately.

          Vinod Kumar Vavilapalli added a comment -

          I just committed this to trunk and branch-0.23. Thanks Sid!

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #956 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/956/)
          MAPREDUCE-3031. Proper handling of killed containers to prevent stuck containers/AMs on an external kill signal. Contributed by Siddharth Seth.

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175960
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1034 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1034/)
          MAPREDUCE-3031. Proper handling of killed containers to prevent stuck containers/AMs on an external kill signal. Contributed by Siddharth Seth.

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175960
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #974 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/974/)
          MAPREDUCE-3031. Proper handling of killed containers to prevent stuck containers/AMs on an external kill signal. Contributed by Siddharth Seth.

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175960
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #28 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/28/)
          MAPREDUCE-3031. svn merge -c r1175960 --ignore-ancestry ../../trunk/

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175964
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #22 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/22/)
          MAPREDUCE-3031. svn merge -c r1175960 --ignore-ancestry ../../trunk/

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175964
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #813 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/813/)
          MAPREDUCE-3031. Proper handling of killed containers to prevent stuck containers/AMs on an external kill signal. Contributed by Siddharth Seth.

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175960
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #843 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/843/)
          MAPREDUCE-3031. Proper handling of killed containers to prevent stuck containers/AMs on an external kill signal. Contributed by Siddharth Seth.

          vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175960
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java

            People

            • Assignee: Siddharth Seth
            • Reporter: Karam Singh
            • Votes: 0
            • Watchers: 5
