Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.2
    • Fix Version/s: 0.23.3, 2.0.2-alpha
    • Component/s: mrv2
    • Labels:
      None

      Description

      We saw an instance where the RM stopped launching ApplicationMasters. We found that the launcher thread was hung because something weird/bad happened to the NM node. Currently there is only one launcher thread (jira 4061 is to fix that). We need this to not happen. Even once we increase the number of threads beyond one, the RM would still get stuck if that many nodes go bad. Note that the RM was stuck like this for approximately 9 hours.
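      The hang comes down to a wait with no upper bound: the trace below is parked in Object.wait() with no timeout. As a minimal sketch of the alternative (plain JDK only, not the Hadoop IPC code; PendingCall and await are hypothetical names invented for illustration), a bounded wait hands the launcher thread back after a deadline:

```java
import java.net.SocketTimeoutException;

// Hypothetical sketch, not Hadoop code: a reply slot with a bounded wait.
final class BoundedCallSketch {

    /** Stand-in for a pending RPC whose reply may never arrive. */
    static final class PendingCall {
        private boolean done;
        private String value;

        synchronized void complete(String v) {
            value = v;
            done = true;
            notifyAll();
        }

        /** Waits at most timeoutMs for the reply instead of waiting with no limit. */
        synchronized String await(long timeoutMs)
                throws InterruptedException, SocketTimeoutException {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (!done) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    throw new SocketTimeoutException(timeoutMs + " ms elapsed without a reply");
                }
                wait(remainingMs);
            }
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        PendingCall stuck = new PendingCall(); // the node never answers
        try {
            stuck.await(100);
        } catch (SocketTimeoutException e) {
            System.out.println("launcher thread is free again: " + e.getMessage());
        }
    }
}
```

      With a bound like this, a single bad node costs the launcher one timeout period rather than the whole queue.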

      Stack trace on hung AM launcher:

      "pool-1-thread-1" prio=10 tid=0x000000004343e800 nid=0x3a4c in Object.wait() [0x000000004fad2000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:485)
              at org.apache.hadoop.ipc.Client.call(Client.java:1076)
              - locked <0x00002aab05a4f3f0> (a org.apache.hadoop.ipc.Client$Call)
              at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:135)
              at $Proxy76.startContainer(Unknown Source)
              at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:87)
              at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
              at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:265)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:619)
      1. MAPREDUCE-4062.patch
        30 kB
        Thomas Graves
      2. MAPREDUCE-4062-branch-0.23.patch
        30 kB
        Thomas Graves
      3. MAPREDUCE-4062.patch
        25 kB
        Thomas Graves

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1040 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1040/)
          Missed a test file as part of MAPREDUCE-4062 (Revision 1309043)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309037)

          Result = FAILURE
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309043
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java

          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309037
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1005 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1005/)
          Missed a test file as part of MAPREDUCE-4062 (Revision 1309043)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309037)

          Result = FAILURE
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309043
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java

          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309037
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #218 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/218/)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309046)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309045)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309046
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309045
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1993 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1993/)
          Missed a test file as part of MAPREDUCE-4062 (Revision 1309043)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309043
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2055 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2055/)
          Missed a test file as part of MAPREDUCE-4062 (Revision 1309043)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309043
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1992 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1992/)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309037)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309037
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1980 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1980/)
          Missed a test file as part of MAPREDUCE-4062 (Revision 1309043)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309037)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309043
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java

          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309037
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          Robert Joseph Evans added a comment -

          Thanks Tom, I just put this into trunk, branch-2, branch-0.23

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2054 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2054/)
          MAPREDUCE-4062. AM Launcher thread can hang forever (tgraves via bobby) (Revision 1309037)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1309037
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/ContainerManagerPBClientImpl.java
          Robert Joseph Evans added a comment -

          The patch for 0.23 looks good too. +1.

          Robert Joseph Evans added a comment -

          I reviewed the code for trunk/branch-2 and it looks good to me. I like how there is lots of code being deleted and almost only tests being added.

          I am going to look at the branch-0.23 patch now.

          Thomas Graves added a comment -

          The tests all pass when I manually run them on both branch-0.23 and trunk. I assume the reported failures are related to the random test failures we've been seeing.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12521159/MAPREDUCE-4062.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
          org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
          org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
          org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2133//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2133//console

          This message is automatically generated.

          Thomas Graves added a comment -

          patch for trunk/branch-2. Contains updates to TestContainerLauncher.

          Thomas Graves added a comment -

          Patch for branch-0.23, since it is now different from trunk/branch-2.

          Thomas Graves added a comment -

          making changes for trunk and adding test for containerLauncher

          Thomas Graves added a comment -

          I'll see if I can add something to TestContainerLauncher also. It is tested indirectly via TestContainerLaunchRPC. I figured that one catches both RM launch of AM as well as AM launching containers.

          I also tested things manually on a two-node cluster by adding a sleep to the ContainerManager to simulate a hang. I verified that it properly errors out in both the RM-launching-AM and AM-launching-containers cases. In all cases it properly retried when the retry values were > 1.
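
          The manual test described above can be approximated off-cluster. The following is a rough sketch using only the JDK, not the Hadoop RPC layer or the actual patch: flakyStart and callWithRetries are hypothetical helpers, and the timeout is enforced with Future.get rather than an RPC-level setting. The first calls sleep to simulate a hung node, and the caller retries after each timeout:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: simulate a hung start-container call and retry on timeout.
final class HangRetrySketch {

    /** Stand-in for the node-side handler; the first `hangs` calls sleep for hangMs. */
    static Callable<String> flakyStart(AtomicInteger calls, int hangs, long hangMs) {
        return () -> {
            if (calls.incrementAndGet() <= hangs) {
                Thread.sleep(hangMs); // simulated hang
            }
            return "started";
        };
    }

    /** Tries the call up to maxAttempts times, giving each attempt timeoutMs. */
    static String callWithRetries(Callable<String> rpc, int maxAttempts, long timeoutMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            TimeoutException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<String> reply = pool.submit(rpc);
                try {
                    return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    reply.cancel(true); // interrupt the stuck attempt
                    last = e;
                }
            }
            throw last; // every attempt timed out
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        String result = callWithRetries(flakyStart(calls, 1, 5_000), 2, 200);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

          With a retry count of 1 the same call fails with a TimeoutException, which corresponds to the "errors out" case above.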

          Vinod Kumar Vavilapalli added a comment -

          unless there are other cases besides RPC where we'd expect the container launcher to get stuck.

          Currently the timer task's purpose is only to catch rpcs that get stuck.

          When I originally did the timer-task, I didn't really think that rpcTimeout could work. If you feel that works, sure, +1. You should definitely modify TestContainerLauncher to verify the changes with the rpc-timeout.
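
          One hazard of keeping a timer task alongside an RPC-level timeout is that the timer's interrupt can land after the call has already failed on its own, while the thread is reporting the result. The sketch below is a JDK-only illustration under that assumption, not the ContainerLauncherImpl code: report is a hypothetical helper and the event name is made up. It shows that a pending interrupt makes LinkedBlockingQueue.put fail, so the event never reaches the queue:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a late interrupt causes a status event to be dropped.
final class LateInterruptSketch {

    /** Returns true if the event was queued, false if a pending interrupt dropped it. */
    static boolean report(BlockingQueue<String> events, String event) {
        try {
            events.put(event); // fails if the thread's interrupt flag is already set
            return true;
        } catch (InterruptedException e) {
            return false; // the event never reached the queue
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        Thread.currentThread().interrupt(); // stands in for a watchdog timer firing late
        boolean delivered = report(events, "CONTAINER_CLEANUP_FAILED");
        System.out.println("delivered=" + delivered + ", queued=" + events.size());
    }
}
```

          Relying on a single timeout mechanism avoids this kind of overlap.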

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12521004/MAPREDUCE-4062.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2127//console

          This message is automatically generated.

          Jason Lowe added a comment -

          We've seen an issue where using both an RPC-level timeout (in this case the ping timeout) and the timer task can cause the AM to lose track of a container and hang the job. Here's the relevant part of the AM log:

          2012-03-29 07:32:17,794 ERROR [ContainerLauncher #199] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:container_1333003059741_0010_01_003408 (auth:SIMPLE) cause:java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
          2012-03-29 07:32:17,794 WARN [ContainerLauncher #199] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
          2012-03-29 07:32:17,794 ERROR [ContainerLauncher #199] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:container_1333003059741_0010_01_003408 (auth:SIMPLE) cause:java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
          2012-03-29 07:32:17,795 WARN [Timer-1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Couldn't complete CONTAINER_REMOTE_CLEANUP on container_1333003059741_0010_01_003408/attempt_1333003059741_0010_m_003097_0. Interrupting and returning
          2012-03-29 07:32:17,798 INFO [Timer-1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Stack trace of the command-thread: 
          
                  at java.util.Arrays.copyOf(Arrays.java:2882)
                  at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
                  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
                  at java.lang.StringBuilder.append(StringBuilder.java:119)
                  at java.lang.StackTraceElement.toString(StackTraceElement.java:157)
                  at java.lang.String.valueOf(String.java:2826)
                  at java.lang.StringBuilder.append(StringBuilder.java:115)
                  at java.lang.Throwable.printStackTrace(Throwable.java:512)
                  at org.apache.hadoop.util.StringUtils.stringifyException(StringUtils.java:64)
                  at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:260)
                  at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:479)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:619)
          2012-03-29 07:32:17,800 WARN [ContainerLauncher #199] org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
          java.lang.InterruptedException
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
                  at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
                  at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:179)
                  at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:263)
                  at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:479)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:619)
          

          Looks like the socket timeout and the timer task timeout occurred almost simultaneously. The socket exception was caught first, and during the catch clause we fielded the interrupted exception. That broke us out of the handling of the socket exception and we never marked the container status properly before leaving.
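The sequence described above can be reproduced in miniature with plain JDK classes. This is only a sketch of the suspected mechanism, not the ContainerLauncherImpl code: the class, method, and event names are hypothetical, and the timer task is simulated by setting the thread's interrupt flag inside the catch block. It relies on LinkedBlockingQueue.put being interruptible, which matches the lockInterruptibly frame in the trace.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: none of these names come from the Hadoop source.
class InterruptRaceSketch {

    /** Stand-in for the RPC that hits its socket timeout. */
    static void remoteCall() throws SocketTimeoutException {
        throw new SocketTimeoutException("60000 millis timeout (simulated)");
    }

    /**
     * Handles a timed-out call by queuing a failure event.
     * When timerInterrupts is true, the interrupt flag is set while the
     * catch block is running, as if a timer task fired at that moment.
     * Returns true if the failure event actually reached the queue.
     */
    static boolean handleFailure(boolean timerInterrupts) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        try {
            remoteCall();
        } catch (SocketTimeoutException e) {
            if (timerInterrupts) {
                // Simulated timer task interrupting the launcher thread.
                Thread.currentThread().interrupt();
            }
            try {
                // Interruptible hand-off: throws InterruptedException at once
                // if the interrupt flag is already set.
                events.put("CONTAINER_FAILED");
            } catch (InterruptedException ie) {
                // The hand-off was aborted, so no status was recorded.
            }
        }
        Thread.interrupted(); // clear any leftover interrupt flag
        return events.contains("CONTAINER_FAILED");
    }

    public static void main(String[] args) {
        System.out.println("no interrupt: " + handleFailure(false));
        System.out.println("interrupt during catch: " + handleFailure(true));
    }
}
```

With the interrupt set, the failure event never reaches the queue, which is the "never marked the container status" outcome. With a single timeout mechanism there is no second thread to interrupt the handler.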

          So I'm +1 on removing the timer task and relying on the rpcTimeout, unless there are other cases besides RPC where we'd expect the container launcher to get stuck.

          Thomas Graves added a comment -

          This seems to be the same issue that was seen when the AM hung launching containers in MAPREDUCE-3228. I'm investigating using an rpcTimeout when ContainerManagerPBClientImpl creates the proxy. If anyone knows a reason not to use the rpcTimeout, please let me know.
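For illustration only, here is what a read timeout buys at the plain socket level. This sketch uses java.net directly rather than Hadoop RPC, and the class and method names are hypothetical. A local server that accepts the connection but never replies stands in for the unresponsive NM. How rpcTimeout is wired through the Hadoop IPC client is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: plain sockets, not the Hadoop IPC client.
class ReadTimeoutSketch {

    /**
     * Connects to a local server that never writes anything and tries to
     * read one byte. Returns true if the read gave up with a
     * SocketTimeoutException after roughly timeoutMillis.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // The connection completes via the listen backlog; accept() is never
        // called and nothing is ever sent, so the read can only time out.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // bound every blocking read
            try {
                client.getInputStream().read();
                return false; // data or EOF arrived; not expected here
            } catch (SocketTimeoutException e) {
                return true;  // the caller regains control instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

Without the setSoTimeout call the read would block indefinitely, which is analogous to the launcher thread sitting in Object.wait() in the original report.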


            People

            • Assignee:
              Thomas Graves
              Reporter:
              Thomas Graves
            • Votes:
              0
              Watchers:
              6
