Hadoop YARN / YARN-3809

Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

    Details

    • Hadoop Flags: Reviewed

      Description

      ApplicationMasterLauncher creates a thread pool of fixed size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP).

      In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM detected the NM as LOST, it began cleaning up the AMs that had been running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch due to timeout.
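
      For illustration, here is a minimal sketch of the failure pattern described above, assuming a fixed pool of 10 workers draining LAUNCH/CLEANUP events; the class is illustrative, not the actual ResourceManager source:

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          public class LauncherPoolSketch {
              // Hard-coded pool size, as in the affected ApplicationMasterLauncher.
              private static final int POOL_SIZE = 10;

              private final ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);

              // Each LAUNCH/CLEANUP event becomes a task. A CLEANUP aimed at a dead NM
              // blocks its worker inside stopContainers() for the full RPC retry window,
              // so 10 such tasks leave no worker free for new LAUNCH events.
              public void onEvent(Runnable launchOrCleanup) {
                  pool.execute(launchOrCleanup);
              }
          }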

      Attachments:

      1. YARN-3809.01.patch (3 kB, Jun Gong)
      2. YARN-3809.02.patch (6 kB, Jun Gong)
      3. YARN-3809.03.patch (6 kB, Jun Gong)

        Issue Links

          Activity

          Jun Gong added a comment -

          It seems NMClient also needs the same change.

          Laxman added a comment -

          Thanks Rohith Sharma K S. Filed a new JIRA, MAPREDUCE-6659. Since I feel it requires MR ApplicationMaster changes, I filed the issue under the MR project.

          Rohith Sharma K S added a comment -

          Currently NMProxy retries indefinitely, i.e. 0 (infinite) is the default set by NMProxy. This is the reason all your AM-NM calls hang. It can be controlled by configuring "ipc.client.connect.max.retries.on.timeouts".
          Maybe a new configuration named yarn.client.connect.max.retries.on.timeouts could be added to control only NMProxy connections. Could you raise a new ticket for this?
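
          A minimal sketch of the knob Rohith mentions, assuming the retry count is lowered on a copied Configuration so only the one proxy is affected; the config key is the real Hadoop IPC key, while the helper class and the value 10 are illustrative:

              import org.apache.hadoop.conf.Configuration;

              public class RetryCapSketch {
                  // Cap connect retries for a single proxy. The key defaults to 45 retries;
                  // combined with the 20-second connect timeout, that is 15 minutes.
                  public static Configuration withCappedRetries(Configuration conf, int retries) {
                      Configuration proxyConf = new Configuration(conf); // copy, don't mutate the original
                      proxyConf.setInt("ipc.client.connect.max.retries.on.timeouts", retries);
                      return proxyConf;
                  }
              }

          For example, building the NM proxy with withCappedRetries(conf, 10) bounds a dead-NM connection attempt at roughly 10 × 20 s = 200 s instead of 15 minutes.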

          Laxman added a comment -

          We are using Hadoop 2.6.0 and are facing a similar issue with the AM-NM RPC protocol as well. After the AM is notified of a lost NM, I see AM threads hung in containerMgrProxy.stopContainers(stopRequest). The current patch fixes only the RM-NM protocol. Should we have a similar fix for AM-NM as well?

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/CHANGES.txt

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/969/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/CHANGES.txt

          Jun Gong added a comment -

          Thanks Rohith Sharma K S, Devaraj K, and Karthik Kambatla for the comments and review. Thanks Jason Lowe for the suggestions, review, and committing the patch.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8057 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8057/)
          YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java

          Jason Lowe added a comment -

          Thanks to Jun for the contribution and to Devaraj and Karthik for additional review! I committed this to trunk, branch-2, and branch-2.7.

          Jason Lowe added a comment -

          +1, latest patch LGTM. I will commit this later today if there are no objections.

          Jun Gong added a comment -

          As explained previously, the checkstyle and test case errors are not related to this patch.

          Hadoop QA added a comment -

          -1 overall

          Vote | Subsystem | Runtime | Comment
          0 | pre-patch | 18m 48s | Pre-patch trunk compilation is healthy.
          +1 | @author | 0m 0s | The patch does not contain any @author tags.
          -1 | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 | javac | 7m 39s | There were no new javac warning messages.
          +1 | javadoc | 9m 42s | There were no new javadoc warning messages.
          +1 | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings.
          -1 | checkstyle | 1m 52s | The applied patch generated 1 new checkstyle issues (total was 211, now 211).
          +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace.
          +1 | install | 1m 36s | mvn install still works.
          +1 | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse.
          +1 | findbugs | 4m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api.
          +1 | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common.
          -1 | yarn tests | 50m 45s | Tests failed in hadoop-yarn-server-resourcemanager.
          Total | | 98m 47s |

          Reason | Tests
          Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

          Subsystem | Report/Notes
          Patch URL | http://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch
          Optional Tests | javadoc javac unit findbugs checkstyle
          git revision | trunk / bcb3c40
          checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
          hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8299/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8299/testReport/
          Java | 1.7.0_55
          uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8299/console

          This message was automatically generated.

          Jun Gong added a comment -

          Attached a new patch to address Jason Lowe's suggestions. Thanks for the review.

          Jun Gong added a comment -

          Jason Lowe, thank you for the very detailed suggestions. They helped me a lot.

          I will update the patch to address your suggestions. I find it might be better to change the config in ApplicationMasterLauncher#serviceInit; then we only need to change it once, and every newly created AMLauncher will use this config, as in the sketch below.
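
          A minimal sketch of that serviceInit idea, assuming a Hadoop AbstractService subclass; the class name is illustrative, and the property name yarn.resourcemanager.nodemanager-connect-retries is the one Jason Lowe suggests in his review comment below:

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.service.AbstractService;

              public class LauncherInitSketch extends AbstractService {
                  private Configuration launcherConf;

                  public LauncherInitSketch() {
                      super("LauncherInitSketch");
                  }

                  @Override
                  protected void serviceInit(Configuration conf) throws Exception {
                      // Copy once here so the override applies to every AMLauncher created
                      // later, without polluting the RM-wide configuration.
                      launcherConf = new Configuration(conf);
                      launcherConf.setInt("ipc.client.connect.max.retries.on.timeouts",
                          conf.getInt("yarn.resourcemanager.nodemanager-connect-retries", 10));
                      super.serviceInit(conf);
                  }
              }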

          Jason Lowe added a comment -

          Thanks for updating the patch, Jun.

          ContainerManagementProtocolPBClientImpl is not the appropriate place to make this change. That is used by every client of the ContainerManagementProtocol, which includes AMs, etc. AMLauncher.getContainerMgrProxy is a more appropriate place, although there we still don't want to create a new config every time we create an NM client proxy. Creating configs is expensive. An even better place to put this change is in the AMLauncher constructor, where it can create a copy of the conf then set the IPC config in it, which in turn will be passed to the YarnRPC code to create the proxy.

          The new properties should have entries in yarn-default.xml for documentation purposes.

          Nit: "container-management" is probably not going to be clear to most users what it means. I think it would be clearer to use "nodemanager" since that's used in many other places, so I suggest a property name like "yarn.resourcemanager.nodemanager-connect-retries".

          Since we're changing the code that sets up the AM launcher thread pool, it'd be really nice to give each of the threads in that pool a descriptive name. It's annoying to see a bunch of threads like "pool-1-thread-2" in the jstack all waiting for work but no clue in the name or stack as to what service they belong to.
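
          A minimal sketch of such descriptive thread naming, assuming Guava's ThreadFactoryBuilder is on the classpath (Hadoop ships it); the name format and the wrapper class are illustrative:

              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;
              import com.google.common.util.concurrent.ThreadFactoryBuilder;

              public class NamedPoolSketch {
                  // With a name format, a jstack shows e.g. "ApplicationMasterLauncher #3"
                  // instead of the anonymous "pool-1-thread-3".
                  public static ExecutorService createPool(int size) {
                      return Executors.newFixedThreadPool(size,
                          new ThreadFactoryBuilder()
                              .setNameFormat("ApplicationMasterLauncher #%d")
                              .build());
                  }
              }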

          Jun Gong added a comment -

          Jason Lowe, thanks for the explanation and suggestions. I misunderstood Karthik Kambatla's suggestion: I thought we just needed to set IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_KEY to a smaller value, but that would pollute other proxies.

          The new patch addresses your suggestion: ipc.client.connect.max.retries.on.timeouts is now overridden from a newly introduced config, yarn.resourcemanager.container-management.retries, whose default value is 10. The default retry time will then be 200 seconds.

          Jason Lowe added a comment -

          I agree the thread pool should be configurable and possibly larger by default, but note that the thread pool size has little to do with the number of AMs running on a node. The pool size needs to be as large as the worst-case number of AMs being launched during the worst-case retry duration to avoid any unnecessary delays. For a 15 minute retry delay on a cluster launching multiple apps per second on average, that's an unreasonable thread pool size. I agree with Karthik Kambatla that we need to lower the retry timeout as part of this fix. As it is today we will expire the NM due to lack of heartbeat before we will give up on an AM launch retry which makes no sense.
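
          As a rough worked example of that sizing argument (the launch rate is hypothetical): at 2 AM launches per second with a 15-minute (900 s) worst-case retry window, avoiding launch delays would require on the order of 2 × 900 = 1,800 pool threads, which is clearly impractical; lowering the retry timeout shrinks that window directly.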

          We can update ipc.client.connect.max.retries.on.timeouts and potentially ipc.client.connect.timeout for the conf passed to the NM proxy we create, although we need to make sure we make a copy of the config to avoid polluting other proxies.

          Jun Gong added a comment -

          Devaraj K and Karthik Kambatla, thank you for the comments and suggestions.

          Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value?

          Threads in the pool only launch and stop AMs, so it is better for the pool to have at least as many threads as the maximum number of AMs that could run on a node. Although we cannot know that maximum for every cluster in advance, a larger value lets AMLauncher events be handled faster. Admins could simply use the default value and increase it if they find it too small.

          Or, could we make it so we don't wait as long as 15 minutes?

          Yes, we could make it shorter. I think we also need a larger thread pool, so that it can handle more events at the same time.

          Karthik Kambatla added a comment -

          Or, could we make it so we don't wait as long as 15 minutes?

          Karthik Kambatla added a comment -

          Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value?

          Devaraj K added a comment -

          The checkstyle error is : YarnConfiguration.java: File length is 2,025 lines (max allowed is 2,000). Need fix it?

          The applied patch generated 1 new checkstyle issues (total was 213, now 213).

          I think you don't need to fix this checkstyle issue as part of this JIRA: the checkstyle count is the same after the patch, and the new warning, which is about file length, exists without the patch as well and would take a reasonable amount of effort to refactor away.

          Jun Gong added a comment -

          The checkstyle error is: "YarnConfiguration.java: File length is 2,025 lines (max allowed is 2,000)." Do we need to fix it?

          The test case error is addressed in YARN-3790.

          Hadoop QA added a comment -

          -1 overall

          Vote | Subsystem | Runtime | Comment
          0 | pre-patch | 17m 31s | Pre-patch trunk compilation is healthy.
          +1 | @author | 0m 0s | The patch does not contain any @author tags.
          -1 | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 | javac | 7m 37s | There were no new javac warning messages.
          +1 | javadoc | 9m 42s | There were no new javadoc warning messages.
          +1 | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings.
          -1 | checkstyle | 1m 21s | The applied patch generated 1 new checkstyle issues (total was 213, now 213).
          +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace.
          +1 | install | 1m 35s | mvn install still works.
          +1 | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse.
          +1 | findbugs | 2m 54s | The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api.
          -1 | yarn tests | 50m 49s | Tests failed in hadoop-yarn-server-resourcemanager.
          Total | | 93m 4s |

          Reason | Tests
          Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

          Subsystem | Report/Notes
          Patch URL | http://issues.apache.org/jira/secure/attachment/12739875/YARN-3809.01.patch
          Optional Tests | javadoc javac unit findbugs checkstyle
          git revision | trunk / b039e69
          checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
          hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8261/testReport/
          Java | 1.7.0_55
          uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8261/console

          This message was automatically generated.

          Jun Gong added a comment -

          Attached a patch. It makes the thread pool size configurable, with a default of 50; a sketch of the approach follows.
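
          A minimal sketch of a configurable pool size; the property name below is an assumption for illustration, not necessarily the key the patch introduces:

              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;
              import org.apache.hadoop.conf.Configuration;

              public class ConfigurablePoolSketch {
                  // Read the launcher pool size from configuration instead of hard-coding 10.
                  public static ExecutorService createLauncherPool(Configuration conf) {
                      int threads = conf.getInt("yarn.resourcemanager.amlauncher.thread-count", 50);
                      return Executors.newFixedThreadPool(threads);
                  }
              }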

          Jun Gong added a comment -

          The stack trace is as follows:

          2015-06-15 11:16:35,376 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error cleaning master
          org.apache.hadoop.net.ConnectTimeoutException: Call From docker-10-240-139-221/10.240.139.221 to docker-10-240-139-234:8041 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
                  at sun.reflect.GeneratedConstructorAccessor107.newInstance(Unknown Source)
                  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                  at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
                  at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
                  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
                  at org.apache.hadoop.ipc.Client.call(Client.java:1414)
                  at org.apache.hadoop.ipc.Client.call(Client.java:1363)
                  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
                  at com.sun.proxy.$Proxy36.stopContainers(Unknown Source)
                  at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
                  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:138)
                  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:263)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]
                  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
                  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
                  at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
                  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
                  at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
                  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
                  at org.apache.hadoop.ipc.Client.call(Client.java:1381)
                  ... 9 more
          

          The connect timeout is 20 seconds, but the client retries 45 times (IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT = 45), so a single call can block for 45 × 20 s = 900 s, i.e. the 15 minutes mentioned in the description.

          Rohith Sharma K S added a comment -

          This is an interesting scenario, but I am not sure why the thread pool size is set to 10 and is not configurable.

          the default RPC timeout is 15 mins...

          I see the RPC timeout is 1 minute; am I missing anything?

          static final int DEFAULT_COMMAND_TIMEOUT = 60000;
          .....
          int expireIntvl = conf.getInt(NM_COMMAND_TIMEOUT, DEFAULT_COMMAND_TIMEOUT);
          proxy = (ContainerManagementProtocolPB) RPC.getProxy(
              ContainerManagementProtocolPB.class, clientVersion, addr, ugi, conf,
              NetUtils.getDefaultSocketFactory(conf), expireIntvl);
          
          Jun Gong added a comment -

          How about setting a larger thread pool size in ApplicationMasterLauncher, or making the size configurable?

            People

            • Assignee: Jun Gong
            • Reporter: Jun Gong
            • Votes: 0
            • Watchers: 17
