Hadoop YARN / YARN-3554

Default value for maximum nodemanager connect wait time is too high

    Details

    • Hadoop Flags: Reviewed

      Description

      The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 ms (15 minutes), which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes.
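      A minimal sketch of how a client could override this value programmatically, assuming the constant names in YarnConfiguration (one of the files this patch touches); the same override can of course be made in yarn-site.xml:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmConnectWaitOverride {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Stop retrying NodeManager connections after 3 minutes instead of
        // the 15-minute default; the value is in milliseconds.
        conf.setLong(YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
                3 * 60 * 1000L);
        System.out.println("nodemanager-connect max-wait-ms = " + conf.getLong(
                YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS, -1));
    }
}
{code}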

      Attachments

      1. YARN-3554.20150429-1.patch
        2 kB
        Naganarasimha G R
      2. YARN-3554-20150429-2.patch
        2 kB
        Naganarasimha G R


          Activity

          sjlee0 Sangjin Lee added a comment -

          The change applied cleanly to branch-2.6 for 2.6.2.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #180 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/180/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2120 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2120/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #922 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/922/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #191 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/191/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #189 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/189/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2137 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2137/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7775 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7775/)
          YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          Naganarasimha Naganarasimha G R added a comment -

          Thanks for reviewing and committing the patch, Jason Lowe, Vinod Kumar Vavilapalli & Li Lu.

          jlowe Jason Lowe added a comment -

          Thanks to Naganarasimha G R for the contribution and to Vinod, Li Lu, and sandflee for additional review! I committed this to trunk, branch-2, and branch-2.7.

          jlowe Jason Lowe added a comment -

          +1, committing this.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Jason Lowe, as 3 mins is fine with Vinod Kumar Vavilapalli, can we have this patch in?

          gtCarrera9 Li Lu added a comment -

          I think the current conclusion on HADOOP-11398 is that we need to make some non-trivial changes in the retry mechanism to make it work accurately. We may want to have some quick fix before that.

          Naganarasimha Naganarasimha G R added a comment -

          If there are plans to get HADOOP-11398 in sooner, then we can directly modify yarn.client.nodemanager-connect.max-wait-ms to 3 mins; if not, I feel it would be better to modify it to 1 min and drop a comment in HADOOP-11398 to change this value back to 3 mins once it's in. Thoughts?

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I am good with either. I think the more important fix is HADOOP-11398 to be able to configure things in a predictable manner.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Vinod Kumar Vavilapalli & Jason Lowe,
          So would configuring yarn.client.nodemanager-connect.max-wait-ms as 1 min be better?

          vinodkv Vinod Kumar Vavilapalli added a comment -

          bq. Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals.

          For our users, we explicitly set yarn.client.nodemanager-connect.max-wait-ms to 60,000 (one minute). As HADOOP-11398 is still not in, this ends up becoming a 6-minute timeout (assuming each of the underlying RPC retries takes 1 sec × 50 attempts to finish (50 secs), plus a 10-second retry interval, giving 1 min per retry and 6 retries overall).
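          A back-of-the-envelope sketch of that arithmetic (illustrative only; the 50 × 1 sec inner IPC retries and the 10-second interval are the assumptions stated above, not values read from the code):

{code:java}
public class EffectiveNmTimeout {
    public static void main(String[] args) {
        long maxWaitMs = 60_000L;       // yarn.client.nodemanager-connect.max-wait-ms
        long retryIntervalMs = 10_000L; // yarn.client.nodemanager-connect.retry-interval-ms
        // Proxy-level retries derived from the two settings: 60s / 10s = 6.
        long proxyRetries = maxWaitMs / retryIntervalMs;
        // Each proxy-level attempt first exhausts the low-level IPC retries:
        // ~50 connect attempts at ~1 second each, i.e. ~50 seconds.
        long ipcMsPerAttempt = 50 * 1_000L;
        long effectiveMs = proxyRetries * (ipcMsPerAttempt + retryIntervalMs);
        System.out.println("effective timeout ~ "
                + effectiveMs / 60_000L + " minutes"); // ~6 minutes, not 1
    }
}
{code}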

          vinodkv Vinod Kumar Vavilapalli added a comment -

          HADOOP-11398 and YARN-3238 are relevant in that they caused AM->NM communication to take a long time to time out.

          jlowe Jason Lowe added a comment -

          Ah, thanks Naganarasimha G R, sorry I missed that. We can continue discussing the proper RM connect wait time over at YARN-3518, as obviously I cannot keep them straight here.

          Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Jason Lowe,
          earlier my query about the ideal time and sandflee's comment are related to "yarn.resourcemanager.connect.max-wait.ms", and as Li Lu mentioned it's just for discussion purposes.

          jlowe Jason Lowe added a comment -

          YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix these two.

          bq. Set this to a bigger value, maybe based on network partition considerations, not only for NM restart.

          What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to take longer than 3 minutes but less than 10 minutes? If so we should tune the value for that. If not then making the value larger just slows recovery time.

          bq. 3 mins seems dangerous. If the RM fails over and the recovery takes several mins, the NM may kill all containers; in a production env, that's not expected.

          This JIRA is configuring the amount of time NM clients (i.e.: primarily ApplicationMasters and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing all containers differently based on an updated value for it. The only case where the RM will use this property is when connecting to NMs to launch AM containers, and it will only do so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node?
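          (For reference, the 10-minute allocation expiry referred to above presumably corresponds to yarn.resourcemanager.rm.container-allocation.expiry-interval-ms in yarn-default.xml, whose default is 600000 ms.)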

          gtCarrera9 Li Lu added a comment -

          Hi Naganarasimha G R, I just wanted to bring that JIRA to your attention. We may want to share some discussions between the two JIRAs.

          sandflee sandflee added a comment -

          Hi Naganarasimha G R, 3 mins seems dangerous. If the RM fails over and the recovery takes several mins, the NM may kill all containers; in a production env, that's not expected.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Li Lu,
          Thanks for commenting on this jira, but I did not get the intention completely: are you expecting me to merge the changes required for YARN-3518 here?
          If so, I had a few questions:
          1. YARN-3518 tries to modify the default value of yarn.resourcemanager.connect.max-wait.ms from 900000 to 600000, which impacts not only the AM->RM timeout but also NM->RM and client (CLI, web, application report, etc.)->RM. Is that ok? (I am ok with it but just wanted to point it out.)
          2. Given the current high availability, is waiting 10 mins to detect that the RM has failed still valid, or shall I decrease that too to 3 mins?

          If you confirm, I can merge the changes of YARN-3518 and also update yarn-default.xml, which is missing in YARN-3518.

          sandflee sandflee added a comment -

          Set this to a bigger value, maybe based on network partition considerations, not only for NM restart.

          gtCarrera9 Li Lu added a comment -

          This is just a quick note that there's another JIRA, YARN-3518, that fixes a similar problem for the RM. We may want to consider both of them.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          ||Vote||Subsystem||Runtime||Comment||
          |0|pre-patch|14m 34s|Pre-patch trunk compilation is healthy.|
          |+1|@author|0m 0s|The patch does not contain any @author tags.|
          |-1|tests included|0m 0s|The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.|
          |+1|whitespace|0m 0s|The patch has no lines that end in whitespace.|
          |+1|javac|7m 30s|There were no new javac warning messages.|
          |+1|javadoc|9m 38s|There were no new javadoc warning messages.|
          |+1|release audit|0m 22s|The applied patch does not increase the total number of release audit warnings.|
          |+1|checkstyle|7m 35s|There were no new checkstyle issues.|
          |+1|install|1m 33s|mvn install still works.|
          |+1|eclipse:eclipse|0m 33s|The patch built with eclipse:eclipse.|
          |+1|findbugs|2m 46s|The patch does not introduce any new Findbugs (version 2.0.3) warnings.|
          |+1|yarn tests|0m 30s|Tests passed in hadoop-yarn-api.|
          |+1|yarn tests|1m 55s|Tests passed in hadoop-yarn-common.|
          | | |46m 59s| |



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12729227/YARN-3554-20150429-2.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 8f82970
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/7543/testReport/
          Java 1.7.0_55
          uname Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/7543/console

          This message was automatically generated.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Jason Lowe, updating with 3 minutes as the timeout.

          jlowe Jason Lowe added a comment -

          I suggest we go with 3 minutes. The retry interval is 10 seconds, so we'll get plenty of retries in that time if the failure is fast (e.g.: unknown host, connection refused) and still get a few retries in if the failure is slow (e.g.: connection timeout).
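          As a worked example (the ~20-second connect timeout for the slow case is an assumption, not a value from the code): with fast failures the client gets roughly 180 s / 10 s = 18 attempts within the 3-minute window, and with each attempt blocking on a connect timeout it still gets about 180 s / (20 s + 10 s) = 6 attempts.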

          Naganarasimha Naganarasimha G R added a comment -

          Agree with you Jason Lowe, but what do you feel the ideal timeout should be, 3 mins or 5 mins? Maybe, as you would have better experience with a large number of nodes and see frequent NM failures, you can suggest a better value here.

          jlowe Jason Lowe added a comment -

          I think 10 minutes is still too high. We didn't even have this functionality until 2.6 because of rolling upgrades, and NMs don't take that long to recover in a rolling upgrade. They recover in tens of seconds rather than tens of minutes. Therefore I don't think it makes much sense to spend a lot of time trying to connect to an NM beyond a few minutes. The chances of successfully connecting after a few minutes of trying are going to be very low, and NMs fail all the time anyway. So if we spend all that extra time trying for essentially no benefit, all we've done is prolong the application recovery time for no good reason.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          ||Vote||Subsystem||Runtime||Comment||
          |0|pre-patch|15m 14s|Pre-patch trunk compilation is healthy.|
          |+1|@author|0m 0s|The patch does not contain any @author tags.|
          |-1|tests included|0m 0s|The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.|
          |+1|whitespace|0m 0s|The patch has no lines that end in whitespace.|
          |+1|javac|7m 48s|There were no new javac warning messages.|
          |+1|javadoc|9m 59s|There were no new javadoc warning messages.|
          |+1|release audit|0m 23s|The applied patch does not increase the total number of release audit warnings.|
          |+1|checkstyle|5m 32s|There were no new checkstyle issues.|
          |+1|install|1m 34s|mvn install still works.|
          |+1|eclipse:eclipse|0m 32s|The patch built with eclipse:eclipse.|
          |+1|findbugs|2m 48s|The patch does not introduce any new Findbugs (version 2.0.3) warnings.|
          |+1|yarn tests|0m 21s|Tests passed in hadoop-yarn-api.|
          |+1|yarn tests|1m 58s|Tests passed in hadoop-yarn-common.|
          | | |46m 11s| |



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12728970/YARN-3554.20150429-1.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / c79e7f7
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/7531/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/7531/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/7531/testReport/
          Java 1.7.0_55
          uname Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/7531/console

          This message was automatically generated.

          Naganarasimha Naganarasimha G R added a comment -

          Hi Jason Lowe, as it was a simple fix I have assigned it to myself; hope you were not already working on it.
          I think the earlier intention was to sync with yarn.resourcemanager.connect.max-wait.ms, which is also 900000, but I feel it should be 10 mins as per the description given by you. Attaching a patch for the same.


            People

            • Assignee: Naganarasimha G R
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 11
