Hadoop YARN / YARN-3238

Connection timeouts to nodemanagers are retried at multiple levels

      Description

      The IPC layer retries connection timeouts automatically (see Client.java), but we also retry them through the YARN RetryPolicy that is put in place when the NM proxy is created. The result is a two-level retry mechanism: for each error retried by the YARN RetryPolicy, the IPC layer has already retried many times (45 by default). As a consequence, NM clients can wait a very long time before the connection finally fails.
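To make the multiplication concrete, here is a rough worst-case estimate of the wait under nested retries. It is only an illustrative sketch: the 45 inner attempts come from the description above, while the 20-second connect timeout, the 3 outer attempts, the 10-second outer sleep, and the class and method names are assumptions chosen for the example, not values read from the Hadoop configuration.

```java
import java.time.Duration;

// Illustrative arithmetic only; not Hadoop code.
final class TwoLevelRetryEstimate {

    /**
     * Worst-case wait when every outer attempt triggers a full round of
     * inner attempts and each inner attempt runs into the connect timeout.
     */
    static Duration worstCase(int outerAttempts, int innerAttempts,
                              Duration connectTimeout, Duration outerSleep) {
        // Time spent inside the inner (IPC-like) layer for one outer attempt.
        Duration perOuterAttempt = connectTimeout.multipliedBy(innerAttempts);
        // The outer layer sleeps between attempts, so one sleep fewer than attempts.
        Duration sleeps = outerSleep.multipliedBy(Math.max(0, outerAttempts - 1));
        return perOuterAttempt.multipliedBy(outerAttempts).plus(sleeps);
    }

    public static void main(String[] args) {
        Duration connectTimeout = Duration.ofSeconds(20); // assumed value
        Duration outerSleep = Duration.ofSeconds(10);     // assumed value

        // One level of retries: 45 * 20 s = 900 seconds.
        System.out.println(worstCase(1, 45, connectTimeout, outerSleep).getSeconds());
        // Two levels with 3 outer attempts: 3 * 900 s + 2 * 10 s = 2720 seconds.
        System.out.println(worstCase(3, 45, connectTimeout, outerSleep).getSeconds());
    }
}
```

The point of the sketch is that the outer attempt count multiplies the inner budget, so even a small outer retry count stretches the total wait considerably.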

          Activity

          jlowe Jason Lowe added a comment -

          Since the IPC layer is already retrying, it doesn't make sense to also retry at the YARN layer. Attaching a patch that removes socket connection timeouts from the list of errors we retry at the YARN layer. An alternative approach would be to retry at the YARN layer but explicitly tell the IPC layer not to retry socket timeouts when creating the proxy. This change seemed simpler, and it is what we had been doing all along before YARN-2613.
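The idea of dropping connection timeouts from the outer layer's retry list can be sketched with plain JDK exception types. This is not the ServerProxy code from the patch: the class name, the method name, and the particular exceptions that remain retriable are assumptions for illustration, and java.net.SocketTimeoutException stands in for whatever timeout exception the IPC layer surfaces.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

// Hypothetical outer-layer filter; names and exception choices are illustrative.
final class OuterRetryFilter {

    /** Decides whether the outer (YARN-like) layer should retry a failure. */
    static boolean shouldRetry(IOException e) {
        // Timeouts were already retried by the inner layer, so fail fast here.
        if (e instanceof SocketTimeoutException) {
            return false;
        }
        // Assumed set of failures that are still worth retrying at this level.
        return e instanceof ConnectException
            || e instanceof NoRouteToHostException;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new ConnectException("connection refused")));
        System.out.println(shouldRetry(new SocketTimeoutException("connect timed out")));
    }
}
```

Keeping the decision in one place like this means only one layer owns the timeout budget, which is the property the patch is after.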

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12699964/YARN-3238.001.patch
          against trunk revision f56c65b.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6687//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6687//console

          This message is automatically generated.

          mitdesai Mit Desai added a comment -

          +1 (non binding)
          Looks good to me

          xgong Xuan Gong added a comment -

          +1 LGTM. Will commit

          xgong Xuan Gong added a comment -

          Committed to trunk/branch-2. Thanks, Jason!

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7175 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7175/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/CHANGES.txt
          jianhe Jian He added a comment -

          I think this is related to the RetryPolicy library we use from the common module.
          The implementation of RetryPolicies.retryUpToMaximumTimeWithFixedSleep doesn't match its semantics: it should retry based on the overall time taken rather than on the number of retries. HADOOP-11398 is trying to fix this.
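The difference between the two semantics can be illustrated without any Hadoop classes. The sketch below assumes the count-based variant derives its retry limit by dividing the time budget by the sleep interval; that derivation, the class name, and the numbers used are assumptions for illustration, not a copy of the RetryPolicies implementation.

```java
// Illustrative comparison of count-based and elapsed-time-based retry limits.
final class RetryBudget {

    /** Count-based: the time budget is converted once into a fixed retry count. */
    static boolean retryByCount(int retriesSoFar, long maxTimeMs, long sleepMs) {
        return retriesSoFar < maxTimeMs / sleepMs;
    }

    /** Time-based: the decision depends on how much wall-clock time has passed. */
    static boolean retryByElapsed(long elapsedMs, long maxTimeMs) {
        return elapsedMs < maxTimeMs;
    }

    public static void main(String[] args) {
        long maxTimeMs = 60_000;      // assumed overall budget
        long sleepMs = 10_000;        // assumed sleep between retries
        long attemptCostMs = 900_000; // assumed time one failing attempt takes

        int retriesSoFar = 2;
        long elapsedMs = retriesSoFar * (attemptCostMs + sleepMs);

        // The count-based check ignores how long each attempt took.
        System.out.println(retryByCount(retriesSoFar, maxTimeMs, sleepMs));
        // The time-based check stops once the budget is spent.
        System.out.println(retryByElapsed(elapsedMs, maxTimeMs));
    }
}
```

When individual attempts are slow, as they are when an inner layer retries on its own, the count-based check keeps retrying long after the nominal time budget has been exceeded.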

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/846/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/)
          YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          sjlee0 Sangjin Lee added a comment -

          The patch applies to 2.6.0 cleanly.

          tsuna Benoit Sigoure added a comment -

          What's the setting to tune down to avoid the 45min timeout? I'd like the code to fail fast.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.6.1. Ran compilation before the push. Patch applied cleanly.


            People

            • Assignee:
              jlowe Jason Lowe
              Reporter:
              jlowe Jason Lowe
            • Votes:
              0
              Watchers:
              15
