Hadoop YARN / YARN-3842

NMProxy should retry on NMNotYetReadyException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.8.0, 2.7.1, 2.6.4, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Consider the following scenario:
      1. RM assigns a container on node N to an app A.
      2. Node N is restarted.
      3. A tries to launch the container on node N.

      Step 3 can lead to an NMNotYetReadyException, depending on whether NM N has registered with the RM yet. In MR, this is counted as a task attempt failure, and a few of these can lead to a task/job failure.
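      For orientation, a minimal sketch of the shape of the eventual fix, assuming it extends the exception-to-policy map in ServerProxy.createRetryPolicy (the file the commit touches, per the Hudson comments below); the retry bound, interval, and class layout are illustrative, not the committed code:

          import java.net.ConnectException;
          import java.util.HashMap;
          import java.util.Map;
          import java.util.concurrent.TimeUnit;

          import org.apache.hadoop.io.retry.RetryPolicies;
          import org.apache.hadoop.io.retry.RetryPolicy;
          import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;

          public class NMProxyRetrySketch {
            static RetryPolicy createRetryPolicy(long maxWaitMs, long retryIntervalMs) {
              // Keep retrying with a fixed sleep until maxWaitMs elapses.
              RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
                  maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);

              Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
                  new HashMap<Class<? extends Exception>, RetryPolicy>();
              // Connection-level failures were already retried before this fix.
              exceptionToPolicyMap.put(ConnectException.class, retryPolicy);
              // The fix: also retry the NM's application-level "not yet ready" refusal.
              exceptionToPolicyMap.put(NMNotYetReadyException.class, retryPolicy);

              // Exceptions not in the map are thrown to the caller immediately.
              return RetryPolicies.retryByException(
                  RetryPolicies.TRY_ONCE_THEN_FAIL, exceptionToPolicyMap);
            }
          }

      With such a mapping in place, the RetryInvocationHandler visible in the stack trace further down retries startContainers instead of surfacing the exception to the AM as a task attempt failure.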

      Attachments

      1. YARN-3842.002.patch (4 kB, Robert Kanter)
      2. YARN-3842.001.patch (5 kB, Robert Kanter)
      3. MAPREDUCE-6409.002.patch (13 kB, Robert Kanter)
      4. MAPREDUCE-6409.001.patch (13 kB, Robert Kanter)


          Activity

          jlowe Jason Lowe added a comment -

          I committed this to branch-2.6.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Running into this in a couple of places, we should get this into 2.6.3.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/967/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8050 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8050/)
          YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
          • hadoop-yarn-project/CHANGES.txt
          kasha Karthik Kambatla added a comment -

          Thanks everyone for your inputs on this, and Robert for your patch.

          Just committed this to trunk, branch-2, and branch-2.7.

          hadoopqa Hadoop QA added a comment -



          +1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 17m 16s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 38s There were no new javac warning messages.
          +1 javadoc 9m 37s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 1m 28s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 35s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 2m 47s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 1m 56s Tests passed in hadoop-yarn-common.
          +1 yarn tests 6m 27s Tests passed in hadoop-yarn-server-nodemanager.
              Total 49m 43s



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12741154/YARN-3842.002.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 077250d
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/8317/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8317/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8317/testReport/
          Java 1.7.0_55
          uname Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8317/console

          This message was automatically generated.

          kasha Karthik Kambatla added a comment -

          +1, pending Jenkins.

          Thanks for your review, Jian He. I'll go ahead and commit this if Jenkins is fine with it.

          jianhe Jian He added a comment -

          I think the latest patch is safe for 2.7.1, +1

          rkanter Robert Kanter added a comment -

          The new patch makes the changes Karthik suggested. I also added a few comments and renamed isExpectingNMNotYetReadyException to shouldThrowNMNotYetReadyException for clarity.

          rkanter Robert Kanter added a comment -

          I had sort of just split startContainers into two sections (one for each part of the test), but this is a lot more concise. I'll do that.

          kasha Karthik Kambatla added a comment -

          Thanks for the quick turnaround on this, Robert.

          One nit-pick on the test: would the following be more concise?

                  if (retryCount < 5) {
                    retryCount++;
                    if (isExpectingNMNotYetReadyException) {
                      containerManager.setBlockNewContainerRequests(true);
                    } else {
                      throw new java.net.ConnectException("start container exception");
                    }
                  } else {
                    containerManager.setBlockNewContainerRequests(false);
                  }
                  return super.startContainers(requests);
          
          hadoopqa Hadoop QA added a comment -



          +1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 17m 20s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 46s There were no new javac warning messages.
          +1 javadoc 9m 42s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 1m 30s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 2m 47s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 1m 58s Tests passed in hadoop-yarn-common.
          +1 yarn tests 6m 5s Tests passed in hadoop-yarn-server-nodemanager.
              Total 49m 40s



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12741131/YARN-3842.001.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 445b132
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/8314/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8314/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8314/testReport/
          Java 1.7.0_55
          uname Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8314/console

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          That makes sense. The patch is also a lot simpler; it just adds a retry policy for NMNotYetReadyException, and a test.

          kasha Karthik Kambatla added a comment -

          My bad. I misinterpreted Vinod's suggestion to catch only NMNotYetReadyException; don't ask me how. I somehow thought he was against retries.

          I fully agree with retrying on NMNotYetReadyException alone.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 pre-patch 16m 1s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 43s There were no new javac warning messages.
          +1 javadoc 9m 42s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 36s The applied patch generated 4 new checkstyle issues (total was 261, now 264).
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 34s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 7s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 mapreduce tests 9m 7s Tests passed in hadoop-mapreduce-client-app.
              Total 46m 53s



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12741082/MAPREDUCE-6409.002.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 445b132
          Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5826/artifact/patchprocess/trunkFindbugsWarningshadoop-mapreduce-client-app.html
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5826/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
          hadoop-mapreduce-client-app test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5826/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5826/testReport/
          Java 1.7.0_55
          uname Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5826/console

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          I think it's a little harsh to treat NMNotYetReadyException as a failure to launch without any retries. We don't do this for connection refused or socket connection timeout, yet this is effectively an application-level connection refusal. I agree with what Vinod mentioned earlier: we should simply retry the exception. Can we have NMProxy set up the proxy to retry NMNotYetReadyException? In most cases the retries will eventually succeed before it times out, and that's preferable to throwing away the container and needing to allocate a new one.

          rkanter Robert Kanter added a comment -

          The new patch renames the enum to FAILED_BY_YARN. The checkstyle warnings are all for lines in the state machine transitions, which currently match the rest of the lines, so I don't want to fix those.

          kasha Karthik Kambatla added a comment -

          How about we remove NMNotYetReadyException in trunk and branch-2 to minimize any risk, and do MR-only changes for 2.7.1? Filed YARN-3839 to handle that.

          Robert Kanter - thanks for picking this up. The patch looks mostly good to me. One nit: what do you think of suffixing the event-type FAILED_BY_YARN instead of FAILED_DUE_TO_YARN?

          Also, what do you think of catching YarnException instead of NMNotYetReadyException? If there is an exception that shouldn't be caught, it shouldn't be a YarnException? We can fix this as part of YARN-3839. /cc Vinod Kumar Vavilapalli, Jason Lowe, Jian He

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 pre-patch 15m 51s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 45s There were no new javac warning messages.
          +1 javadoc 9m 53s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 37s The applied patch generated 4 new checkstyle issues (total was 261, now 264).
          +1 whitespace 0m 1s The patch has no lines that end in whitespace.
          +1 install 1m 34s mvn install still works.
          +1 eclipse:eclipse 0m 35s The patch built with eclipse:eclipse.
          +1 findbugs 1m 7s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 mapreduce tests 9m 12s Tests passed in hadoop-mapreduce-client-app.
              Total 47m 0s



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12740777/MAPREDUCE-6409.001.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 20c03c9
          Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5816/artifact/patchprocess/trunkFindbugsWarningshadoop-mapreduce-client-app.html
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5816/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
          hadoop-mapreduce-client-app test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5816/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5816/testReport/
          Java 1.7.0_55
          uname Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5816/console

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          I moved this to MAPREDUCE because I'm doing the 3rd suggestion that Karthik mentioned where MR handles this type of failure differently, and doesn't count it against the retries.

          rkanter Robert Kanter added a comment -

          Karthik Kambatla said I can take this over.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server.

          I didn't realize this; thanks for pointing it out.

          Had an offline discussion with Jian He and couldn't come up with a case where not blocking the calls would be a problem. In all the cases, whether the calls are blocked or not, they will eventually be rejected with an invalid-token error or a container-given-by-old-RM error. Even if the calls are not blocked, the same errors happen right away.

          I am +1 now for not throwing this exception from the NM side. But given that it is part of the contract, I don't think we should remove the class, just in case.

          jianhe Jian He added a comment -

          this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

          For RM work-preserving restart, this is not a problem as the NM remains as-is.
          For NM restart with no recovery, all outstanding containers allocated on this node are killed anyway.
          For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server.

              if (delayedRpcServerStart) {
                waitForRecoveredContainers();
                server.start();
              }

          Overall, I think it's fine to add a client retry fix in 2.7.1, but long term I'd like to revisit this; maybe I'm still missing something.

          jlowe Jason Lowe added a comment -

          this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

          Yes, but that's a limitation in the RPC layer. If we could bind the server before we start it then we could know the port, register with the RM, then start the server. IMHO the RPC layer should support this, but I understand we'll have to work around the lack of that in the interim. I think we all can agree the retry exception is just a hack being used because we can't keep the client service from serving too soon.
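          To make the desired ordering concrete, a hypothetical sketch; the bind()/start() split shown here is exactly what the RPC layer lacked at the time, and registerWithRM is a made-up helper:

              // Hypothetical NM startup ordering, if the RPC layer allowed it:
              server.bind();                                    // reserve the port, serve nothing yet
              int port = server.getListenerAddress().getPort(); // port is now known
              registerWithRM(port);                             // report the port during registration
              server.start();                                   // only now accept client calls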

          vinodkv Vinod Kumar Vavilapalli added a comment -

          We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

          Karthik Kambatla, we could. Let's file a separate JIRA?

          we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice

          Jason Lowe, this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

          2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid.

          3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid.

          Jian He, these two errors will be much harder for apps to process and react to than the current named exception.

          Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors. Overall, I am in favor of keeping the named exception with clients explicitly retrying.

          jlowe Jason Lowe added a comment -

          I agree with Jian that we probably don't need the not ready exception. I was never a fan of it in the first place, as IMHO we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice. As Jian points out, I think the NMToken will cover the cases where someone is trying to launch something they shouldn't be launching, so I don't think we need to wait for the RM registration.

          kasha Karthik Kambatla added a comment -

          We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

          On the client side (MR-AM in this case), we should probably consider any YarnException as a system error and count it against KILLED?

          jianhe Jian He added a comment -

          I'm actually wondering whether we still need the NMNotYetReadyException. It is currently thrown when the NM has started the service but not yet registered/re-registered with the RM. It may be OK to just launch the container.

          1. For work-preserving NM restart (the scenario in this JIRA), I think it's OK to just launch the container instead of throwing the exception.
          2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid.
          3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid.
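          For context, a hedged sketch of the NM-side guard under discussion; the shape is inferred from the diagnostic message in the stack trace posted below and the setBlockNewContainerRequests() calls in the test snippet above, not copied verbatim from ContainerManagerImpl:

              // Sketch: new startContainers calls are rejected while the flag is set,
              // i.e. until the NM has (re-)registered with the RM.
              private final AtomicBoolean blockNewContainerRequests = new AtomicBoolean(true);

              public void setBlockNewContainerRequests(boolean block) {
                blockNewContainerRequests.set(block);
              }

              public StartContainersResponse startContainers(StartContainersRequest requests)
                  throws YarnException, IOException {
                if (blockNewContainerRequests.get()) {
                  throw new NMNotYetReadyException(
                      "Rejecting new containers as NodeManager has not yet connected"
                          + " with ResourceManager");
                }
                // ... normal container launch path ...
              }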

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I kind of agree, but this is a remote exception for the client (MR-AM in this case). What is the best way to handle remote exceptions?

          The client should already be unwrapping and throwing the right exception locally. The diagnostic message you posted also seems to point to the same.

          kasha Karthik Kambatla added a comment -

          By the way, here is the stack trace:

          2015-06-16 17:31:36,663 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1434500031312_0008_m_000035_0: Container launch failed for container_e04_1434500031312_0008_01_000037 : org.apache.hadoop.yarn.exceptions.NMNotYetReadyException: Rejecting new containers as NodeManager has not yet connected with ResourceManager
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
          	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
          	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:415)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
          
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
          	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
          	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
          	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)
          	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:606)
          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
          	at com.sun.proxy.$Proxy40.startContainers(Unknown Source)
          	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:151)
          	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.NMNotYetReadyException): Rejecting new containers as NodeManager has not yet connected with ResourceManager
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
          	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
          	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:415)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
          
          	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
          	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
          	at com.sun.proxy.$Proxy39.startContainers(Unknown Source)
          	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
          
          kasha Karthik Kambatla added a comment -

          This wasn't as big an issue without work-preserving RM restart, as the AM itself would be restarted and the window of opportunity for it to try launching containers was fairly small.

          the right solution is for clients to retry NMNotYetReadyException

          I kind of agree, but this is a remote exception for the client (MR-AM in this case). What is the best way to handle remote exceptions?

          vinodkv Vinod Kumar Vavilapalli added a comment -

          This is a long-standing issue - we added the exception in YARN-562.

          I think that instead of blanket retries (solution #1) above, the right solution is for clients to retry NMNotYetReadyException. We can do that in NMClient library for java clients? /cc Jian He

          kasha Karthik Kambatla added a comment -

          The issue is with counting container-launch-failures against the 4 task failures. We could potentially go about this in different ways:

          1. Support retries when launching containers. Start/stop containers are @AtMostOnce operations. This works okay for NM restart cases. When an NM goes down, this will lead to the job waiting longer before trying another node.
          2. On failure to launch container, return an error code that explicitly annotates it as a system error and not a user error. The AMs could choose to not count system errors against number of task attempt failures.
          3. Without any changes in Yarn, MR should identify exceptions thrown by startContainers() differently from failures captured in StartContainersResponse#getFailedRequests. That is, NMNotYetReadyException and IOException would not be counted against the number of allowed failures (sketched below).

          Option 2 seems like a cleaner approach to me.
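          For illustration, option 3 on the client side might look roughly like this in the MR AM's container launcher; taskFailed and rescheduleWithoutPenalty are hypothetical hooks, not real MR methods:

              try {
                StartContainersResponse response = proxy.startContainers(request);
                SerializedException err = response.getFailedRequests().get(containerId);
                if (err != null) {
                  // Launch failure reported by the NM: charge it to the task attempt.
                  taskFailed(err.getMessage());
                }
              } catch (NMNotYetReadyException | IOException e) {
                // System-side error: retry or reschedule without counting it
                // against the allowed task attempt failures.
                rescheduleWithoutPenalty(e);
              }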

          kasha Karthik Kambatla added a comment -

          We ran into this in our rolling upgrade tests.


            People

            • Assignee: rkanter Robert Kanter
            • Reporter: kasha Karthik Kambatla
            • Votes: 0
            • Watchers: 12
