  1. Hadoop YARN
  2. YARN-5103

With NM recovery enabled, restarting NM multiple times results in AM restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: yarn
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      AM is restarted when NM is restarted multiple times even though NM recovery is enabled.

      Log from the NM on which AM attempt 1 was running:
       ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e12_1463043063682_0002_01_000001
      java.io.IOException: java.lang.InterruptedException
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
      	at org.apache.hadoop.util.Shell.run(Shell.java:487)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:478)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerProcessAlive(LinuxContainerExecutor.java:542)
      	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:185)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:445)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      1. YARN-5103.patch
        2 kB
        Junping Du
      2. YARN-5103-demo.patch
        2 kB
        Junping Du
      3. YARN-5103-v2.patch
        2 kB
        Junping Du


          Activity

          djp Junping Du added a comment -

          This happens because Shell.runCommand() catches InterruptedException and throws an IOException instead:

            /** Run a command */
            private void runCommand() throws IOException { 
          ...
              } catch (InterruptedException ie) {
                throw new IOException(ie.toString());
              } 
          ...
          

          RecoveredContainerLaunch should therefore also check for IOException.
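
          A minimal, self-contained sketch of the effect described above. It is not Hadoop code: the class `WrappedInterruptDemo` and its methods are hypothetical stand-ins that only copy the quoted catch block, to show that once the interrupt is rethrown as a plain IOException the caller can no longer tell it apart from any other I/O failure.

```java
import java.io.IOException;

/** Hypothetical stand-in for the wrapping quoted above; not the real Shell class. */
final class WrappedInterruptDemo {

  /** Copies the quoted catch block: the interrupt is rethrown as a plain IOException. */
  static void runCommand() throws IOException {
    try {
      // Simulate the thread being interrupted while waiting on a child process.
      throw new InterruptedException();
    } catch (InterruptedException ie) {
      throw new IOException(ie.toString());
    }
  }

  /** A caller that sees only IOException has to treat the interrupt as an error. */
  static String recover() {
    try {
      runCommand();
      return "recovered";
    } catch (IOException e) {
      // The original exception type is gone; only its name survives in the message.
      return "error: " + e;
    }
  }

  public static void main(String[] args) {
    // Prints: error: java.io.IOException: java.lang.InterruptedException
    System.out.println(recover());
  }
}
```

          The printed message has the same shape as the first line of the stack trace in the description (java.io.IOException: java.lang.InterruptedException).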

          djp Junping Du added a comment -

          Posting a quick patch first. It does not include a unit test yet; I will add one in the next patch.

          djp Junping Du added a comment -

          It looks difficult to add a unit test for this case: many objects would need to be mocked, and RecoveredContainerLaunch's internal logic checks the pid path, which is not easy to mock (we could change the logic there, but that would make the code look very tricky).
          I updated the patch slightly, given that the interrupted exception is now wrapped as InterruptedIOException since HADOOP-12074.
          Jason Lowe, would you help review it? Thanks!
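
          A rough sketch of why wrapping as InterruptedIOException helps. How HADOOP-12074 implements the wrapping is an assumption here; the class `InterruptedIoWrapDemo` and its methods are hypothetical. The only facts relied on are that java.io.InterruptedIOException is a subclass of IOException and can therefore be caught separately from other I/O failures.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical illustration; the wrapping style is assumed, not taken from Hadoop. */
final class InterruptedIoWrapDemo {

  /** Wraps the interrupt in an IOException subclass so its meaning is preserved. */
  static void runCommand() throws IOException {
    try {
      // Simulate the thread being interrupted while waiting on a child process.
      throw new InterruptedException("simulated interrupt");
    } catch (InterruptedException ie) {
      InterruptedIOException iioe = new InterruptedIOException(ie.toString());
      iioe.initCause(ie); // keep the original exception for diagnostics
      throw iioe;
    }
  }

  /** The caller can now distinguish an interrupt from a genuine I/O failure by type. */
  static boolean wasInterrupted() {
    try {
      runCommand();
      return false;
    } catch (InterruptedIOException e) {
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(wasInterrupted()); // prints: true
  }
}
```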

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 20s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1    mvninstall  5m 59s   trunk passed
          +1    compile     0m 23s   trunk passed
          +1    checkstyle  0m 15s   trunk passed
          +1    mvnsite     0m 25s   trunk passed
          +1    mvneclipse  0m 11s   trunk passed
          +1    findbugs    0m 37s   trunk passed
          +1    javadoc     0m 18s   trunk passed
          +1    mvninstall  0m 20s   the patch passed
          +1    compile     0m 21s   the patch passed
          +1    javac       0m 21s   the patch passed
          +1    checkstyle  0m 13s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 0 new + 5 unchanged - 1 fixed = 5 total (was 6)
          +1    mvnsite     0m 22s   the patch passed
          +1    mvneclipse  0m 8s    the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    findbugs    0m 43s   the patch passed
          +1    javadoc     0m 15s   the patch passed
          +1    unit        11m 19s  hadoop-yarn-server-nodemanager in the patch passed.
          +1    asflicense  0m 14s   Patch does not generate ASF License warnings.
                            23m 0s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12805233/YARN-5103.patch
          JIRA Issue YARN-5103
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux c6172e5809a3 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 757050f
          Default Java 1.8.0_91
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11588/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/11588/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for the patch! I'm OK skipping the unit test for this case.

          Rather than catching IOException and explicitly checking the instance we should let the normal catch processing do it for us, e.g.:

              } catch (InterruptedException | InterruptedIOException e) {
                 LOG.warn("Interrupted while waiting for exit code from " + containerId);
                 notInterrupted = false;
              } catch (IOException e) {
                 LOG.error("Unable to recover container " + containerIdStr, e);
              }
          

          I noticed this is targeted to 2.9, but I would think this should go into at least 2.8 as well?
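
          The catch ordering suggested above can be tried in isolation. The following is a sketch only: `CatchOrderDemo`, the `Reacquire` interface and the returned strings are hypothetical and do not come from RecoveredContainerLaunch. It shows that a multi-catch on InterruptedException and InterruptedIOException, placed before the general IOException clause, routes both kinds of interrupt to the same branch without any instanceof check.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical demo of the catch ordering; not the actual NodeManager code. */
final class CatchOrderDemo {

  /** Stand-in for an operation that may fail with either checked exception. */
  @FunctionalInterface
  interface Reacquire {
    int run() throws IOException, InterruptedException;
  }

  static String recover(Reacquire task) {
    try {
      return "exit code " + task.run();
    } catch (InterruptedException | InterruptedIOException e) {
      // The more specific IOException subclass must be listed before IOException.
      return "interrupted";
    } catch (IOException e) {
      return "unable to recover";
    }
  }

  public static void main(String[] args) {
    System.out.println(recover(() -> 0));
    System.out.println(recover(() -> { throw new InterruptedException(); }));
    System.out.println(recover(() -> { throw new InterruptedIOException(); }));
    System.out.println(recover(() -> { throw new IOException("disk error"); }));
  }
}
```

          In this sketch only the last call reaches the general IOException branch; both interrupt variants take the first one.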

          djp Junping Du added a comment -

          Thanks Jason Lowe for the review and comments! The v2 patch incorporates your comments above.
          As for the target branch, I agree it would be better to commit this to branch-2.8 as well; I have just updated it.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      11m 40s  Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1    mvninstall  5m 55s   trunk passed
          +1    compile     0m 23s   trunk passed
          +1    checkstyle  0m 15s   trunk passed
          +1    mvnsite     0m 25s   trunk passed
          +1    mvneclipse  0m 11s   trunk passed
          +1    findbugs    0m 39s   trunk passed
          +1    javadoc     0m 17s   trunk passed
          +1    mvninstall  0m 20s   the patch passed
          +1    compile     0m 21s   the patch passed
          +1    javac       0m 21s   the patch passed
          -1    checkstyle  0m 13s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 1 new + 6 unchanged - 1 fixed = 7 total (was 7)
          +1    mvnsite     0m 22s   the patch passed
          +1    mvneclipse  0m 9s    the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    findbugs    0m 42s   the patch passed
          +1    javadoc     0m 15s   the patch passed
          +1    unit        11m 11s  hadoop-yarn-server-nodemanager in the patch passed.
          +1    asflicense  0m 14s   Patch does not generate ASF License warnings.
                            34m 7s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:2c91fd8
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12805557/YARN-5103-v2.patch
          JIRA Issue YARN-5103
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 7fe8cc378812 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6161d9b
          Default Java 1.8.0_91
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/11626/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11626/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/11626/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          +1 latest patch lgtm. I'll fix the checkstyle indentation nit as part of the commit.

          jlowe Jason Lowe added a comment -

          Thanks, Junping! I committed this to trunk, branch-2, and branch-2.8.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #9841 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9841/)
          YARN-5103. With NM recovery enabled, restarting NM multiple times (jlowe: rev d1df0266cf4e9ff0ec70813c156556ca4e74f791)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
          djp Junping Du added a comment -

          Thanks Jason Lowe for review and commit!


            People

            • Assignee:
              djp Junping Du
              Reporter:
              ssathish@hortonworks.com Sumana Sathish
            • Votes:
              0
              Watchers:
              7
