Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-12317

Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      On a debian machine we have seen node manager recovery of containers fail because the signal syntax for process group may not work. We see errors in checking if process is alive during container recovery which causes the container to be declared as LOST (154) on a NodeManager restart.

      The application will fail with error. The attempts are not retried.

      Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
      
      1. YARN-4096.001.patch
        3 kB
        Anubhav Dhoot
      2. YARN-4046.002.patch
        3 kB
        Anubhav Dhoot
      3. YARN-4046.002.patch
        3 kB
        Anubhav Dhoot

        Issue Links

          Activity

          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2239 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2239/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2239 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2239/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #290 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/290/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #290 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/290/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          Hide
          hudson Hudson added a comment -

          ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #282 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/282/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #282 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/282/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          ABORTED: Integrated in Hadoop-Hdfs-trunk #2220 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2220/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          Show
          hudson Hudson added a comment - ABORTED: Integrated in Hadoop-Hdfs-trunk #2220 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2220/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #1023 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1023/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #1023 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1023/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #293 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/293/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #293 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/293/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          chackra Chackaravarthy added a comment -

          yes, this addresses corner cases as well. Thanks for the patch Anubhav Dhoot

          Show
          chackra Chackaravarthy added a comment - yes, this addresses corner cases as well. Thanks for the patch Anubhav Dhoot
          Hide
          adhoot Anubhav Dhoot added a comment -

          Thanks Robert for review and commit!

          Show
          adhoot Anubhav Dhoot added a comment - Thanks Robert for review and commit!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8325 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8325/)
          HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #8325 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8325/ ) HADOOP-12317 . Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
          Hide
          rkanter Robert Kanter added a comment -

          Thanks for the fix Anubhav. Committed to trunk and branch-2!

          Show
          rkanter Robert Kanter added a comment - Thanks for the fix Anubhav. Committed to trunk and branch-2!
          Hide
          stevel@apache.org Steve Loughran added a comment -

          This patch seems to address all the corner cases of the problem —and justify the workings, so I'm +1 for it

          Show
          stevel@apache.org Steve Loughran added a comment - This patch seems to address all the corner cases of the problem —and justify the workings, so I'm +1 for it
          Hide
          rkanter Robert Kanter added a comment -

          +1 LGTM. I think this better fixes the problem than HADOOP-11989 does; it also has a unit test. Thoughts Steve Loughran?

          Show
          rkanter Robert Kanter added a comment - +1 LGTM. I think this better fixes the problem than HADOOP-11989 does; it also has a unit test. Thoughts Steve Loughran ?
          Hide
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 32s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 36s There were no new javac warning messages.
          +1 javadoc 9m 36s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 1m 3s The applied patch generated 2 new checkstyle issues (total was 97, now 98).
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 22s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 51s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 common tests 22m 21s Tests failed in hadoop-common.
              61m 20s  



          Reason Tests
          Failed unit tests hadoop.ha.TestZKFailoverController
            hadoop.net.TestNetUtils



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12750001/YARN-4046.002.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / b73181f
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/diffcheckstylehadoop-common.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/whitespace.txt
          hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/testrun_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 pre-patch 16m 32s Pre-patch trunk compilation is healthy. +1 @author 0m 0s The patch does not contain any @author tags. +1 tests included 0m 0s The patch appears to include 1 new or modified test files. +1 javac 7m 36s There were no new javac warning messages. +1 javadoc 9m 36s There were no new javadoc warning messages. +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings. -1 checkstyle 1m 3s The applied patch generated 2 new checkstyle issues (total was 97, now 98). -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. +1 install 1m 22s mvn install still works. +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse. +1 findbugs 1m 51s The patch does not introduce any new Findbugs (version 3.0.0) warnings. -1 common tests 22m 21s Tests failed in hadoop-common.     61m 20s   Reason Tests Failed unit tests hadoop.ha.TestZKFailoverController   hadoop.net.TestNetUtils Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12750001/YARN-4046.002.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / b73181f checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/diffcheckstylehadoop-common.txt whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/whitespace.txt hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/artifact/patchprocess/testrun_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/testReport/ Java 1.7.0_55 uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/7463/console This message was automatically generated.
          Hide
          adhoot Anubhav Dhoot added a comment -

          Thanks Rohith Sharma K S for pointing it out. I have added a comment there why that fix is not working in all cases.

          Show
          adhoot Anubhav Dhoot added a comment - Thanks Rohith Sharma K S for pointing it out. I have added a comment there why that fix is not working in all cases.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          It looks it is a duplicate of YARN-3561. For fixing this issue, there is issue exist in hadoop-common i.e HADOOP-11989.
          Anubhav Dhoot would you have look at HADOOP-11989?

          Show
          rohithsharma Rohith Sharma K S added a comment - It looks it is a duplicate of YARN-3561 . For fixing this issue, there is issue exist in hadoop-common i.e HADOOP-11989 . Anubhav Dhoot would you have look at HADOOP-11989 ?
          Hide
          adhoot Anubhav Dhoot added a comment -

          Fixed whitespace

          Show
          adhoot Anubhav Dhoot added a comment - Fixed whitespace
          Hide
          adhoot Anubhav Dhoot added a comment -

          Fixed a new checkstyle that was added, the other two are preexisting and should not be fixed.

          Show
          adhoot Anubhav Dhoot added a comment - Fixed a new checkstyle that was added, the other two are preexisting and should not be fixed.
          Hide
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 17m 7s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 59s There were no new javac warning messages.
          +1 javadoc 9m 52s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 1m 9s The applied patch generated 3 new checkstyle issues (total was 97, now 99).
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 21s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 1m 57s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 common tests 22m 42s Tests failed in hadoop-common.
              63m 5s  



          Reason Tests
          Failed unit tests hadoop.ha.TestZKFailoverController
            hadoop.net.TestNetUtils



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12749949/YARN-4096.001.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 7c796fd
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/diffcheckstylehadoop-common.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/whitespace.txt
          hadoop-common test log https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/testrun_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8825/testReport/
          Java 1.7.0_55
          uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8825/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 pre-patch 17m 7s Pre-patch trunk compilation is healthy. +1 @author 0m 0s The patch does not contain any @author tags. +1 tests included 0m 0s The patch appears to include 1 new or modified test files. +1 javac 7m 59s There were no new javac warning messages. +1 javadoc 9m 52s There were no new javadoc warning messages. +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings. -1 checkstyle 1m 9s The applied patch generated 3 new checkstyle issues (total was 97, now 99). -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. +1 install 1m 21s mvn install still works. +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse. +1 findbugs 1m 57s The patch does not introduce any new Findbugs (version 3.0.0) warnings. -1 common tests 22m 42s Tests failed in hadoop-common.     63m 5s   Reason Tests Failed unit tests hadoop.ha.TestZKFailoverController   hadoop.net.TestNetUtils Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12749949/YARN-4096.001.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / 7c796fd checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/diffcheckstylehadoop-common.txt whitespace https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/whitespace.txt hadoop-common test log https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/testrun_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8825/testReport/ Java 1.7.0_55 uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Console output https://builds.apache.org/job/PreCommit-YARN-Build/8825/console This message was automatically generated.
          Hide
          adhoot Anubhav Dhoot added a comment -

          Chris Nauroth appreciate your review

          Show
          adhoot Anubhav Dhoot added a comment - Chris Nauroth appreciate your review
          Hide
          adhoot Anubhav Dhoot added a comment -

          Attaching patch that prefixes "--" when using negative pid for kill

          Show
          adhoot Anubhav Dhoot added a comment - Attaching patch that prefixes "--" when using negative pid for kill
          Hide
          adhoot Anubhav Dhoot added a comment -

          As per GNU linux documentation "-" may not be needed, but looks like all distros (Debian) do not support not having "-".

           If a negative pid argument is desired as the first one, it should be preceded by --. However, as a common extension to POSIX, -- is not required with ‘kill -signal -pid’. 

          So a fix is to prefix "--" always to match the recommendation.

          Show
          adhoot Anubhav Dhoot added a comment - As per GNU linux documentation "- " may not be needed, but looks like all distros (Debian) do not support not having " -". If a negative pid argument is desired as the first one, it should be preceded by --. However, as a common extension to POSIX, -- is not required with ‘kill -signal -pid’. So a fix is to prefix "--" always to match the recommendation.
          Hide
          adhoot Anubhav Dhoot added a comment -

          The error in NodeManager shows

          2015-08-10 15:14:05,567 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e45_1439244348718_0001_01_000001
          java.io.IOException: Timeout while waiting for exit code from container_e45_1439244348718_0001_01_000001
          	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:745)
          

          Looking under the debugger the actual shell command to check if container is alive fails because the kill command syntax "kill -0 -20773" fails.

          his = {org.apache.hadoop.util.Shell$ShellCommandExecutor@6740} "kill -0 -20773 "
          builder = {java.lang.ProcessBuilder@6789} 
           command = {java.util.ArrayList@6813}  size = 3
           directory = null
           environment = null
           redirectErrorStream = false
           redirects = null
          timeOutTimer = null
          timeoutTimerTask = null
          errReader = {java.io.BufferedReader@6830} 
          inReader = {java.io.BufferedReader@6833} 
          errMsg = {java.lang.StringBuffer@6836} "kill: invalid option -- '2'\n\nUsage:\n kill [options] <pid> [...]\n\nOptions:\n <pid> [...]            send signal to every <pid> listed\n -<signal>, -s, --signal <signal>\n                        specify the <signal> to be sent\n -l, --list=[<signal>]  list all signal names, or convert one to a name\n -L, --table            list all signal names in a nice table\n\n -h, --help     display this help and exit\n -V, --version  output version information and exit\n\nFor more details see kill(1).\n"
          errThread = {org.apache.hadoop.util.Shell$1@6839} "Thread[Thread-102,5,]"
          line = null
          exitCode = 1
          completed = {java.util.concurrent.atomic.AtomicBoolean@6806} "true"
          

          This causes DefaultContainerExecutor#containerIsAlive to catch ExitCodeException thrown by ShellCommandExecutor.execute making it assume the container is lost.

          Show
          adhoot Anubhav Dhoot added a comment - The error in NodeManager shows 2015-08-10 15:14:05,567 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e45_1439244348718_0001_01_000001 java.io.IOException: Timeout while waiting for exit code from container_e45_1439244348718_0001_01_000001 at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Looking under the debugger the actual shell command to check if container is alive fails because the kill command syntax "kill -0 -20773" fails. his = {org.apache.hadoop.util.Shell$ShellCommandExecutor@6740} "kill -0 -20773 " builder = {java.lang.ProcessBuilder@6789} command = {java.util.ArrayList@6813} size = 3 directory = null environment = null redirectErrorStream = false redirects = null timeOutTimer = null timeoutTimerTask = null errReader = {java.io.BufferedReader@6830} inReader = {java.io.BufferedReader@6833} errMsg = {java.lang.StringBuffer@6836} "kill: invalid option -- '2'\n\nUsage:\n kill [options] <pid> [...]\n\nOptions:\n <pid> [...] send signal to every <pid> listed\n -<signal>, -s, --signal <signal>\n specify the <signal> to be sent\n -l, --list=[<signal>] list all signal names, or convert one to a name\n -L, --table list all signal names in a nice table\n\n -h, --help display this help and exit\n -V, --version output version information and exit\n\nFor more details see kill(1).\n" errThread = {org.apache.hadoop.util.Shell$1@6839} "Thread[Thread-102,5,]" line = null exitCode = 1 completed = {java.util.concurrent.atomic.AtomicBoolean@6806} "true" This causes DefaultContainerExecutor#containerIsAlive to catch ExitCodeException thrown by ShellCommandExecutor.execute making it assume the container is lost.

            People

            • Assignee:
              adhoot Anubhav Dhoot
              Reporter:
              adhoot Anubhav Dhoot
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development