Hadoop YARN
  1. Hadoop YARN
  2. YARN-72

NM should handle cleaning up containers when it shuts down

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.3-alpha, 0.23.6
    • Component/s: nodemanager
    • Labels:
      None

      Description

      Ideally, the NM should wait for a limited amount of time when it gets a shutdown signal for existing containers to complete and kill the containers ( if we pick an aggressive approach ) after this time interval.

      For NMs which come up after an unclean shutdown, the NM should look through its directories for existing container.pids and try and kill an existing containers matching the pids found.

      1. YARN-72-2.patch
        24 kB
        Sandy Ryza
      2. YARN-72-2.patch
        24 kB
        Sandy Ryza
      3. YARN-72-2.patch
        24 kB
        Sandy Ryza
      4. YARN-72-1.patch
        23 kB
        Sandy Ryza
      5. YARN-72.patch
        8 kB
        Sandy Ryza

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #585 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/585/)
          YARN-72. Forgot to add 2 files to branch-0.23 (Revision 1469056)

          Result = UNSTABLE
          tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469056
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java
          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #585 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/585/ ) YARN-72 . Forgot to add 2 files to branch-0.23 (Revision 1469056) Result = UNSTABLE tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469056 Files : /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1246 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1246/)
          YARN-72. NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484)

          Result = FAILURE
          tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1246 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1246/ ) YARN-72 . NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484) Result = FAILURE tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #455 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/455/)
          YARN-72. NM should handle cleaning up containers when it shuts down. (Sandy Ryza via tomwhite) (Revision 1416562)

          Result = SUCCESS
          tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416562
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java
          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #455 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/455/ ) YARN-72 . NM should handle cleaning up containers when it shuts down. (Sandy Ryza via tomwhite) (Revision 1416562) Result = SUCCESS tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416562 Files : /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #56 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/56/)
          YARN-72. NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484)

          Result = SUCCESS
          tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Show
          Hudson added a comment - Integrated in Hadoop-Yarn-trunk #56 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/56/ ) YARN-72 . NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484) Result = SUCCESS tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Hide
          Thomas Graves added a comment -

          I pulled this into branch-0.23 too.

          Show
          Thomas Graves added a comment - I pulled this into branch-0.23 too.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1276 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1276/)
          YARN-72. NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484)

          Result = SUCCESS
          tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1276 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1276/ ) YARN-72 . NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484) Result = SUCCESS tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3084 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3084/)
          YARN-72. NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484)

          Result = SUCCESS
          tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Show
          Hudson added a comment - Integrated in Hadoop-trunk-Commit #3084 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3084/ ) YARN-72 . NM should handle cleaning up containers when it shuts down. Contributed by Sandy Ryza. (Revision 1416484) Result = SUCCESS tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1416484 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/CMgrCompletedContainersEvent.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockNodeStatusUpdater.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
          Hide
          Tom White added a comment -

          I just committed this. Thanks, Sandy!

          Show
          Tom White added a comment - I just committed this. Thanks, Sandy!
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12555583/YARN-72-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/204//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/204//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12555583/YARN-72-2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/204//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/204//console This message is automatically generated.
          Hide
          Sandy Ryza added a comment -

          Added that in.

          Show
          Sandy Ryza added a comment - Added that in.
          Hide
          Thomas Graves added a comment -

          minor suggestion (by no means blocking since I'm so late on this) it might be useful to print the containers that didn't finish in cleanupContainers:

          + // All containers killed
          + if (containers.isEmpty())

          { + LOG.info("All containers in DONE state"); + }
          Show
          Thomas Graves added a comment - minor suggestion (by no means blocking since I'm so late on this) it might be useful to print the containers that didn't finish in cleanupContainers: + // All containers killed + if (containers.isEmpty()) { + LOG.info("All containers in DONE state"); + }
          Hide
          Tom White added a comment -

          Updated summary to reflect current scope.

          Show
          Tom White added a comment - Updated summary to reflect current scope.
          Hide
          Tom White added a comment -

          +1

          Show
          Tom White added a comment - +1
          Hide
          Bikas Saha added a comment -

          If there is no other master value for these values then never mind

          Show
          Bikas Saha added a comment - If there is no other master value for these values then never mind
          Hide
          Sandy Ryza added a comment -

          Thanks for looking it over Bikas. Currently the other place these conf values are read is on instantiation of ContainerLaunch, one of which is created for each container. Is there somewhere else that it would make sense to store them that should be more authoritative than the Configuration object?

          Show
          Sandy Ryza added a comment - Thanks for looking it over Bikas. Currently the other place these conf values are read is on instantiation of ContainerLaunch, one of which is created for each container. Is there somewhere else that it would make sense to store them that should be more authoritative than the Configuration object?
          Hide
          Bikas Saha added a comment -

          Looks good. Minor nit

          If these conf values have already been read to actual member values then we might want to use them instead of reading the conf directly. This way we can account for any slop that those values may have added of their own.

          +    waitForContainersOnShutdownMillis =
          +        conf.getLong(YarnConfiguration.NM_SLEEP_DELAY_BEFORE_SIGKILL_MS,
          +            YarnConfiguration.DEFAULT_NM_SLEEP_DELAY_BEFORE_SIGKILL_MS) + 
          +        conf.getLong(YarnConfiguration.NM_PROCESS_KILL_WAIT_MS,
          +            YarnConfiguration.DEFAULT_NM_PROCESS_KILL_WAIT_MS) +
          +        SHUTDOWN_CLEANUP_SLOP_MS;
          
          Show
          Bikas Saha added a comment - Looks good. Minor nit If these conf values have already been read to actual member values then we might want to use them instead of reading the conf directly. This way we can account for any slop that those values may have added of their own. + waitForContainersOnShutdownMillis = + conf.getLong(YarnConfiguration.NM_SLEEP_DELAY_BEFORE_SIGKILL_MS, + YarnConfiguration.DEFAULT_NM_SLEEP_DELAY_BEFORE_SIGKILL_MS) + + conf.getLong(YarnConfiguration.NM_PROCESS_KILL_WAIT_MS, + YarnConfiguration.DEFAULT_NM_PROCESS_KILL_WAIT_MS) + + SHUTDOWN_CLEANUP_SLOP_MS;
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12555274/YARN-72-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/192//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/192//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12555274/YARN-72-2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/192//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/192//console This message is automatically generated.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12555273/YARN-72-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/191//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/191//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12555273/YARN-72-2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/191//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/191//console This message is automatically generated.
          Hide
          Sandy Ryza added a comment -

          Latest pass makes slop value a constant, changes ms to millis, and has test failing without the change.

          Show
          Sandy Ryza added a comment - Latest pass makes slop value a constant, changes ms to millis, and has test failing without the change.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12554934/YARN-72-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/186//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/186//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12554934/YARN-72-1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/186//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/186//console This message is automatically generated.
          Hide
          Tom White added a comment -

          I don't think it needs to be configurable, since it's a best effort cleanup anyway. What you have seems reasonable to me, although you might want to make the 1000 a constant so it's clear what it is for (slop value).

          Does the test fail without the change?

          Nit: I would change waitForContainersOnShutdownMs to waitForContainersOnShutdownMillis.

          Show
          Tom White added a comment - I don't think it needs to be configurable, since it's a best effort cleanup anyway. What you have seems reasonable to me, although you might want to make the 1000 a constant so it's clear what it is for (slop value). Does the test fail without the change? Nit: I would change waitForContainersOnShutdownMs to waitForContainersOnShutdownMillis.
          Hide
          Sandy Ryza added a comment -

          Newest patch contains a test and timeout. The timeout is yarn.nodemanager.sleep-delay-before-sigkill.ms + yarn.nodemanager.process-kill-wait.ms + 1000. Should I make this configurable?

          Show
          Sandy Ryza added a comment - Newest patch contains a test and timeout. The timeout is yarn.nodemanager.sleep-delay-before-sigkill.ms + yarn.nodemanager.process-kill-wait.ms + 1000. Should I make this configurable?
          Hide
          Tom White added a comment -

          Sandy, this looks like a good start, hooking in the code for container cleanup. I would focus on the part to cleanup on shutdown in this patch, and tackle cleanup on startup in YARN-73.

          As Bikas mentioned there needs to be a timeout on waiting for the containers to shutdown. The shutdown process waits for up to yarn.nodemanager.process-kill-wait.ms for the PID to appear, then yarn.nodemanager.sleep-delay-before-sigkill.ms before sending a SIGKILL signal (after a SIGTERM) if the process hasn't died - see ContainerLaunch#cleanupContainer. Waiting for a little longer than the sum of these durations would be sufficient.

          Regarding testing, you could have a test like the one in TestContainerLaunch#testDelayedKill to test that containers are correctly cleaned up after stopping a NM.

          Show
          Tom White added a comment - Sandy, this looks like a good start, hooking in the code for container cleanup. I would focus on the part to cleanup on shutdown in this patch, and tackle cleanup on startup in YARN-73 . As Bikas mentioned there needs to be a timeout on waiting for the containers to shutdown. The shutdown process waits for up to yarn.nodemanager.process-kill-wait.ms for the PID to appear, then yarn.nodemanager.sleep-delay-before-sigkill.ms before sending a SIGKILL signal (after a SIGTERM) if the process hasn't died - see ContainerLaunch#cleanupContainer. Waiting for a little longer than the sum of these durations would be sufficient. Regarding testing, you could have a test like the one in TestContainerLaunch#testDelayedKill to test that containers are correctly cleaned up after stopping a NM.
          Hide
          Sandy Ryza added a comment -

          Yeah, wanted to get feedback before a refresh, but I'll put in the timeout fix and tests if others think this is a good approach?

          I had thought that cgroups relied on unix process groups, but searching around now, I couldn't find a connection, so it seems like interfering with YARN-3 wouldn't actually be a problem. The other comments still apply to unix process groups.

          Show
          Sandy Ryza added a comment - Yeah, wanted to get feedback before a refresh, but I'll put in the timeout fix and tests if others think this is a good approach? I had thought that cgroups relied on unix process groups, but searching around now, I couldn't find a connection, so it seems like interfering with YARN-3 wouldn't actually be a problem. The other comments still apply to unix process groups.
          Hide
          Bikas Saha added a comment -

          Will the timeout fix be made in a patch refresh?

          I did not mean cgroups used in YARN-3 but the setsid unix process groups. I have asked a question on YARN-3 on how the changes in that might affect the semantics of the operation in this jira and YARN-73. What do you think? Does anything need to be done specially for cgroups?

          Show
          Bikas Saha added a comment - Will the timeout fix be made in a patch refresh? I did not mean cgroups used in YARN-3 but the setsid unix process groups. I have asked a question on YARN-3 on how the changes in that might affect the semantics of the operation in this jira and YARN-73 . What do you think? Does anything need to be done specially for cgroups?
          Hide
          Sandy Ryza added a comment -

          You're right, we should have some sort of timeout, and just move on and exit after that.

          My worries about the process group approach would be:

          • A process to be killed via process group is sent a SIGHUP signal, which it can choose to catch and ignore. The current NodeManager mechanism that my patch makes use of ultimately sends a SIGKILL, which cannot be ignored.
          • Processes are allowed to change their own process group.
          • The proposed solution to YARN-3 also relies on a possibly conflicting use process groups (I believe a single one for each container?).
          • From cursory Googling, there doesn't seem to be any nice way in Java to deal with them.

          That said, I'd also defer to someone with a better understanding of NM.

          Show
          Sandy Ryza added a comment - You're right, we should have some sort of timeout, and just move on and exit after that. My worries about the process group approach would be: A process to be killed via process group is sent a SIGHUP signal, which it can choose to catch and ignore. The current NodeManager mechanism that my patch makes use of ultimately sends a SIGKILL, which cannot be ignored. Processes are allowed to change their own process group. The proposed solution to YARN-3 also relies on a possibly conflicting use process groups (I believe a single one for each container?). From cursory Googling, there doesn't seem to be any nice way in Java to deal with them. That said, I'd also defer to someone with a better understanding of NM.
          Hide
          Bikas Saha added a comment -

          Do we intend to stall shutdown indefinitely for this activity?

          +    while (!containers.isEmpty()) {
          +      try {
          +        Thread.sleep(1000);
          +      } catch (InterruptedException ex) {
          +        LOG.warn("Interrupted while sleeping on container kill", ex);
          +      }
          +    }
          

          This patch deserves some good tests to verify the new functionality.

          Overall the approach seems reasonable but I will defer to someone with a better understanding of NM.

          I was wondering if the NM could make itself part of a process group (like setsid) such that everything it spawns is also part of that process group. And the process group could be configured to terminate if the NM root process (NM) dies. Then the OS will take care of cleaning up the orphan processes. This might solve YARN-72 and YARN-73. Is something like this possible?

          Show
          Bikas Saha added a comment - Do we intend to stall shutdown indefinitely for this activity? + while (!containers.isEmpty()) { + try { + Thread .sleep(1000); + } catch (InterruptedException ex) { + LOG.warn( "Interrupted while sleeping on container kill" , ex); + } + } This patch deserves some good tests to verify the new functionality. Overall the approach seems reasonable but I will defer to someone with a better understanding of NM. I was wondering if the NM could make itself part of a process group (like setsid) such that everything it spawns is also part of that process group. And the process group could be configured to terminate if the NM root process (NM) dies. Then the OS will take care of cleaning up the orphan processes. This might solve YARN-72 and YARN-73 . Is something like this possible?
          Hide
          Sandy Ryza added a comment -

          My patch only handles killing containers on shutdown, not previous ones when starting back up.

          Show
          Sandy Ryza added a comment - My patch only handles killing containers on shutdown, not previous ones when starting back up.
          Hide
          Sandy Ryza added a comment -

          Code already exists to kill existing containers when the resource manager requests it. Three events are dispatched to make this happen: a COMPLETED_CONTAINERS event is handled by the ContainerManager, which dispatches a KILL_CONTAINER event for each container to be killed, which the ContainerImpls handle by dispatching CLEANUP_CONTAINER events, which are finally handled by the ContainersLauncher, which tries to kills the containers.

          Does it make more sense to use this chain events or to try to call the kill code directly? For the former, the issue would be how do we know when the cleanup has been completed? It looks like ContainerImpls have their state changed when their containers are killed, so the shutdown code could monitor them until they all reach the correct state, but a fair bit of plumbing would be required for the shutdown code to be able to get to them. For the latter, similar plumbing would be required for the shutdown code to reach the ContainerImpls, and the other issue would be circumventing the event system, which might have consequences that I'm not able to foresee?

          This is my first foray into nodemanager code, so maybe someone who understands it better can provide some perspective?

          Show
          Sandy Ryza added a comment - Code already exists to kill existing containers when the resource manager requests it. Three events are dispatched to make this happen: a COMPLETED_CONTAINERS event is handled by the ContainerManager, which dispatches a KILL_CONTAINER event for each container to be killed, which the ContainerImpls handle by dispatching CLEANUP_CONTAINER events, which are finally handled by the ContainersLauncher, which tries to kills the containers. Does it make more sense to use this chain events or to try to call the kill code directly? For the former, the issue would be how do we know when the cleanup has been completed? It looks like ContainerImpls have their state changed when their containers are killed, so the shutdown code could monitor them until they all reach the correct state, but a fair bit of plumbing would be required for the shutdown code to be able to get to them. For the latter, similar plumbing would be required for the shutdown code to reach the ContainerImpls, and the other issue would be circumventing the event system, which might have consequences that I'm not able to foresee? This is my first foray into nodemanager code, so maybe someone who understands it better can provide some perspective?
          Hide
          Bikas Saha added a comment -

          Pids might get reused. so we need to guard against that.
          Is there any generic Linux OS mechanism that would make the running containers die when the NM dies? e.g. On Windows, NM can use JobObject's to make this happen.

          Show
          Bikas Saha added a comment - Pids might get reused. so we need to guard against that. Is there any generic Linux OS mechanism that would make the running containers die when the NM dies? e.g. On Windows, NM can use JobObject's to make this happen.
          Hide
          Hitesh Shah added a comment -

          NM should be able to access container.pid files to kill existing containers before the files are cleaned up following an unclean shutdown.

          Show
          Hitesh Shah added a comment - NM should be able to access container.pid files to kill existing containers before the files are cleaned up following an unclean shutdown.

            People

            • Assignee:
              Sandy Ryza
              Reporter:
              Hitesh Shah
            • Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development