      Description

      When the NM is shutting down with restart support enabled, there are three scenarios we'd like to distinguish and handle accordingly:

      1. The NM is running under supervision. In that case containers should be preserved so the automatic restart can recover them.
      2. The NM is not running under supervision and a rolling upgrade is not being performed. In that case the shutdown should kill all containers since it is unlikely the NM will be restarted in a timely manner to recover them.
      3. The NM is not running under supervision and a rolling upgrade is being performed. In that case the shutdown should not kill all containers since a restart is imminent due to the rolling upgrade and the containers will be recovered.
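
      The three scenarios above reduce to a small decision function. The following is only a sketch of that decision table; the names ShutdownAction and ShutdownPolicy are hypothetical and do not come from the YARN code base.

```java
/** Hypothetical outcome of an NM shutdown with restart support enabled. */
enum ShutdownAction { PRESERVE_CONTAINERS, KILL_CONTAINERS }

/** Hypothetical helper; not part of YARN. */
final class ShutdownPolicy {
    private ShutdownPolicy() {}

    /**
     * Maps the two facts from the description (supervised or not,
     * rolling upgrade in progress or not) to a shutdown action.
     */
    static ShutdownAction decide(boolean supervised, boolean rollingUpgrade) {
        if (supervised) {
            // Scenario 1: the supervisor restarts the NM, which recovers the containers.
            return ShutdownAction.PRESERVE_CONTAINERS;
        }
        // Scenario 3 preserves containers because a restart is imminent;
        // scenario 2 kills them because no timely restart is expected.
        return rollingUpgrade
            ? ShutdownAction.PRESERVE_CONTAINERS
            : ShutdownAction.KILL_CONTAINERS;
    }
}
```

      Only the unsupervised, non-upgrade case kills containers; how the NM learns that a rolling upgrade is in progress is the open question discussed in the comments.
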

      Attachments

      1. YARN-2331.patch (12 kB, Jason Lowe)
      2. YARN-2331v2.patch (14 kB, Jason Lowe)
      3. YARN-2331v3.patch (13 kB, Jason Lowe)

        Activity

        jlowe Jason Lowe added a comment -

        We can distinguish between supervised/unsupervised via a config. Determining whether an unsupervised shutdown is due to a rolling upgrade is a bit trickier. Some of the options there include:

        • Add an admin port to NMs and a corresponding CLI command to send commands to the port. There's a lot of boilerplate that goes along with this, but it is the most flexible option if we ever want to add other admin commands to an NM.
        • Add a REST API to do this (with appropriate authentication to make sure not just anyone can cause an NM shutdown)
        • Use another signal handler to indicate the shutdown just like the SIGTERM handler today for a normal shutdown but for another signal like SIGINT. The shell scripts could have a new command that would perform the rolling upgrade shutdown with the new signal rather than SIGTERM. This would be relatively simple to implement on POSIX platforms like Linux but has portability ramifications for non-POSIX platforms like Windows.
        jlowe Jason Lowe added a comment -

        Another possible approach is to have the NM always try to clean up containers on a shutdown when it is unsupervised. If a rolling upgrade needs to be performed and thus containers need to be preserved, the NM would be killed without the chance to clean up (e.g., kill -9 to deliver a SIGKILL). Upon restart the NM would recover the state from the state store and reacquire the containers.
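
        A toy model may make the difference between the two shutdown paths easier to see. It is only a sketch: ToyNodeManager and its methods are hypothetical, the two sets stand in for the persistent state store and for the container processes on the node, and a SIGKILL is modeled simply by never calling gracefulStop().

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model; class and method names are hypothetical, not YARN APIs. */
final class ToyNodeManager {
    /** Stands in for the persistent NM state store (survives the NM process). */
    private final Set<String> stateStore;
    /** Stands in for the container processes running on the node. */
    private final Set<String> runningProcesses;
    private final boolean supervised;

    ToyNodeManager(Set<String> stateStore, Set<String> runningProcesses, boolean supervised) {
        this.stateStore = stateStore;
        this.runningProcesses = runningProcesses;
        this.supervised = supervised;
    }

    void launch(String containerId) {
        stateStore.add(containerId);
        runningProcesses.add(containerId);
    }

    /** Graceful stop (the SIGTERM path): an unsupervised NM cleans up its containers. */
    void gracefulStop() {
        if (!supervised) {
            runningProcesses.clear();
            stateStore.clear();
        }
    }

    /** After a restart: reacquire containers that are recorded and still running. */
    Set<String> recover() {
        Set<String> recovered = new HashSet<>(stateStore);
        recovered.retainAll(runningProcesses);
        return recovered;
    }
}
```

        In this model, an unsupervised NM that is killed outright leaves both sets untouched, so a new instance built on the same sets recovers every container; one that goes through gracefulStop() leaves nothing to recover.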

        djp Junping Du added a comment -

        Jason Lowe, for a rolling upgrade when the NM is not supervised, I think another way is to add a command to RM Admin that brings down a specific NM without killing its containers (by notifying the RMNode and replying through the heartbeat), given that the NM has no admin port so far. An unsupervised NM shutdown for any other reason (decommission or an occasional failure) won't go through this CLI, so it won't preserve running containers. Thoughts?
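
        The RM-driven idea can be pictured roughly as follows. This is a sketch under stated assumptions, not a proposal for the actual protocol: ToyResourceManager, HeartbeatAction and requestUpgradeShutdown are hypothetical names, and real heartbeat handling in YARN is considerably more involved.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical action carried back to the NM in a heartbeat response. */
enum HeartbeatAction { NORMAL, SHUTDOWN_KEEP_CONTAINERS }

/** Toy model of the RM side; not a YARN class. */
final class ToyResourceManager {
    /** Nodes an admin has asked to shut down without killing containers. */
    private final Set<String> pendingUpgradeShutdowns = ConcurrentHashMap.newKeySet();

    /** Hypothetical entry point for the admin CLI command. */
    void requestUpgradeShutdown(String nodeId) {
        pendingUpgradeShutdowns.add(nodeId);
    }

    /** Called for each NM heartbeat; a pending request is delivered exactly once. */
    HeartbeatAction heartbeat(String nodeId) {
        return pendingUpgradeShutdowns.remove(nodeId)
            ? HeartbeatAction.SHUTDOWN_KEEP_CONTAINERS
            : HeartbeatAction.NORMAL;
    }
}
```

        Any shutdown that does not originate from the admin command never sees SHUTDOWN_KEEP_CONTAINERS, which matches the point that such shutdowns would not preserve running containers.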

        jlowe Jason Lowe added a comment -

        Ideally the process is self-contained on the NM node so once it has shut down without killing containers it can be immediately restarted on the new release to minimize the period where the NM is not responding. I suppose we could have the shutdown/upgrade script on the NM issue the rmadmin command then wait for the NM to receive the RM command and exit.

        I think it would be cleaner if we didn't have to involve the RM. However I don't feel so strongly that I'd object if we can't find a nice way to do this with just the NM node.

        jlowe Jason Lowe added a comment -

        In the interest of getting something done for this in time for 2.6, here's a patch that adds a conf to tell the NM whether or not it's supervised. If supervised then it is expected to be quickly restarted on shutdown and will not kill containers. If unsupervised then it will kill containers on shutdown since it does not expect to be restarted in a timely manner.
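
        As a rough illustration of the behavior described here, the check could look like the sketch below. The two property names are the ones quoted later in this thread; the helper class, the use of java.util.Properties in place of Hadoop's Configuration, and the default of false for both keys are assumptions made for the example, and this is not the code from the patch.

```java
import java.util.Properties;

/** Hypothetical helper; plain Properties stands in for Hadoop's Configuration. */
final class RecoveryShutdownCheck {
    static final String RECOVERY_ENABLED = "yarn.nodemanager.recovery.enabled";
    static final String RECOVERY_SUPERVISED = "yarn.nodemanager.recovery.supervised";

    private RecoveryShutdownCheck() {}

    /**
     * Returns true when a graceful shutdown should kill running containers.
     * Containers are kept only if recovery is enabled and the NM is supervised.
     * Both keys are assumed to default to false in this sketch.
     */
    static boolean shouldKillContainersOnShutdown(Properties conf) {
        boolean recoveryEnabled =
            Boolean.parseBoolean(conf.getProperty(RECOVERY_ENABLED, "false"));
        boolean supervised =
            Boolean.parseBoolean(conf.getProperty(RECOVERY_SUPERVISED, "false"));
        return !(recoveryEnabled && supervised);
    }
}
```

        With recovery enabled but supervision left off, a graceful stop still kills containers in this sketch, so preserving them for an upgrade relies on the NM being killed without a chance to clean up.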

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12673377/YARN-2331.patch
        against trunk revision 2e789eb.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

        org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery
        org.apache.hadoop.yarn.server.nodemanager.TestContainerManagerWithLCE

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5308//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5308//console

        This message is automatically generated.

        jlowe Jason Lowe added a comment -

        Updated patch to fix the unit tests.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12673407/YARN-2331v2.patch
        against trunk revision 9196db9.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5312//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5312//console

        This message is automatically generated.

        djp Junping Du added a comment -

        Thanks Jason Lowe for the patch. One thing I want to confirm: after this patch, if we set "yarn.nodemanager.recovery.enabled" to true but "yarn.nodemanager.recovery.supervised" to false, containers keep running if we kill the NM daemon with "kill -9", but stopping it through "yarn-daemon.sh stop nodemanager" will kill running containers. Is that right?

        jlowe Jason Lowe added a comment -

        Yes, the patch implements the "Another possible approach" comment. This is how we're currently managing NM restarts, as we don't run our NMs under supervision.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote | Subsystem      | Runtime | Comment
        0    | pre-patch      | 14m 39s | Pre-patch trunk compilation is healthy.
        +1   | @author        | 0m 0s   | The patch does not contain any @author tags.
        +1   | tests included | 0m 0s   | The patch appears to include 1 new or modified test files.
        -1   | javac          | 2m 58s  | The patch appears to cause the build to fail.



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12673407/YARN-2331v2.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / e8d0ee5
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/7666/console

        This message was automatically generated.

        xgong Xuan Gong added a comment -

        Canceling the patch since it no longer applies.

        xgong Xuan Gong added a comment -

        Jason Lowe, could you rebase the patch, please?

        Probably we could set the default value for yarn.nodemanager.recovery.supervised to true. Normally, when people add a node as an NM, they expect to use this node for a long time. So a restart is expected?

        jlowe Jason Lowe added a comment -

        Updated patch to trunk.

        "Probably we could set the default value for yarn.nodemanager.recovery.supervised to true. Normally, when people add a node as an NM, they expect to use this node for a long time. So a restart is expected?"

        The problem is if the NM is not being supervised then when it goes down there isn't going to be a timely restart. That will leave containers unmanaged on the node (e.g.: can't be killed by YARN since NM is down). The user may eventually get around to restarting the NM, but if that takes hours or days that doesn't help so much.

        Before NM restart support, the NM would try to kill all active containers on shutdown to prevent this. With restart this is undesirable unless the NM is going down and isn't going to be started in a timely manner (i.e., this isn't an upgrade or the NM isn't being supervised).

        xgong Xuan Gong added a comment -

        Thanks for the explanation, Jason Lowe. That makes sense to me.
        The patch LGTM. Kicking Jenkins.

        hadoopqa Hadoop QA added a comment -



        +1 overall



        Vote | Subsystem       | Runtime | Comment
        0    | pre-patch       | 14m 39s | Pre-patch trunk compilation is healthy.
        +1   | @author         | 0m 0s   | The patch does not contain any @author tags.
        +1   | tests included  | 0m 0s   | The patch appears to include 1 new or modified test files.
        +1   | javac           | 7m 33s  | There were no new javac warning messages.
        +1   | javadoc         | 9m 39s  | There were no new javadoc warning messages.
        +1   | release audit   | 0m 22s  | The applied patch does not increase the total number of release audit warnings.
        +1   | checkstyle      | 2m 19s  | There were no new checkstyle issues.
        +1   | whitespace      | 0m 1s   | The patch has no lines that end in whitespace.
        +1   | install         | 1m 38s  | mvn install still works.
        +1   | eclipse:eclipse | 0m 34s  | The patch built with eclipse:eclipse.
        +1   | findbugs        | 3m 46s  | The patch does not introduce any new Findbugs (version 2.0.3) warnings.
        +1   | yarn tests      | 0m 23s  | Tests passed in hadoop-yarn-api.
        +1   | yarn tests      | 1m 57s  | Tests passed in hadoop-yarn-common.
        +1   | yarn tests      | 6m 0s   | Tests passed in hadoop-yarn-server-nodemanager.
             |                 | 48m 55s |



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12731536/YARN-2331v3.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / f523e96
        hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/7824/artifact/patchprocess/testrun_hadoop-yarn-api.txt
        hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/7824/artifact/patchprocess/testrun_hadoop-yarn-common.txt
        hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/7824/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/7824/testReport/
        Java 1.7.0_55
        uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/7824/console

        This message was automatically generated.

        xgong Xuan Gong added a comment -

        +1. Will commit

        xgong Xuan Gong added a comment -

        Committed into trunk/branch-2. Thanks, Jason!

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #7779 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7779/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #191 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/191/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #922 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/922/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2120 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2120/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #180 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/180/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #190 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/190/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        • hadoop-yarn-project/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2138 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2138/)
        YARN-2331. Distinguish shutdown during supervision vs. shutdown for (xgong: rev 088156de43abb07bec590a3fcd1a5af2feb02cd2)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        Karthik Palaniappan Karthik Palaniappan added a comment -

        Toggling this new configuration property (yarn.nodemanager.recovery.supervised) isn't very different from toggling the property that enables recovery (yarn.nodemanager.recovery.enabled). It's surprising that you now need to flip two properties to get NM work preservation to work.

        Is there a reason that you need to distinguish between a supervised NM shutdown and a rolling upgrade related shutdown?

        I'm complaining because the instructions in the 2.7 line are incorrect in 2.8: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html. Equivalent docs don't exist in the 2.8 line (i.e. if you change the URL to r2.8.2), so I couldn't find any documentation of this new property.

        jlowe Jason Lowe added a comment -

        There is documentation of the property at https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, but I agree it could be better.

        Is there a reason that you need to distinguish between a supervised NM shutdown and a rolling upgrade related shutdown?

        Yes, in the sense that the two shutdowns may differ depending on how the rolling upgrade shutdown was performed. For example, in our clusters we do not have direct supervision of the nodemanagers; instead, another tool periodically comes along and services nodes that have fallen out of the cluster. That means the nodemanager will not necessarily be restarted in a timely manner if it crashes. In that case we want the nodemanager to shut down cleanly during the crash, killing all running containers, since otherwise they would be unsupervised and the RM would believe the containers are dead due to the lack of NM heartbeats from this node. If the NM were under direct supervision, it would be restarted quickly after it crashes. In that scenario we would not want it to kill the containers and would instead let the NM recover them upon restart.

        For rolling upgrades we kill the nodemanager with SIGKILL, preventing it from doing any cleanup processing. Then we restart the nodemanagers on the new software, and the nodemanager recovers the containers on startup. In our clusters the work-preserving and supervised properties are set differently so the NM knows to support recovery yet still kill containers on shutdown. Before this change the NM would always kill containers on a shutdown, so it would be impossible to preserve work in the case where the NM threw an exception and performed an orderly shutdown while under supervision.
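
        To make the interaction of the two properties concrete, here is a minimal sketch of the decision described above. It is not the actual ContainerManagerImpl code: the class ShutdownPolicy and its method are hypothetical, java.util.Properties stands in for the Hadoop Configuration object, and treating an unset property as false is an assumption of the sketch. It also ignores other factors such as decommissioning.

```java
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class ShutdownPolicy {
    // Property names as discussed in this issue.
    static final String RECOVERY_ENABLED = "yarn.nodemanager.recovery.enabled";
    static final String RECOVERY_SUPERVISED = "yarn.nodemanager.recovery.supervised";

    private ShutdownPolicy() {}

    /**
     * Returns true if an orderly NM shutdown should kill running containers.
     * Containers are preserved only when recovery is enabled AND the NM is
     * declared to be supervised, i.e. a prompt restart is expected.
     * Assumption: an unset property is treated as false.
     */
    static boolean killContainersOnOrderlyShutdown(Properties conf) {
        boolean recoveryEnabled =
            Boolean.parseBoolean(conf.getProperty(RECOVERY_ENABLED, "false"));
        boolean supervised =
            Boolean.parseBoolean(conf.getProperty(RECOVERY_SUPERVISED, "false"));
        return !(recoveryEnabled && supervised);
    }
}
```

        With recovery enabled but supervised left false (the setup described for our clusters), this returns true for an orderly shutdown, while a SIGKILL during a rolling upgrade never reaches this logic at all, so the containers survive and are recovered on restart.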

        In 2.8 and later the nodemanager restart documentation moved to a unified nodemanager page, e.g.: https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-site/NodeManager.html, but it still doesn't describe this property. I filed YARN-7502 to update the nodemanager restart docs to cover this property and when it would be useful.

        Karthik Palaniappan Karthik Palaniappan added a comment -

        Ah, thanks for the quick response.


          People

          • Assignee:
            jlowe Jason Lowe
            Reporter:
            jlowe Jason Lowe
          • Votes:
            0
            Watchers:
            10
