Details

    • Type: Sub-task Sub-task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    1. MAPREDUCE-4351-v7.sh
      0.5 kB
      Andrew Ferguson
    2. MAPREDUCE-4351-v7.patch
      23 kB
      Andrew Ferguson
    3. MAPREDUCE-4351-v6.patch
      89 kB
      Andrew Ferguson
    4. MAPREDUCE-4351-v5.patch
      88 kB
      Thomas Graves
    5. MAPREDUCE-4351-v5.patch
      88 kB
      Andrew Ferguson
    6. MAPREDUCE-4351-v4.patch
      87 kB
      Thomas Graves
    7. MAPREDUCE-4351-v4.patch
      87 kB
      Andrew Ferguson
    8. MAPREDUCE-4351-v3.patch
      86 kB
      Andrew Ferguson
    9. MAPREDUCE-4351-v2.patch
      78 kB
      Andrew Ferguson
    10. MAPREDUCE-4351-v1.patch
      75 kB
      Andrew Ferguson

      Issue Links

        Activity

        Hide
        Andrew Ferguson added a comment -

        This is the same patch as v6 (which passed the buildbot), split into a Bash script which does 'svn mv' and an actual patch, following the guidelines Tom White posted above.

        thanks,
        Andrew

        Show
        Andrew Ferguson added a comment - This is the same patch as v6 (which passed the buildbot), split into a Bash script which does 'svn mv' and an actual patch, following the guidelines Tom White posted above. thanks, Andrew
        Hide
        Andrew Ferguson added a comment -

        ok, that's what I thought you meant as an example, but wanted to double-check.

        handle() was one of the functions I wrote differently for CgroupsContainersMonitor. Therefore, I wouldn't think of it as a candidate to move to a base class.

        To me, handle() seems like the core of a ContainersMonitor – it receives START_MONITORING_CONTAINER and STOP_MONITORING_CONTAINER and decides what to do.

        Show
        Andrew Ferguson added a comment - ok, that's what I thought you meant as an example, but wanted to double-check. handle() was one of the functions I wrote differently for CgroupsContainersMonitor. Therefore, I wouldn't think of it as a candidate to move to a base class. To me, handle() seems like the core of a ContainersMonitor – it receives START_MONITORING_CONTAINER and STOP_MONITORING_CONTAINER and decides what to do.
        Hide
        Bikas Saha added a comment -

        I gave an example in the last comment.

          @Override
          public void handle(ContainersMonitorEvent monitoringEvent) {
        
        Show
        Bikas Saha added a comment - I gave an example in the last comment. @Override public void handle(ContainersMonitorEvent monitoringEvent) {
        Hide
        Andrew Ferguson added a comment -

        @Bikas, can you say a bit more regarding your comment? I'm not sure what you mean. When I implemented the CgroupsContainersMonitor, I didn't find that I was reproducing very much code from the DefaultContainersMonitor. Sure, both need to initialize the ResourceCalculatorPlugin and use it, but that was about it.

        thanks,
        Andrew

        Show
        Andrew Ferguson added a comment - @Bikas, can you say a bit more regarding your comment? I'm not sure what you mean. When I implemented the CgroupsContainersMonitor, I didn't find that I was reproducing very much code from the DefaultContainersMonitor. Sure, both need to initialize the ResourceCalculatorPlugin and use it, but that was about it. thanks, Andrew
        Hide
        Bikas Saha added a comment -

        Seems to me that there is some code in DefaultContainerMonitor that should move into a base class. E.g. no point requiring every derived class implement event handlers unless they want to change the handling mechanism such as a different threading model. Or one could consider DefaultContainerMonitor the base class with a complete implementation. In that case, a rename and more comments would be good.

        Show
        Bikas Saha added a comment - Seems to me that there is some code in DefaultContainerMonitor that should move into a base class. E.g. no point requiring every derived class implement event handlers unless they want to change the handling mechanism such as a different threading model. Or one could consider DefaultContainerMonitor the base class with a complete implementation. In that case, a rename and more comments would be good.
        Hide
        Andrew Ferguson added a comment -

        indeed, ContainersMonitor is not a great name for the current class. in my mind, it has four jobs:

        1) monitor resource usage
        2) enforce resource limits
        3) notify YARN of killed or failed tasks
        4) delete temporary files created by container executor

        while I see arguments to split each of these jobs into a separate class, I also see reasons to keep them together (a la the current ContainersMonitor): jobs #1 and #3 are similar (poll /proc), jobs #1 and #2 are related (in the current implementation, the results of #1 are used for #2), and #2 and #4 are related, as you may need to revert some resource settings after the job is complete.

        thoughts?

        thanks!

        Show
        Andrew Ferguson added a comment - indeed, ContainersMonitor is not a great name for the current class. in my mind, it has four jobs: 1) monitor resource usage 2) enforce resource limits 3) notify YARN of killed or failed tasks 4) delete temporary files created by container executor while I see arguments to split each of these jobs into a separate class, I also see reasons to keep them together (a la the current ContainersMonitor): jobs #1 and #3 are similar (poll /proc), jobs #1 and #2 are related (in the current implementation, the results of #1 are used for #2), and #2 and #4 are related, as you may need to revert some resource settings after the job is complete. thoughts? thanks!
        Hide
        Arun C Murthy added a comment -

        It seems like we have a couple of things we need to support:

        a) Ability for NM to monitor containers for resource usage (e.g. a container might use only 1G even if it was allocated 2G etc.).
        b) Ability for NM to enforce the resource limits.

        Thus, it might be better for this to be called ContainerResourceEnforcer or some such? We'd like ContainerMonitor to merely monitor containers.

        Show
        Arun C Murthy added a comment - It seems like we have a couple of things we need to support: a) Ability for NM to monitor containers for resource usage (e.g. a container might use only 1G even if it was allocated 2G etc.). b) Ability for NM to enforce the resource limits. Thus, it might be better for this to be called ContainerResourceEnforcer or some such? We'd like ContainerMonitor to merely monitor containers.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12534964/MAPREDUCE-4351-v6.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2543//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2543//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534964/MAPREDUCE-4351-v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2543//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2543//console This message is automatically generated.
        Hide
        Arun C Murthy added a comment -

        Looks like a fairly large patch, looking now.

        Show
        Arun C Murthy added a comment - Looks like a fairly large patch, looking now.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12534964/MAPREDUCE-4351-v6.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2542//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2542//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534964/MAPREDUCE-4351-v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2542//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2542//console This message is automatically generated.
        Hide
        Tom White added a comment -

        Forgot to say that HADOOP-7560 has an example of a mv script + patch.

        Show
        Tom White added a comment - Forgot to say that HADOOP-7560 has an example of a mv script + patch.
        Hide
        Tom White added a comment -

        Overall this looks like a reasonable refactor to me.

        When the patch is committed we'll need a script that does 'svn mv' on the files that changed names (e.g. ContainersMonitorImpl.java -> DefaultContainerExecutor.java), so that history is preserved. This will be run before applying the patch. You can see how to do this at http://wiki.apache.org/hadoop/HowToContribute#Creating_a_patch, under the bit that says "If you need to rename files in your patch". A patch generated this way is also easier to review. However, Jenkins can't apply it so you might want to wait for a clear Jenkins run first.

        Show
        Tom White added a comment - Overall this looks like a reasonable refactor to me. When the patch is committed we'll need a script that does 'svn mv' on the files that changed names (e.g. ContainersMonitorImpl.java -> DefaultContainerExecutor.java), so that history is preserved. This will be run before applying the patch. You can see how to do this at http://wiki.apache.org/hadoop/HowToContribute#Creating_a_patch , under the bit that says "If you need to rename files in your patch". A patch generated this way is also easier to review. However, Jenkins can't apply it so you might want to wait for a clear Jenkins run first.
        Hide
        Andrew Ferguson added a comment -

        Thanks, Tom. This version updates findbugs-exclude.xml

        Show
        Andrew Ferguson added a comment - Thanks, Tom. This version updates findbugs-exclude.xml
        Hide
        Tom White added a comment -

        Andrew - you should be able to change hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml to exclude the cast.

        Show
        Tom White added a comment - Andrew - you should be able to change hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml to exclude the cast.
        Hide
        Andrew Ferguson added a comment -

        hi Todd,

        Unless I'm mistaken, findbugs is referring to the cast to ContainerStartMonitoringEvent in DefaultContainersMonitor.java, which is simply the new name for the old ContainersMonitorImpl.java. I modeled this after the existing "DefaultContainerExecutor.java / LinuxContainerExecutor.java" naming, calling the file in MAPREDUCE-4334 CgroupsContainersMonitor.java

        I'm assuming findbugs isn't smart enough to realize that it's complaining about code which previously existed in a different file, no?

        thanks!
        Andrew

        Show
        Andrew Ferguson added a comment - hi Todd, Unless I'm mistaken, findbugs is referring to the cast to ContainerStartMonitoringEvent in DefaultContainersMonitor.java, which is simply the new name for the old ContainersMonitorImpl.java. I modeled this after the existing "DefaultContainerExecutor.java / LinuxContainerExecutor.java" naming, calling the file in MAPREDUCE-4334 CgroupsContainersMonitor.java I'm assuming findbugs isn't smart enough to realize that it's complaining about code which previously existed in a different file, no? thanks! Andrew
        Hide
        Todd Lipcon added a comment -

        Hey Andrew. Looks like the findbugs warning is indeed about the new code, though. Can you take a look?

        Show
        Todd Lipcon added a comment - Hey Andrew. Looks like the findbugs warning is indeed about the new code, though. Can you take a look?
        Hide
        Jason Lowe added a comment -

        The javadoc warnings are unrelated, see HDFS-3550. The javadoc build log for a Jenkins run can be found by examining the patchJavadocWarnings.txt log from the build artifacts. For example, click on the test report link, then Status on the sidebar, then expand the build artifacts and click on the patchJavadocWarnings.txt to see the log.

        Show
        Jason Lowe added a comment - The javadoc warnings are unrelated, see HDFS-3550 . The javadoc build log for a Jenkins run can be found by examining the patchJavadocWarnings.txt log from the build artifacts. For example, click on the test report link, then Status on the sidebar, then expand the build artifacts and click on the patchJavadocWarnings.txt to see the log.
        Hide
        Andrew Ferguson added a comment -

        For an example of how this is used, please see my patch in MAPREDUCE-4334, which provides an alternative ContainersMonitor that enforces limits using cgroups.

        re: the current patch. I can't tell if the javadoc warnings are real; perhaps someone more experienced can take a look? The findbugs warning is spurious.

        thanks!

        Show
        Andrew Ferguson added a comment - For an example of how this is used, please see my patch in MAPREDUCE-4334 , which provides an alternative ContainersMonitor that enforces limits using cgroups. re: the current patch. I can't tell if the javadoc warnings are real; perhaps someone more experienced can take a look? The findbugs warning is spurious. thanks!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532868/MAPREDUCE-4351-v5.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 javadoc. The javadoc tool appears to have generated 13 warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532868/MAPREDUCE-4351-v5.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 13 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2490//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        reattaching same patch to kick jenkins.

        sorry looks like I missed that box. I made sure its installed on all the slaves for mapreduce now.

        Show
        Thomas Graves added a comment - reattaching same patch to kick jenkins. sorry looks like I missed that box. I made sure its installed on all the slaves for mapreduce now.
        Hide
        Andrew Ferguson added a comment -

        Another libstdc++ failure from pipes.

        Show
        Andrew Ferguson added a comment - Another libstdc++ failure from pipes.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532778/MAPREDUCE-4351-v5.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2488//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532778/MAPREDUCE-4351-v5.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. -1 javac. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2488//console This message is automatically generated.
        Hide
        Andrew Ferguson added a comment -

        Add documentation to ContainersMonitor

        Show
        Andrew Ferguson added a comment - Add documentation to ContainersMonitor
        Hide
        Andrew Ferguson added a comment -

        .bq Oh also please look into the findbugs and javadoc warnings.

        The findbugs warning is spurious; it's from existing code and triggered by the rename of the file. (the cast happens inside a switch block on monitoringEvent.getType())

        Not sure about the javadoc warnings yet.

        Show
        Andrew Ferguson added a comment - .bq Oh also please look into the findbugs and javadoc warnings. The findbugs warning is spurious; it's from existing code and triggered by the rename of the file. (the cast happens inside a switch block on monitoringEvent.getType()) Not sure about the javadoc warnings yet.
        Hide
        Andrew Ferguson added a comment -

        Hi Robert,

        Thanks for looking at the patch.

        I am a bit confused as to why the creation of the ContainersMonitor was moved from ContainerManagerImpl to NodeManager. The NodeManager does not appear to have any need for it.

        When I put loading ContainersMonitor into ContainerManagerImpl, a number of tests failed. It seemed to want to be added as a service before ContainerManagerImpl was created, but I don't have a full understanding of the services and how they relate, so this may be have been the wrong move. This way, it also matches the way ContainerExecutor is loaded, or how other services such as the NodeStatusUpdater are started.

        I would also like to see some more documentation about how a ContainersMonitor is supposed to behave. The only API in there is setup. It would be nice to be able to document what events a ContaiersMonitor is expected to handle (START_MONITORING_CONTAINER, STOP_MONITORING_CONTAINER) and how it may optionally stop a misbehaving container ContainerKillEvent.

        Great point, thanks. I will add documentation.

        Andrew

        Show
        Andrew Ferguson added a comment - Hi Robert, Thanks for looking at the patch. I am a bit confused as to why the creation of the ContainersMonitor was moved from ContainerManagerImpl to NodeManager. The NodeManager does not appear to have any need for it. When I put loading ContainersMonitor into ContainerManagerImpl, a number of tests failed. It seemed to want to be added as a service before ContainerManagerImpl was created, but I don't have a full understanding of the services and how they relate, so this may be have been the wrong move. This way, it also matches the way ContainerExecutor is loaded, or how other services such as the NodeStatusUpdater are started. I would also like to see some more documentation about how a ContainersMonitor is supposed to behave. The only API in there is setup. It would be nice to be able to document what events a ContaiersMonitor is expected to handle (START_MONITORING_CONTAINER, STOP_MONITORING_CONTAINER) and how it may optionally stop a misbehaving container ContainerKillEvent. Great point, thanks. I will add documentation. Andrew
        Hide
        Robert Joseph Evans added a comment -

        @Arun,

        Having cgroups will block the container in, but there are other use cases.

        1. There will be some OSs we want to support that do not have very good process group isolation so having this pluggable allows us to put in a NullMonitor for cgroups and a default/polling monitor for older versions of Linux.
        2. There are also cases where we may only want to monitor the resource and not shoot the process. There was some discussion about providing feedback to the scheduler about what resources each container is actually using so that it can potentially over commit. Even in the case of cgroups we would need something to be able to get that information and return it to the RM.

        There is some work that needs to go into figuring out exactly how we want to monitor resources vs. enforce resources, and how we want to support this for different resources. I did not see much of an issue with putting in a small amount of refactoring so that others can start experimenting with the stipulation that everything may change in the future, @Unstable. If you disagree I am fine with holding off on it and starting that discussion now. Either here, on MAPREDUCE-4256, or perhaps in a new rollup JIRA for the full feature.

        Show
        Robert Joseph Evans added a comment - @Arun, Having cgroups will block the container in, but there are other use cases. There will be some OSs we want to support that do not have very good process group isolation so having this pluggable allows us to put in a NullMonitor for cgroups and a default/polling monitor for older versions of Linux. There are also cases where we may only want to monitor the resource and not shoot the process. There was some discussion about providing feedback to the scheduler about what resources each container is actually using so that it can potentially over commit. Even in the case of cgroups we would need something to be able to get that information and return it to the RM. There is some work that needs to go into figuring out exactly how we want to monitor resources vs. enforce resources, and how we want to support this for different resources. I did not see much of an issue with putting in a small amount of refactoring so that others can start experimenting with the stipulation that everything may change in the future, @Unstable. If you disagree I am fine with holding off on it and starting that discussion now. Either here, on MAPREDUCE-4256 , or perhaps in a new rollup JIRA for the full feature.
        Hide
        Andrew Ferguson added a comment -

        Hi Arun,

        I initially considered that approach (having ContainersMonitor go away), but decided against it for the following reasons:

        • Because the initial configuration of cgroups requires root access, some users may not be able to use them. Therefore, it makes sense to keep the current ContainersMonitor, which monitors memory usage and does its own OOM killing.
        • Maintaining the separation between the ContainerExecutor (which just does launching) and the ContainersMonitor (which just does enforcement) lowers the amount of code overlap for supporting each permutation: secure executor with/without cgroups, default executor with/without cgroups, a ContainersMonitor with taskset for CPU instead of cgroups, etc.
        • With this approach, the ContainersMonitor continues to receive the ContainerStopMonitoringEvent event, which allows it to delete cgroups which are no longer in use.
        • Finally, the current ContainersMonitor does not start enforcing the memory limit until after a short delay because of the potential to double-count resources when the JVM does a fork()+exec() – see comment above isProcessTreeOverLimit(). Keeping resource enforcement separate from process launching gives us the flexibility to delay the start of enforcement if needed.

        do these reasons make sense?

        I'm currently writing a "CgroupsContainersMonitor" to complement the "DefaultContainersMonitor," and which plugs-in using this patch. As you point out, this CgroupsContainerMonitor doesn't do much work: it just configures cgroups and places processes in them, allowing the kernel to do the actual enforcement work.

        thanks!
        Andrew

        Show
        Andrew Ferguson added a comment - Hi Arun, I initially considered that approach (having ContainersMonitor go away), but decided against it for the following reasons: Because the initial configuration of cgroups requires root access, some users may not be able to use them. Therefore, it makes sense to keep the current ContainersMonitor, which monitors memory usage and does its own OOM killing. Maintaining the separation between the ContainerExecutor (which just does launching) and the ContainersMonitor (which just does enforcement) lowers the amount of code overlap for supporting each permutation: secure executor with/without cgroups, default executor with/without cgroups, a ContainersMonitor with taskset for CPU instead of cgroups, etc. With this approach, the ContainersMonitor continues to receive the ContainerStopMonitoringEvent event, which allows it to delete cgroups which are no longer in use. Finally, the current ContainersMonitor does not start enforcing the memory limit until after a short delay because of the potential to double-count resources when the JVM does a fork()+exec() – see comment above isProcessTreeOverLimit(). Keeping resource enforcement separate from process launching gives us the flexibility to delay the start of enforcement if needed. do these reasons make sense? I'm currently writing a "CgroupsContainersMonitor" to complement the "DefaultContainersMonitor," and which plugs-in using this patch. As you point out, this CgroupsContainerMonitor doesn't do much work: it just configures cgroups and places processes in them, allowing the kernel to do the actual enforcement work. thanks! Andrew
        Hide
        Arun C Murthy added a comment -

        I'm sorry to come in late...

        I'm a little confused by this approach, so please bear with me.

        It seems to me that ContainersMonitory should actually go away once we goto cgroups etc., since the cgroup itself would 'box' the process in, there-by removing the need for any monitoring from NodeManager. If so, this would be not required at all?

        Thus, once we go via the cgroups route, the LCE should do the necessary work to put the process in the cgroup. This means that we will need different versions of LCE for rhel5/centos5, rhel6/centos6 etc.

        Show
        Arun C Murthy added a comment - I'm sorry to come in late... I'm a little confused by this approach, so please bear with me. It seems to me that ContainersMonitory should actually go away once we goto cgroups etc., since the cgroup itself would 'box' the process in, there-by removing the need for any monitoring from NodeManager. If so, this would be not required at all? Thus, once we go via the cgroups route, the LCE should do the necessary work to put the process in the cgroup. This means that we will need different versions of LCE for rhel5/centos5, rhel6/centos6 etc.
        Hide
        Robert Joseph Evans added a comment -

        Oh also please look into the findbugs and javadoc warnings.

        Show
        Robert Joseph Evans added a comment - Oh also please look into the findbugs and javadoc warnings.
        Hide
        Robert Joseph Evans added a comment -

        I have been looking a little more closely at the patch. I am a bit confused as to why the creation of the ContainersMonitor was moved from ContainerManagerImpl to NodeManager. The NodeManager does not appear to have any need for it.

        to Answer your question

        One point I would like feedback on: ContainerManagerImpl's getContainersMonitor() is only used in NodeManager's init() as part of the call to createWebServer() — but with this patch, the ContainersMonitor is now created locally and passed to the ContainerManager in NodeManager's init(). Does it make sense to remove this function?

        I don't really see a need to change it at this point, especially if we push the creation of the ContainersMonitor back into the ContainerManagerImpl.

        I would also like to see some more documentation about how a ContainersMonitor is supposed to behave. The only API in there is setup. It would be nice to be able to document what events a ContaiersMonitor is expected to handle (START_MONITORING_CONTAINER, STOP_MONITORING_CONTAINER) and how it may optionally stop a misbehaving container ContainerKillEvent.

        Show
        Robert Joseph Evans added a comment - I have been looking a little more closely at the patch. I am a bit confused as to why the creation of the ContainersMonitor was moved from ContainerManagerImpl to NodeManager. The NodeManager does not appear to have any need for it. to Answer your question One point I would like feedback on: ContainerManagerImpl's getContainersMonitor() is only used in NodeManager's init() as part of the call to createWebServer() — but with this patch, the ContainersMonitor is now created locally and passed to the ContainerManager in NodeManager's init(). Does it make sense to remove this function? I don't really see a need to change it at this point, especially if we push the creation of the ContainersMonitor back into the ContainerManagerImpl. I would also like to see some more documentation about how a ContainersMonitor is supposed to behave. The only API in there is setup. It would be nice to be able to document what events a ContaiersMonitor is expected to handle (START_MONITORING_CONTAINER, STOP_MONITORING_CONTAINER) and how it may optionally stop a misbehaving container ContainerKillEvent.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532727/MAPREDUCE-4351-v4.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 javadoc. The javadoc tool appears to have generated 13 warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532727/MAPREDUCE-4351-v4.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 13 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2484//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        reattaching same patch to kick jenkins as its fixed now.

        Show
        Thomas Graves added a comment - reattaching same patch to kick jenkins as its fixed now.
        Hide
        Thomas Graves added a comment -

        this is from the mavenization of pipes. Looks like build machine is missing 32 bit library. I am working on fixing.

        Show
        Thomas Graves added a comment - this is from the mavenization of pipes. Looks like build machine is missing 32 bit library. I am working on fixing.
        Hide
        Andrew Ferguson added a comment -

        Does not appear to be my build error:

        [exec] [ 57%] Building CXX object CMakeFiles/pipes-sort.dir/main/native/examples/impl/sort.cc.o
        [exec] /usr/bin/c++ -g -Wall -O2 -D_REENTRANT -D_FILE_OFFSET_BITS=64 -m32 -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/utils/api -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/pipes/api -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src -o CMakeFiles/pipes-sort.dir/main/native/examples/impl/sort.cc.o -c /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/examples/impl/sort.cc
        [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+.so when searching for -lstdc+
        [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+.a when searching for -lstdc+
        [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+.so when searching for -lstdc+
        [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+.a when searching for -lstdc+
        [exec] /usr/bin/ld: cannot find -lstdc++
        [exec] collect2: ld returned 1 exit status
        [exec] make[2]: *** [pipes-sort] Error 1
        [exec] make[1]: *** [CMakeFiles/pipes-sort.dir/all] Error 2
        [exec] make: *** [all] Error 2

        Show
        Andrew Ferguson added a comment - Does not appear to be my build error: [exec] [ 57%] Building CXX object CMakeFiles/pipes-sort.dir/main/native/examples/impl/sort.cc.o [exec] /usr/bin/c++ -g -Wall -O2 -D_REENTRANT -D_FILE_OFFSET_BITS=64 -m32 -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/utils/api -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/pipes/api -I/home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src -o CMakeFiles/pipes-sort.dir/main/native/examples/impl/sort.cc.o -c /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-tools/hadoop-pipes/src/main/native/examples/impl/sort.cc [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+ .so when searching for -lstdc + [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+ .a when searching for -lstdc + [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+ .so when searching for -lstdc + [exec] /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.4.3/libstdc+ .a when searching for -lstdc + [exec] /usr/bin/ld: cannot find -lstdc++ [exec] collect2: ld returned 1 exit status [exec] make [2] : *** [pipes-sort] Error 1 [exec] make [1] : *** [CMakeFiles/pipes-sort.dir/all] Error 2 [exec] make: *** [all] Error 2
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532618/MAPREDUCE-4351-v4.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2480//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532618/MAPREDUCE-4351-v4.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. -1 javac. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2480//console This message is automatically generated.
        Hide
        Andrew Ferguson added a comment -

        Add @Unstable annotations as per Robert Joseph Evans

        Show
        Andrew Ferguson added a comment - Add @Unstable annotations as per Robert Joseph Evans
        Hide
        Robert Joseph Evans added a comment -

        MAPREDUCE-4256 Is trying to do something similar to this, but splitting the resource monitoring up on a per resource basis. In light of that until we decided how we want monitoring of different resources to look I would like to see NM_CONTAINERS_MONITOR and ContainersMonitor marked as @Unstable. Just so that it is obvious that these are likely to change in the future.

        The rest of the patch looks good to me, but I have not dug into it in great detail.

        Show
        Robert Joseph Evans added a comment - MAPREDUCE-4256 Is trying to do something similar to this, but splitting the resource monitoring up on a per resource basis. In light of that until we decided how we want monitoring of different resources to look I would like to see NM_CONTAINERS_MONITOR and ContainersMonitor marked as @Unstable. Just so that it is obvious that these are likely to change in the future. The rest of the patch looks good to me, but I have not dug into it in great detail.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532593/MAPREDUCE-4351-v3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 javadoc. The javadoc tool appears to have generated 13 warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

        org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
        org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestDefaultContainersMonitor
        org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager
        org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532593/MAPREDUCE-4351-v3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 13 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestDefaultContainersMonitor org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2476//console This message is automatically generated.
        Hide
        Andrew Ferguson added a comment -

        build should be fixed – sorry for the spam. forgot to run 'mvn clean' before.

        Show
        Andrew Ferguson added a comment - build should be fixed – sorry for the spam. forgot to run 'mvn clean' before.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532589/MAPREDUCE-4351-v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2475//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532589/MAPREDUCE-4351-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. -1 javac. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2475//console This message is automatically generated.
        Hide
        Andrew Ferguson added a comment -

        One point I would like feedback on: ContainerManagerImpl's getContainersMonitor() is only used in NodeManager's init() as part of the call to createWebServer() — but with this patch, the ContainersMonitor is now created locally and passed to the ContainerManager in NodeManager's init(). Does it make sense to remove this function?

        thanks!

        Show
        Andrew Ferguson added a comment - One point I would like feedback on: ContainerManagerImpl's getContainersMonitor() is only used in NodeManager's init() as part of the call to createWebServer() — but with this patch, the ContainersMonitor is now created locally and passed to the ContainerManager in NodeManager's init(). Does it make sense to remove this function? thanks!
        Hide
        Andrew Ferguson added a comment -

        passes tests locally

        Show
        Andrew Ferguson added a comment - passes tests locally
        Hide
        Andrew Ferguson added a comment -

        well, these test failures appear to be from this. I'm taking a look.

        Show
        Andrew Ferguson added a comment - well, these test failures appear to be from this. I'm taking a look.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532489/MAPREDUCE-4351-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 javadoc. The javadoc tool appears to have generated 13 warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

        org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532489/MAPREDUCE-4351-v1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 13 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2472//console This message is automatically generated.
        Hide
        Andrew Ferguson added a comment -

        the bulk of the lines in the patch are to rename ContainersMonitorImpl.java to DefaultContainersMonitor.java, and TestContainersMonitor.java to TestDefaultContainersMonitor.java

        Show
        Andrew Ferguson added a comment - the bulk of the lines in the patch are to rename ContainersMonitorImpl.java to DefaultContainersMonitor.java, and TestContainersMonitor.java to TestDefaultContainersMonitor.java
        Hide
        Andrew Ferguson added a comment -

        First cut at making ContainersMonitor pluggable. I have tested that the new configuration option is used, and that it works with a local cluster.

        Show
        Andrew Ferguson added a comment - First cut at making ContainersMonitor pluggable. I have tested that the new configuration option is used, and that it works with a local cluster.

          People

          • Assignee:
            Unassigned
            Reporter:
            Andrew Ferguson
          • Votes:
            0 Vote for this issue
            Watchers:
            26 Start watching this issue

            Dates

            • Created:
              Updated:

              Development