Hi Bikas, thanks for thinking about this! Comments inline:
Somewhere in this thread it was mentioned controlling memory via OS. In my experience this is not an optimal choice because
1) makes it hard to debug task failures due to memory issues. Abrupt OS termination or denial or more memory resulting in NPE/bad pointers etc. Its better to just monitor the memory and then enforce limits with clear error message saying - task was terminated because it used more memory than alloted.
On Linux, enforcing memory limits via Cgroups feels a bit like simply running a process on a machine with less memory installed. When the memory allocation is pushing the threshold, the Linux OOM killer destroys the task. The patch above detects that the process has been killed and logs a error message indicating that the task was killed for consuming too many reousrces.
2) due to different scenarios, tasks may have memory spikes or temporary increases. The OS will enforce tight limits but NodeManager monitoring can be more flexible and not terminate a task because it shot to 2.1GB instead of staying under 2.
I would argue that the strict enforcement of Cgroups is exactly the behavior we want because it provides isolation. If two containers are running on a node with 4 GB of RAM, and each are using 2 GB, and one happens to spike to 3 GB momentarily, the spiking container should suffer – if we continue monitoring the memory as done today, then the well-behaved container might suffer by being swapped-out to make room for the spiking container.
I believe the spiking concern is mitigated by the fact that Cgroups allows you to set both a physical memory limit, and a virtual memory limit (which my patch above makes use of). For example, I set the physical memory limit to say, 1 GB of RAM, and the virtual memory limit to 2.1 GB. When a process momentarily spikes above it's 1 GB of RAM, it will be allocated memory from swap without a problem. This is configurable by the already extant "yarn.nodemanager.vmem-pmem-ratio" setting.
Disk scheduling and monitoring would be a hard to achieve goal with multiple writers to disk spinning things their own way and expecting something that will likely not happen.
Sure, it is tricky, and the feasibility depends on the semantics YARN promises applications. However, the Linux Completely Fair Queuing I/O scheduler has semantics which are quite similar to the semantics I'm proposing we promise for CPUs (proportional weights). The blkio Cgroup subsystem already today provides both proportional sharing and throttling: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Subsystems_and_Tunable_Parameters.html#sec-blkio
Network scheduling and monitoring shares choke points at multiple levels beyond the machines and trying to optimally and proportionally use the network tends to be a problem thats better served globally.
YARN is a global scheduler. Linux traffic controls , in combination with the network controller for Cgroups, can be used to implement the results of Seawall , FairCloud , and similar projects. There are many datacenter designs these days; some will be a perfect match for end-host-only bandwidth control, and others an imperfect match. While end-host-only bandwidth control is not a magic bullet, I strongly believe that it is both useful enough, and easy enough to implement, to warrant pursuit.
My 2 cents would be to limit this to just CPU for now.
It is. However, I believe the patch above is easily extensible to other resources (you can see for yourself that there is a small difference between the memory-only patch, and the memory+cpu patch).
Based on the comments above, I would agree that we need to make sure platform specific stuff should not leak into the code so that other platforms (imminently Windows) can support this stuff.
Totally agree. That's why I proposed making it pluggable with MAPREDUCE-4351.
An alternative to pluggable ContainersMonitor would be to make CPU management a pluggable component of ContainersManager. My POV is that ContainersManager manages the resources of containers and has logic that will be common across platforms. The tools it uses will change. Eg. ProcfsBaseProcessTree is the tool used to monitor and manage memory. I can see that being changed to a MemoryMonitor interface with platform specific implementations. Thats whats happening on the Windows port in branch 1. I can see a CPUMonitor interface for CPU. Or maybe a ResourceMonitor that has methods for both memory and CPU.
I'm afraid I'm a bit confused by your suggestion here – ContainersMonitor is already a part of the ContainersManager. Are you proposing that we create a pluggable interface for each type of resource independently? Perhaps you can point me to the code & branch which has the suggestion you are describing? There are two pieces to resource management: monitoring & enforcement, and both are platform-specific. Because multiple Linux enforcement solutions (the current Java-native, the above Cgroups, and the planned taskset) can all use the same Linux-specific monitoring code, it seems reasonable to keep the two features separate. The monitoring code is already pluggable (ResourceCalculatorPlugin).
 http://lartc.org/howto/ and 'man tc'