Description
When you run with docker and enable cgroups for cpu, docker creates cgroups for all resources on the system, not just for cpu. For instance, if yarn.nodemanager.linux-container-executor.cgroups.hierarchy is set to /hadoop-yarn, the nodemanager will create a cgroup for each container under /sys/fs/cgroup/cpu/hadoop-yarn. In the docker case, we pass this path to docker via the --cgroup-parent command line argument, and docker then creates a cgroup for the docker container underneath it, for instance: /sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id.
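To illustrate, here is a minimal sketch of the launch from docker's point of view. The container ID is hypothetical, and in YARN the --cgroup-parent value is assembled by the container runtime rather than by hand:

    import subprocess

    # Hypothetical YARN container ID; the nodemanager generates these.
    yarn_container_id = "container_1550000000000_0001_01_000002"

    # The nodemanager's per-container cgroup becomes docker's parent cgroup.
    cgroup_parent = "/hadoop-yarn/" + yarn_container_id

    # Roughly what the runtime does at launch: docker creates its own leaf
    # cgroup under this parent, e.g.
    # /sys/fs/cgroup/cpu/hadoop-yarn/<container_id>/<docker_container_id>
    subprocess.run(
        ["docker", "run", "--rm",
         "--cgroup-parent", cgroup_parent,
         "busybox", "true"],
        check=True,
    )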
When the container exits, docker cleans up the docker_container_id cgroup, and the nodemanager cleans up the container_id cgroup. All is good under /sys/fs/cgroup/cpu/hadoop-yarn.
The problem is that docker also creates that same hierarchy under every other cgroup controller mounted under /sys/fs/cgroup. On the rhel7 system I am using, these are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, perf_event, and systemd. For instance, docker creates /sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id, but it only cleans up the leaf docker_container_id cgroup. Nobody cleans up the intermediate container_id cgroups for these other controllers. On one of our busy clusters, we found > 100,000 of these leaked cgroups.
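A rough way to see the scale of the leak is to count the container_* directories left under each controller. This is only a sketch: it assumes the default /sys/fs/cgroup mount and the /hadoop-yarn hierarchy above, and it will also count cgroups belonging to containers that are still running:

    import os

    CGROUP_ROOT = "/sys/fs/cgroup"
    HIERARCHY = "hadoop-yarn"  # matches ...cgroups.hierarchy=/hadoop-yarn

    # Count per-container cgroups present under each controller.
    for controller in sorted(os.listdir(CGROUP_ROOT)):
        base = os.path.join(CGROUP_ROOT, controller, HIERARCHY)
        if not os.path.isdir(base):
            continue
        stale = [d for d in os.listdir(base) if d.startswith("container_")]
        print(f"{controller}: {len(stale)} container cgroups")

On a healthy node only the cpu controller (plus any controllers the nodemanager is configured to manage) should show entries for running containers; large counts under cpuset, memory, etc. are the leak described above.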
I found this in our 2.8-based version of hadoop, but I have been able to reproduce it with current hadoop.