Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: Patch, Important
Description
We've been seeing random errors on our cluster where the container cgroup isn't properly destroyed after the OOM killer kicks in when the memory limit is exceeded - see analysis and patch below:
Agent log:
I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
MEMORY STATISTICS:
cache 0
rss 10502754304
rss_huge 4001366016
shmem 0
mapped_file 270336
dirty 0
writeback 0
swap 0
pgpgin 1684617
pgpgout 95480
pgfault 1670328
pgmajfault 957
inactive_anon 0
active_anon 10501189632
inactive_file 4096
active_file 0
unevictable 0
hierarchical_memory_limit 10770972672
hierarchical_memsw_limit 10770972672
total_cache 0
total_rss 10502754304
total_rss_huge 4001366016
total_shmem 0
total_mapped_file 270336
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 1684617
total_pgpgout 95480
total_pgfault 1670328
total_pgmajfault 957
total_inactive_anon 0
total_active_anon 10501070848
total_inactive_file 4096
total_active_file 0
total_unevictable 0
I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
Initially I thought it was a race condition in the cgroup destruction code, but an strace confirmed that the cgroup removal was only attempted once all tasks had exited (edited and commented strace below, from a different instance of the same problem):
# get the list of processes
3431 23:01:32.293608 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY <unfinished ...>
3431 23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
3431 23:01:32.294220 read(18, <unfinished ...>
3431 23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) = 15 <0.000033>
3431 23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
3431 23:01:32.294346 close(18 <unfinished ...>
3431 23:01:32.294402 <... close resumed> ) = 0 <0.000045>
# kill them
3431 23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
3431 23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
3431 23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
6210 23:01:32.299159 +++ killed by SIGKILL +++
5878 23:01:32.327551 +++ killed by SIGKILL +++
3431 23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL, si_utime=18, si_stime=4} ---
6036 23:01:32.327923 +++ killed by SIGKILL +++
# wait for them
3429 23:01:32.338912 wait4(5878, <unfinished ...>
3429 23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
3429 23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1 ECHILD (No child processes) <0.000005>
3429 23:01:32.340938 wait4(6210, <unfinished ...>
3429 23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG, NULL) = -1 ECHILD (No child processes) <0.000218>
# check cgroup empty
3426 23:01:32.341490 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY <unfinished ...>
3426 23:01:32.342474 read(47, <unfinished ...>
3426 23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
3426 23:01:32.342940 close(47 <unfinished ...>
3426 23:01:32.344579 <... close resumed> ) = 0 <0.001631>
# cgroup empty - attempt to delete
3428 23:01:32.345596 rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb" <unfinished ...>
3428 23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource busy) <0.000068>
So, as can be seen, the cgroup destruction code correctly waits until all tasks have exited before destroying the cgroup, and the cgroup is indeed empty at that point, as shown by the read from cgroup.procs returning nothing right before the rmdir.
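The emptiness check visible in the trace can be sketched in C (the helper name is hypothetical; the real logic lives in Mesos' cgroups code):

```c
#include <stdio.h>

/* Returns 1 if the cgroup's cgroup.procs file is empty (no member
 * processes left), 0 if it still lists PIDs, -1 on error. This
 * mirrors the read of cgroup.procs seen in the strace right before
 * the rmdir. Helper name is illustrative, not Mesos' actual code. */
static int cgroup_is_empty(const char *cgroup_dir)
{
    char path[4096];
    char buf[16];
    FILE *f;
    size_t n;

    snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* An empty cgroup.procs reads as zero bytes. */
    n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    return n == 0 ? 1 : 0;
}
```

On affected kernels, even when this check returns 1 the subsequent rmdir() can still fail with EBUSY, which is exactly the behavior captured in the trace above.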
It turns out that it's actually a kernel bug recently reported and fixed: https://lkml.org/lkml/2020/1/15/1349
See the reproducer attached - it repeatedly starts a task consuming several GB of memory, with a memory limit set slightly below that to trigger the OOM killer.
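The memory-hog side of such a reproducer boils down to allocating and touching more anonymous memory than the cgroup limit allows. A minimal sketch of that part (the cgroup setup and the limit below the allocation size are assumed to be done by the surrounding script, as in the attached reproducer; this is not the attachment itself):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate `bytes` of anonymous memory and touch every page, so that
 * rss actually grows past the cgroup memory limit and the OOM killer
 * fires. Returns the buffer, or NULL on allocation failure. */
static char *allocate_and_touch(size_t bytes)
{
    char *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;
    /* memset forces the kernel to back every page with real memory;
     * a bare malloc would mostly leave lazily-mapped zero pages that
     * never count against rss. */
    memset(buf, 0xAB, bytes);
    return buf;
}
```

Run inside a memory cgroup whose limit is slightly below `bytes`, the process gets OOM-killed, and the agent then attempts to destroy the now-empty cgroup.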
It's a kernel bug, but since it affects virtually every kernel and Mesos version currently in use, I think it would make sense to add a workaround - simply retrying the cgroup removal after a timeout fixes it for me; see patch attached.