Mesos / MESOS-1689

Race with kernel to kill process / destroy cgroup after OOM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.20.0
    • Sprint: Twitter Mesos Q4 Sprint 2

    Description

      The recently refactored cgroup::destroy code can fail to freeze a freezer cgroup under a particular ordering of events, detailed below. When this happens, the LinuxLauncher fails to destroy the container's freezer cgroup and the other isolators are then unable to destroy their cgroups.
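
      For orientation, the destroy sequence amounts to: freeze the cgroup, kill its processes, thaw so the signals are delivered, then remove the cgroup. A shell sketch of the equivalent steps (illustrative only; the real implementation is C++ inside Mesos, and the path follows the transcripts below):

      # Illustrative sketch of the cgroup::destroy freeze/kill/thaw sequence.
      CG=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a
      echo FROZEN > "$CG/freezer.state"               # start freezing all tasks
      until [ "$(cat "$CG/freezer.state")" = FROZEN ]; do sleep 0.1; done
      xargs -r kill -9 < "$CG/cgroup.procs"           # SIGKILL every member
      echo THAWED > "$CG/freezer.state"               # thaw so signals are delivered
      while [ -s "$CG/cgroup.procs" ]; do sleep 0.1; done  # wait for exits
      rmdir "$CG"                                     # remove the now-empty cgroup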

      This failure is logged but otherwise ignored by a running slave. If the slave is subsequently restarted, it blocks on cgroup::destroy during launcher recovery, times out after 60 seconds, and fails recovery, which causes the slave to terminate. If the slave is supervised and automatically restarted, it will flap repeatedly.

      The problem manifests as a container freezer cgroup that never transitions to FROZEN and remains stuck in FREEZING, because one or more of its processes cannot be frozen.

      [idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
      FREEZING
      [idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs  | xargs ps -L
        PID   LWP TTY      STAT   TIME COMMAND
      29369 29369 ?        Dsl    0:02 python2.6 ./thermos_executor
      29369 29482 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29483 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29484 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29485 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29486 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29487 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29488 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29489 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29490 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29491 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29492 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29493 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29494 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29495 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29496 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29497 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29498 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29499 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29500 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29582 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29583 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29584 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29585 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29604 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29605 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29606 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29607 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29608 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29610 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29612 ?        Dsl    0:00 python2.6 ./thermos_executor
      29369 29613 ?        Dsl    0:00 python2.6 ./thermos_executor
      29526 29526 ?        D      0:02 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
      29578 29578 ?        Ds    29:49 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
      29603 29603 ?        R    254:08 python2.6 /usr/local/bin/package_cache 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r
      
      

      Debugging with jieyu indicates the following sequence of events:
      1. The cgroup reaches its memory limit.

      2. The kernel notifies Mesos that an OOM condition has occurred.

      3. Mesos initiates freezing the cgroup.

      4. There is now a race in the kernel between freezing the cgroup's processes and deciding which process to kill. The kernel does check that the selected process is suitable for killing, and will thaw it if necessary, but there is a window between this check and actually signaling the process. If the selected process (here PID 29578, shown above in state Ds) is frozen inside that window before the signal is delivered, the process does not die and therefore does not release its memory.

      [idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
      Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
      Aug  8 01:52:05 s_all@hostname kernel: [804284.985019] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
      
      [idownes@hostname ~]$ cat /proc/29578/status | grep State
      State:  D (disk sleep)
      
      [idownes@hostname ~]$ cat /proc/29578/status | grep SigPnd
      SigPnd: 0000000000000100
      
      [idownes@hostname ~]$ sudo cat /proc/29578/stack
      [sudo] password for idownes:
      [<ffffffff8107b17d>] __refrigerator+0xc9/0x125
      [<ffffffff8104b1fe>] try_to_freeze+0x30/0x32
      [<ffffffff8104d8e5>] get_signal_to_deliver+0x50/0x4ba
      [<ffffffff81002217>] do_signal+0x49/0x5ec
      [<ffffffff810027e3>] do_notify_resume+0x29/0x65
      [<ffffffff814abd12>] int_signal+0x12/0x17
      [<ffffffffffffffff>] 0xffffffffffffffff
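
      Note the SigPnd mask above: 0x100 has only bit 8 set, and bit N-1 corresponds to signal N, so this is signal 9, SIGKILL. The kill is pending but cannot be delivered while the task sits in __refrigerator. A quick way to decode such a mask (illustrative bash, not part of the original report):

      # Decode a /proc/<pid>/status SigPnd mask into signal names.
      mask=0x0000000000000100
      for sig in $(seq 1 64); do
          if (( (mask >> (sig - 1)) & 1 )); then
              echo "pending: signal $sig ($(kill -l $sig))"
          fi
      done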
      

      5. Mesos continues to wait for the cgroup to freeze, polling on freezer.state.

      6. Meanwhile, another process in the cgroup (here PID 29603, shown above in state R with high CPU time) tries to allocate a page. The resulting page fault cannot be satisfied because the cgroup is at its limit and the selected process has not been killed to free memory. This process keeps faulting, repeatedly bouncing between runnable/running and uninterruptible sleep, and does not get frozen.

      [idownes@hostname ~]$ sudo cat /proc/29603/stack
      [<ffffffff81065246>] __cond_resched+0x26/0x30
      [<ffffffff81125abd>] vmpressure+0x3b/0x7c
      [<ffffffff81125b18>] vmpressure_prio+0x1a/0x1c
      [<ffffffff810f02ce>] do_try_to_free_pages+0x71/0x31c
      [<ffffffff810f0748>] try_to_free_mem_cgroup_pages+0xb0/0x10f
      [<ffffffff81122d3d>] mem_cgroup_reclaim+0x4d/0xb2
      [<ffffffff811231ba>] __mem_cgroup_try_charge+0x2a9/0x8ff
      [<ffffffff81123c52>] mem_cgroup_charge_common+0x35/0x5d
      [<ffffffff81124dba>] mem_cgroup_newpage_charge+0x3a/0x3c
      [<ffffffff810fe48d>] handle_pte_fault+0x66f/0x749
      [<ffffffff810ff3e4>] handle_mm_fault+0xf0/0x123
      [<ffffffff814a8102>] __do_page_fault+0x1b5/0x35f
      [<ffffffff814a82ba>] do_page_fault+0xe/0x10
      [<ffffffff814a53f2>] page_fault+0x22/0x30
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      7. Freezing the cgroup cannot progress. After 60 seconds the slave gives up and the destroy is discarded.
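
      The slave's bounded wait is equivalent in effect to the following poll (a sketch for illustration; the actual wait is implemented in Mesos's C++ cgroups code):

      # Illustrative only: poll freezer.state with the slave's 60 second bound.
      CG=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a
      deadline=$(( $(date +%s) + 60 ))
      until [ "$(cat "$CG/freezer.state")" = FROZEN ]; do
          if [ "$(date +%s)" -ge "$deadline" ]; then
              echo "freeze did not complete within 60s; destroy discarded" >&2
              break
          fi
          sleep 1
      done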

      The cgroup can be unstuck by any of the following:
      a) killing the faulting process, which lets the cgroup finish freezing
      b) thawing the cgroup (writing THAWED to freezer.state), which delivers the pending SIGKILL to the selected process and frees its memory, so the faulting process can proceed and the next freeze attempt succeeds (see the sketch after the transcript below)
      c) increasing the memory cgroup's limit so the page fault can allocate a page, as demonstrated below

      [idownes@hostname ~]$ cat /sys/fs/cgroup/memory/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/memory.limit_in_bytes
      167772160
      [root@hostname idownes]# echo 167782160 > /sys/fs/cgroup/memory/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/memory.limit_in_bytes
      [root@hostname idownes]# cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
      cat: /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state: No such file or directory
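
      (The "No such file or directory" suggests the freeze then completed and the destroy went through.) Remedy (b) can be applied by hand with commands like these (illustrative, run as root, same container path as above):

      # Thawing delivers the pending SIGKILL to the OOM-selected process...
      echo THAWED > /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
      # ...which frees its memory; a subsequent freeze can then complete.
      echo FROZEN > /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state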
      

People

    Assignee: Ian Downes (idownes)
    Reporter: Ian Downes (idownes)