[MESOS-2367] Improve slave resiliency in the face of orphan containers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.23.0
Component/s: agent
Labels: None
Sprint: Twitter Mesos Q1 Sprint 5, Twitter Q2 Sprint 1 - 4/13
Story Points: 5

Description

Currently, a misbehaving executor can cause the slave process to flap (restart repeatedly):

Quoting jieyu:

1) A user tries to kill an instance.
2) The slave sends a KillTaskMessage to the executor.
3) The executor sends kill signals to the task processes.
4) The executor sends TASK_KILLED to the slave.
5) The slave updates the container's cpu limit to 0.01 cpus.
6) A user process is still handling the kill signal.
7) That process cannot exit because it has too little cpu share and is throttled.
8) The executor itself terminates.
9) The slave tries to destroy the container but cannot, because the user process is stuck in the exit path.
10) The slave restarts and then flaps constantly because it cannot kill the orphan container.
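
To see why step 7 can leave the process stuck for a long time, the throttling can be worked out numerically. The sketch below is illustrative only: it assumes the 0.01-cpu limit is enforced as a hard cap through Linux CFS bandwidth control with a 100 ms period, which the issue does not state. The function names and the 50 ms of remaining exit work are hypothetical.

```python
# Illustrative arithmetic only; not Mesos code.
# Assumption: the CPU limit is a hard cap enforced per CFS period (100 ms here).

CFS_PERIOD_US = 100_000  # assumed period: 100 ms


def cfs_quota_us(cpus: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds the container may consume per period."""
    return int(round(cpus * period_us))


def throttled_wall_time_us(cpu_work_us: int, cpus: float,
                           period_us: int = CFS_PERIOD_US) -> int:
    """Approximate upper bound on wall-clock time to finish cpu_work_us of CPU work.

    Each period the container runs for at most its quota and is then
    throttled until the next period starts.
    """
    quota = cfs_quota_us(cpus, period_us)
    periods = -(-cpu_work_us // quota)  # ceiling division
    return periods * period_us


if __name__ == "__main__":
    # Hypothetical: the process still needs 50 ms of CPU time to finish exiting.
    print(cfs_quota_us(0.01))
    print(throttled_wall_time_us(50_000, 0.01))
```

Under these assumptions a 0.01-cpu container gets 1 ms of CPU time per 100 ms period, so 50 ms of remaining exit work takes roughly 5 s of wall-clock time, and proportionally longer if more work remains.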

The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).
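
One way to picture the requested behavior is a recovery routine that records containers it cannot destroy instead of treating the failure as fatal. This is a minimal sketch of that idea, not the change that resolved this issue. `recover_orphans`, `RecoveryResult`, and the `destroy` callback are hypothetical names, and a real agent would do this work asynchronously.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RecoveryResult:
    """Outcome of one orphan-cleanup pass (hypothetical type)."""
    destroyed: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)  # to retry later


def recover_orphans(orphans: Iterable[str],
                    destroy: Callable[[str], bool]) -> RecoveryResult:
    """Try to destroy each orphan container without aborting on failure.

    `destroy` is an assumed callback that returns True on success and
    returns False or raises when the container cannot be destroyed yet.
    """
    result = RecoveryResult()
    for container_id in orphans:
        try:
            ok = destroy(container_id)
        except Exception:
            # A stuck container must not take the whole recovery down.
            ok = False
        if ok:
            result.destroyed.append(container_id)
        else:
            result.pending.append(container_id)
    return result


if __name__ == "__main__":
    def fake_destroy(container_id: str) -> bool:
        # Simulates a container whose process is stuck in the exit path.
        if container_id == "stuck":
            raise RuntimeError("process stuck in exit path")
        return True

    print(recover_orphans(["a", "stuck", "b"], fake_destroy))
```

The point of the design is that a failed destroy becomes state to retry or report rather than a reason for the slave to exit, which is what breaks the restart loop described in step 10.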

Issue Links

is blocked by

MESOS-2528 Symlink the namespace handle with ContainerID for the port mapping isolator. (Resolved)

relates to

MESOS-8830 Agent gc on old slave sandboxes could empty persistent volume data (Resolved)

MESOS-2903 Network isolator should not fail when target state already exists (Resolved)

People

Assignee: Jie Yu

Reporter: Joe Smith

Votes: 0

Watchers: 7

Dates

Created: 18/Feb/15 02:35

Updated: 10/May/18 17:07

Resolved: 24/Apr/15 23:38