Description
Right now there's a case where a misbehaving executor can cause a slave process to flap:
1) User tries to kill an instance
2) Slave sends KillTaskMessage to executor
3) Executor sends kill signals to task processes
4) Executor sends TASK_KILLED to slave
5) Slave updates the container's CPU limit to 0.01 cpus
6) A task process is still handling the kill signal
7) The task process cannot exit because it has too little CPU share and is throttled
8) Executor itself terminates
9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path.
10) Slave restarts and then flaps continuously because it cannot kill orphan containers
The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).
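One way to read "deal with this case" is that a failed destroy of a single orphan container should not abort the slave's recovery. The sketch below illustrates that idea only; it is not Mesos code and not necessarily the fix that was adopted. The names `recover_orphans`, `DestroyError`, and `fake_destroy` are hypothetical, and the container IDs are made up.

```python
from typing import Callable, Dict, Iterable, List


class DestroyError(Exception):
    """Hypothetical error raised when a container cannot be torn down."""


def recover_orphans(
    orphan_ids: Iterable[str],
    destroy: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Try to destroy every orphan container without aborting on failure.

    Containers that cannot be destroyed are reported as "stuck" so the
    caller can retry later, instead of the whole process exiting and
    restarting (the flapping described in step 10).
    """
    destroyed: List[str] = []
    stuck: List[str] = []
    for container_id in orphan_ids:
        try:
            destroy(container_id)
            destroyed.append(container_id)
        except DestroyError:
            # Record the failure and keep going with the remaining orphans.
            stuck.append(container_id)
    return {"destroyed": destroyed, "stuck": stuck}


def fake_destroy(container_id: str) -> None:
    """Stand-in for real teardown; "c2" models the throttled, stuck container."""
    if container_id == "c2":
        raise DestroyError(f"process in {container_id} is stuck in the exit path")


result = recover_orphans(["c1", "c2", "c3"], fake_destroy)
print(result)
```

In this sketch the caller decides what to do with the "stuck" list, for example retrying the destroy after a delay.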
Issue Links
- is blocked by
  - MESOS-2528 Symlink the namespace handle with ContainerID for the port mapping isolator. (Resolved)
- relates to
  - MESOS-8830 Agent gc on old slave sandboxes could empty persistent volume data (Resolved)
  - MESOS-2903 Network isolator should not fail when target state already exists (Resolved)