[MESOS-3573] Mesos does not kill orphaned docker containers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: agent, docker
Labels: mesosphere
Sprint: Mesosphere Sprint 32
Story Points: 5

Description

After upgrading to 0.24.0 we noticed hanging containers appearing. It looks like changes between 0.23.0 and 0.24.0 broke container cleanup.

Here's how to trigger this bug:

1. Deploy an app in a docker container.
2. Kill the corresponding mesos-docker-executor process.
3. Observe the hanging container.
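
Steps 2 and 3 above could be scripted roughly as follows. This is a minimal sketch, not something taken from the report: it assumes `pgrep` and the `docker` CLI are available on the agent host, and the helper names (`parse_pids`, `kill_docker_executors`, `running_container_names`) are hypothetical. SIGTERM is used because the log below shows the executor "terminated with signal Terminated".

```python
import os
import signal
import subprocess
from typing import List


def parse_pids(pgrep_output: str) -> List[int]:
    """Turn the whitespace-separated output of `pgrep` into a list of PIDs."""
    return [int(token) for token in pgrep_output.split() if token.isdigit()]


def kill_docker_executors() -> List[int]:
    """Step 2: send SIGTERM to every mesos-docker-executor process on this host."""
    result = subprocess.run(
        ["pgrep", "-f", "mesos-docker-executor"],
        capture_output=True, text=True, check=False,  # pgrep exits 1 if nothing matches
    )
    pids = parse_pids(result.stdout)
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids


def running_container_names() -> List[str]:
    """Step 3: list the names of containers Docker still reports as running."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

Calling `kill_docker_executors()` and then, after a short wait, `running_container_names()` should show whether the task's container is still listed.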

Here are the logs after the kill:

slave_1    | I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
slave_1    | I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
slave_1    | I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
slave_1    | I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 20150923-122130-2153451692-5050-1-0000 terminated with signal Terminated
slave_1    | I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from @0.0.0.0:0
slave_1    | I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
slave_1    | W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
slave_1    | I1002 12:12:59.368671  7791 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
slave_1    | I1002 12:12:59.368741  7791 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
slave_1    | I1002 12:12:59.370636  7791 status_update_manager.cpp:376] Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to the slave
slave_1    | I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to master@172.16.91.128:5050
slave_1    | I1002 12:12:59.371908  7791 slave.cpp:2899] Status update manager successfully handled status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
master_1   | I1002 12:12:59.372047    11 master.cpp:4069] Status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from slave 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 (172.16.91.128)
master_1   | I1002 12:12:59.372534    11 master.cpp:4108] Forwarding status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
master_1   | I1002 12:12:59.373018    11 master.cpp:5576] Updating the latest state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to TASK_FAILED
master_1   | I1002 12:12:59.373447    11 hierarchical.hpp:814] Recovered cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from framework 20150923-122130-2153451692-5050-1-0000
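
To pick the interesting entries out of a log like the one above, a small parser for the glog-style lines can help. The sketch below is illustrative only: the regular expression is inferred from the lines shown in this report (an optional `service | ` prefix, then severity letter, date, time, thread ID, source file and line), and the names `LogEntry`, `parse_line` and `warnings_and_errors` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    service: Optional[str]  # e.g. "slave_1"; None when there is no prefix
    severity: str           # I, W, E or F
    source: str             # e.g. "docker.cpp"
    line: int               # line number inside the source file
    message: str


LOG_LINE = re.compile(
    r"^(?:(?P<service>\S+)\s*\|\s*)?"         # optional "slave_1    | " prefix
    r"(?P<severity>[IWEF])(?P<date>\d{4})\s+"  # severity letter plus MMDD
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"   # HH:MM:SS.microseconds
    r"(?P<thread>\d+)\s+"                      # thread ID
    r"(?P<source>[\w.]+):(?P<line>\d+)\]\s"    # file.cpp:123]
    r"(?P<message>.*)$"
)


def parse_line(raw: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the format."""
    match = LOG_LINE.match(raw.strip())
    if match is None:
        return None
    return LogEntry(
        match["service"], match["severity"], match["source"],
        int(match["line"]), match["message"],
    )


def warnings_and_errors(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries logged at warning severity or above."""
    entries = (parse_line(raw) for raw in lines)
    return [e for e in entries if e is not None and e.severity in "WEF"]


# Two lines copied from the log in this report.
SAMPLE = [
    "slave_1    | I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'",
    "slave_1    | W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8",
]
```

Run over the full log, `warnings_and_errors` would leave only the `docker.cpp:986` warning about the unknown container, which is the one non-informational line in the excerpt.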

Another issue: if you restart mesos-slave on a host with orphaned docker containers, they are not killed. Earlier versions did kill them, and I hoped a restart would work as a trick to clean up hanging containers, but it no longer does.
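
Until cleanup works again, orphans can be identified by comparing what Docker reports as running with the container IDs the agent still tracks. The sketch below shows only that comparison. It rests on an assumption not stated in this report: that containers launched by the agent have names starting with `mesos-` and ending with the Mesos container ID. Check the naming on your version before relying on it. The function name `find_orphans` is hypothetical.

```python
from typing import Iterable, List, Set


def find_orphans(
    docker_names: Iterable[str],
    known_container_ids: Set[str],
    prefix: str = "mesos-",  # assumed naming convention, verify for your version
) -> List[str]:
    """Return names of Mesos-launched containers the agent no longer tracks."""
    orphans = []
    for name in docker_names:
        if not name.startswith(prefix):
            continue  # not launched by the Mesos agent, leave it alone
        if not any(name.endswith(cid) for cid in known_container_ids):
            orphans.append(name)
    return orphans
```

The names would come from something like `docker ps --format '{{.Names}}'`, and the returned list is what an operator would pass to `docker stop` after reviewing it.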

Marking this as critical because it hoards cluster resources and blocks scheduling.

Issue Links

is duplicated by: MESOS-3808 slave/containerizer/docker leaves orphan containers on restart of mesos-slave (Resolved)

People

Assignee: Anand Mazumdar
Reporter: Ivan Babrou
Shepherd: Timothy Chen
Votes: 1
Watchers: 15

Dates

Created: 02/Oct/15 12:26
Updated: 26/Nov/18 12:20
Resolved: 26/Nov/18 12:20