Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-3573

Mesos does not kill orphaned docker containers

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.0.0
    • agent, docker
    • Mesosphere Sprint 32
    • 5

    Description

      After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like there were changes between 0.23.0 and 0.24.0 that broke cleanup.

      Here's how to trigger this bug:

      1. Deploy app in docker container.
      2. Kill corresponding mesos-docker-executor process
      3. Observe hanging container

      Here are the logs after kill:

      slave_1    | I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
      slave_1    | I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
      slave_1    | I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
      slave_1    | I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 20150923-122130-2153451692-5050-1-0000 terminated with signal Terminated
      slave_1    | I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from @0.0.0.0:0
      slave_1    | I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
      slave_1    | W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
      slave_1    | I1002 12:12:59.368671  7791 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
      slave_1    | I1002 12:12:59.368741  7791 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
      slave_1    | I1002 12:12:59.370636  7791 status_update_manager.cpp:376] Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to the slave
      slave_1    | I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to master@172.16.91.128:5050
      slave_1    | I1002 12:12:59.371908  7791 slave.cpp:2899] Status update manager successfully handled status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
      master_1   | I1002 12:12:59.372047    11 master.cpp:4069] Status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from slave 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 (172.16.91.128)
      master_1   | I1002 12:12:59.372534    11 master.cpp:4108] Forwarding status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
      master_1   | I1002 12:12:59.373018    11 master.cpp:5576] Updating the latest state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to TASK_FAILED
      master_1   | I1002 12:12:59.373447    11 hierarchical.hpp:814] Recovered cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from framework 20150923-122130-2153451692-5050-1-0000
      

      Another issue: if you restart mesos-slave on the host with orphaned docker containers, they are not getting killed. This was the case before and I hoped for this trick to kill hanging containers, but it doesn't work now.

      Marking this as critical because it hoards cluster resources and blocks scheduling.

      Attachments

        Issue Links

          Activity

            People

              anandmazumdar Anand Mazumdar
              bobrik Ivan Babrou
              Timothy Chen Timothy Chen
              Votes:
              1 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: