  1. Mesos
  2. MESOS-3808

slave/containerizer/docker leaves orphan containers on restart of mesos-slave


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.25.0
    • Fix Version/s: None
    • Component/s: agent, containerization, docker
    • Labels: None
    • Environment: CoreOS. Running mesos-slave in a container.

    Description

      We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker containers launched by Mesos were being orphaned and not destroyed when the Mesos agent was restarted.

      Relevant log output:

      I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
      I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
      I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
      I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
      I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
      I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
      E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
      I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
      I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
      I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
      E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
      I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
      I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
      I1027 20:36:22.518592 23005 slave.cpp:3433] Executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
      I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
      I1027 20:36:22.519263 23005 slave.cpp:3433] Executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
      I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
      W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
      W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
      I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
      I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
      I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
      I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
      W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
      W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
      I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered executors
      I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
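
      To compare the log against what Docker reports, it helps to pull out the container IDs the agent claims to have stopped. The following is a minimal sketch, not part of Mesos; the regular expression is inferred only from the log lines above, and `stopped_container_ids` is a hypothetical helper name.

```python
import re

# Pattern inferred from the "Running docker stop on container" lines in the
# log excerpt above; it is an assumption, not taken from the Mesos source.
STOP_RE = re.compile(r"Running docker stop on container '([0-9a-f-]+)'")


def stopped_container_ids(log_text):
    """Return the container IDs the agent logged a `docker stop` for, in log order."""
    return STOP_RE.findall(log_text)


sample = """\
I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
"""

for container_id in stopped_container_ids(sample):
    print(container_id)
```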
      

      Docker output:

      CONTAINER ID        IMAGE                             COMMAND                CREATED              STATUS              PORTS               NAMES
      8d0d69fe34d7        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
      e4344cfbcc6d        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
      3ce690f3b872        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
      5b4546d3087a        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
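
      The NAMES column above appears to follow the pattern `mesos-<agent ID>.<container ID>`. Assuming that pattern holds (it is read off this output, not confirmed against the Mesos source), a small sketch like the one below can list containers the agent tried to stop but that are still running. The helper names `split_docker_name` and `still_running` are hypothetical.

```python
PREFIX = "mesos-"  # assumed name prefix, as seen in the NAMES column


def split_docker_name(name):
    """Split 'mesos-<agent ID>.<container ID>' into (agent_id, container_id).

    Returns None when the name does not fit the assumed pattern.
    """
    if not name.startswith(PREFIX):
        return None
    agent_id, sep, container_id = name[len(PREFIX):].partition(".")
    if not sep or not agent_id or not container_id:
        return None
    return agent_id, container_id


def still_running(stopped_ids, running_names):
    """Return the IDs from stopped_ids that still appear among running_names."""
    running = set()
    for name in running_names:
        parts = split_docker_name(name)
        if parts is not None:
            running.add(parts[1])
    return [cid for cid in stopped_ids if cid in running]


# Names copied from the docker output above; IDs copied from the agent log.
names = [
    "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31",
    "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614",
    "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31",
]
stopped = [
    "a2308dfc-ec2f-4687-ae92-f045dd2d3614",
    "77b1748e-f295-4eb5-9966-d7a3bba2fc31",
]

for container_id in still_running(stopped, names):
    print("orphaned:", container_id)
```

      With the data from this report, both containers the agent logged a `docker stop` for are still listed as running.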
      

      After digging into the issue, it seems the comment at the link below may point to the cause:
      https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97

      It appears that the recovery command is still only sending the containerId and not the frameworkId + containerId.

            People

              Assignee: Gilbert Song
              Reporter: Chris Fortier
              Votes: 0
              Watchers: 7

                Time Tracking

                  Estimated: 4h
                  Remaining: 4h
                  Logged: Not Specified