MESOS-6322

Agent fails to kill empty parent container


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Sprint: Mesosphere Sprint 44
    • Story Points: 3

    Description

      I launched a pod using Marathon, which launched a task group on a Mesos agent. The pod spec was flawed, so Marathon ended up repeatedly re-launching instances of the task group. After this had gone on for a few minutes, I told Marathon to scale the app to 0 instances, which causes Marathon to send a kill for one task in each task group. After this, the Mesos agent reported TASK_KILLED status updates for all 3 tasks in the pod, but hitting the /containers endpoint on the agent reveals that the executor container for this task group is still running.
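
      The reproduction was roughly as follows, using Marathon's /v2/pods API. The pod definition below is only an illustrative stand-in (the actual, flawed spec is not reproduced here) saved as test-pod.json, and the Marathon address, commands, and resource values are placeholders:

      {
        "id": "/test-pod",
        "scaling": { "kind": "fixed", "instances": 1 },
        "containers": [
          { "name": "healthTask1", "exec": { "command": { "shell": "sleep 1000" } }, "resources": { "cpus": 0.1, "mem": 32 } },
          { "name": "healthTask2", "exec": { "command": { "shell": "sleep 1000" } }, "resources": { "cpus": 0.1, "mem": 32 } },
          { "name": "clientTask",  "exec": { "command": { "shell": "sleep 1000" } }, "resources": { "cpus": 0.1, "mem": 32 } }
        ]
      }

      Deploying the pod and then scaling it to 0 instances:

      # Deploy the pod (Marathon address is a placeholder).
      curl -s -X POST -H 'Content-Type: application/json' -d @test-pod.json http://<marathon-host>:8080/v2/pods

      # Scale the pod to 0 instances: PUT the same definition with "instances" set to 0.
      sed 's/"instances": 1/"instances": 0/' test-pod.json | curl -s -X PUT -H 'Content-Type: application/json' -d @- http://<marathon-host>:8080/v2/pods/test-pod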

      Here is the task group launching on the agent:

      slave.cpp:1696] Launching task group containing tasks [ test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
      

      and here is the executor container starting:

      mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
      

      and here is the output showing the TASK_KILLED updates for one task group:

      mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
      mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning the state of task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 to TASK_KILLED because the executor is not registered
      mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
      mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
      mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
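
      (These terminal updates can also be cross-checked from the master's /tasks endpoint; the master address below is a placeholder and 5050 is the default master port.)

      curl -s http://<master-ip>:5050/tasks | python -m json.tool | grep -E '"name"|"state"'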
      

      however, if we grep the log for the executor's ID, the last line mentioning it is:

      slave.cpp:3080] Creating a marker file for HTTP based executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 (via HTTP) at path '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
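
      (For reference, that search was along these lines; this is a DC/OS node, so the agent logs to journald, and the unit name below is an assumption.)

      journalctl -u dcos-mesos-slave.service | grep 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' | tail -n 1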
      

      so it seems the executor never exited. If we hit the agent's /containers endpoint, we get output which includes this executor container:

      {
          "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
          "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
          "executor_name": "",
          "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000",
          "source": "",
          "statistics": {
            "cpus_limit": 0.1,
            "cpus_nr_periods": 17,
            "cpus_nr_throttled": 11,
            "cpus_system_time_secs": 0.02,
            "cpus_throttled_time_secs": 0.784142448,
            "cpus_user_time_secs": 0.09,
            "disk_limit_bytes": 10485760,
            "disk_used_bytes": 20480,
            "mem_anon_bytes": 11337728,
            "mem_cache_bytes": 0,
            "mem_critical_pressure_counter": 0,
            "mem_file_bytes": 0,
            "mem_limit_bytes": 33554432,
            "mem_low_pressure_counter": 0,
            "mem_mapped_file_bytes": 0,
            "mem_medium_pressure_counter": 0,
            "mem_rss_bytes": 11337728,
            "mem_swap_bytes": 0,
            "mem_total_bytes": 12013568,
            "mem_unevictable_bytes": 0,
            "timestamp": 1475792290.12373
          },
          "status": {
            "executor_pid": 9068,
            "network_infos": [
              {
                "ip_addresses": [
                  {
                    "ip_address": "9.0.1.34",
                    "protocol": "IPv4"
                  }
                ],
                "labels": {},
                "name": "dcos",
                "port_mappings": [
                  {
                    "container_port": 8080,
                    "host_port": 24758,
                    "protocol": "tcp"
                  },
                  {
                    "container_port": 8081,
                    "host_port": 24759,
                    "protocol": "tcp"
                  }
                ]
              }
            ]
          }
        },
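
      (The listing above comes from a query along these lines; the agent address is a placeholder, and 5051 is the default agent port.)

      curl -s http://<agent-ip>:5051/containers | python -m json.tool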
      

      and looking through the output of ps on the agent, indeed we can locate the executor process:

      $ ps aux | grep 9068
      root      9068  0.0  0.1  96076 25380 ?        Ss   20:23   0:00 mesos-containerizer launch --command={"arguments":["mesos-default-executor"],"shell":false,"user":"root","value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-default-executor"} --help=false --pipe_read=49 --pipe_write=50 --pre_exec_commands=[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"},{"shell":true,"value":"ifconfig lo up"}] --runtime_directory=/var/run/mesos/containers/bf38ff09-3da1-487a-8926-1f4cc88bce32 --unshare_namespace_mnt=false --user=root --working_directory=/var/lib/mesos/slave/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32
      core     20406  0.0  0.0   6764  1020 pts/1    S+   23:05   0:00 grep --colour=auto 9068
      $ sudo cat /proc/9068/task/9068/children
      9330
      $ ps aux | grep 9330
      root      9330  0.0  0.2 498040 32944 ?        Sl   20:23   0:00 mesos-default-executor
      root     19330  0.0  0.0      0     0 ?        S    22:49   0:00 [kworker/0:2]
      core     20573  0.0  0.0   6764   888 pts/1    S+   23:07   0:00 grep --colour=auto 9330
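
      (As an additional check, the container's runtime directory, taken from the --runtime_directory flag in the launch command above, can be inspected to confirm the containerizer is still tracking this container.)

      # Path taken verbatim from the --runtime_directory flag in the ps output above.
      sudo ls -l /var/run/mesos/containers/bf38ff09-3da1-487a-8926-1f4cc88bce32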
      

      Looking at the executor's logs, we find stdout is:

      Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"}'
      Executing pre-exec command '{"shell":true,"value":"ifconfig lo up"}'
      

      and stderr is:

      I1006 20:23:28.817401  9330 executor.cpp:189] Version: 1.1.0
      I1006 20:23:28.906349  9352 default_executor.cpp:123] Received SUBSCRIBED event
      I1006 20:23:28.908797  9352 default_executor.cpp:127] Subscribed executor on 10.0.0.133
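
      (These logs are read from the executor's sandbox, whose path appears as --working_directory in the ps output above.)

      cd /var/lib/mesos/slave/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32
      cat stdout stderr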
      

      Given this short executor log, it seems possible that the agent received the kill requests before any tasks had been delivered to the executor, transitioned the tasks to TASK_KILLED, and removed them from its data structures without ever terminating the now-empty parent (executor) container.

People

    Assignee: Anand Mazumdar
    Reporter: Greg Mann
    Shepherd: Vinod Kone