Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
Description
I launched a pod using Marathon, which led to a task group being launched on a Mesos agent. The pod spec was flawed, so Marathon repeatedly re-launched instances of the task group. After this went on for a few minutes, I told Marathon to scale the app to 0 instances, which causes a kill to be issued for one task in each task group. After this, the Mesos agent reports TASK_KILLED status updates for all 3 tasks in the pod, but hitting the /containers endpoint on the agent reveals that the executor container for this task group is still running.
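For context, the reproduction amounts to roughly the sketch below. The pod ID and container names follow the task names seen in the logs, while the Marathon host and the container commands are made-up stand-ins (the real pod spec was flawed in a way that kept its tasks failing); the endpoints are Marathon's standard pods API:
# Launch a three-container pod through Marathon's pods API (host/port assumed).
curl -X POST http://marathon.example.com:8080/v2/pods \
  -H 'Content-Type: application/json' \
  -d '{
        "id": "/test-pod",
        "scaling": {"kind": "fixed", "instances": 1},
        "containers": [
          {"name": "healthTask1", "resources": {"cpus": 0.1, "mem": 32},
           "exec": {"command": {"shell": "sleep 1000"}}},
          {"name": "healthTask2", "resources": {"cpus": 0.1, "mem": 32},
           "exec": {"command": {"shell": "sleep 1000"}}},
          {"name": "clientTask", "resources": {"cpus": 0.1, "mem": 32},
           "exec": {"command": {"shell": "sleep 1000"}}}
        ]
      }'
# Scaling to 0 instances is a PUT of the same definition to /v2/pods/test-pod
# with "scaling": {"kind": "fixed", "instances": 0}.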
Here is the task group launching on the agent:
slave.cpp:1696] Launching task group containing tasks [ test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
and here is the executor container starting:
mesos-agent[2994]: I1006 20:23:27.407563 3094 containerizer.cpp:965] Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
and here is the output showing the TASK_KILLED updates for one task group:
mesos-agent[2994]: I1006 20:23:28.728224 3097 slave.cpp:2283] Asked to kill task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
mesos-agent[2994]: W1006 20:23:28.728304 3097 slave.cpp:2364] Transitioning the state of task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 to TASK_KILLED because the executor is not registered
mesos-agent[2994]: I1006 20:23:28.728659 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
mesos-agent[2994]: I1006 20:23:28.728817 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
mesos-agent[2994]: I1006 20:23:28.728912 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
However, if we grep the agent log for the executor's ID, the last line mentioning it is:
slave.cpp:3080] Creating a marker file for HTTP based executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 (via HTTP) at path '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
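For reference, that last line was found by grepping the agent log for the executor ID; on this DC/OS node the agent logs to the journal, so something along these lines works (the dcos-mesos-slave unit name is an assumption):
journalctl -u dcos-mesos-slave | grep 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' | tail -n 1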
So it seems the executor never exited. If we hit the agent's /containers endpoint, we get output which includes this executor container:
{ "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32", "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", "executor_name": "", "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000", "source": "", "statistics": { "cpus_limit": 0.1, "cpus_nr_periods": 17, "cpus_nr_throttled": 11, "cpus_system_time_secs": 0.02, "cpus_throttled_time_secs": 0.784142448, "cpus_user_time_secs": 0.09, "disk_limit_bytes": 10485760, "disk_used_bytes": 20480, "mem_anon_bytes": 11337728, "mem_cache_bytes": 0, "mem_critical_pressure_counter": 0, "mem_file_bytes": 0, "mem_limit_bytes": 33554432, "mem_low_pressure_counter": 0, "mem_mapped_file_bytes": 0, "mem_medium_pressure_counter": 0, "mem_rss_bytes": 11337728, "mem_swap_bytes": 0, "mem_total_bytes": 12013568, "mem_unevictable_bytes": 0, "timestamp": 1475792290.12373 }, "status": { "executor_pid": 9068, "network_infos": [ { "ip_addresses": [ { "ip_address": "9.0.1.34", "protocol": "IPv4" } ], "labels": {}, "name": "dcos", "port_mappings": [ { "container_port": 8080, "host_port": 24758, "protocol": "tcp" }, { "container_port": 8081, "host_port": 24759, "protocol": "tcp" } ] } ] } },
Looking through the output of ps on the agent, we can indeed locate the executor process:
$ ps aux | grep 9068
root 9068 0.0 0.1 96076 25380 ? Ss 20:23 0:00 mesos-containerizer launch --command={"arguments":["mesos-default-executor"],"shell":false,"user":"root","value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-default-executor"} --help=false --pipe_read=49 --pipe_write=50 --pre_exec_commands=[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"},{"shell":true,"value":"ifconfig lo up"}] --runtime_directory=/var/run/mesos/containers/bf38ff09-3da1-487a-8926-1f4cc88bce32 --unshare_namespace_mnt=false --user=root --working_directory=/var/lib/mesos/slave/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32
core 20406 0.0 0.0 6764 1020 pts/1 S+ 23:05 0:00 grep --colour=auto 9068
$ sudo cat /proc/9068/task/9068/children
9330
$ ps aux | grep 9330
root 9330 0.0 0.2 498040 32944 ? Sl 20:23 0:00 mesos-default-executor
root 19330 0.0 0.0 0 0 ? S 22:49 0:00 [kworker/0:2]
core 20573 0.0 0.0 6764 888 pts/1 S+ 23:07 0:00 grep --colour=auto 9330
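The same parent/child relationship (the mesos-containerizer launch helper at PID 9068 with mesos-default-executor at PID 9330 beneath it, per the /proc children listing above) can also be viewed at a glance with pstree, if available:
$ pstree -p 9068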
Looking at the executor's logs, we find stdout is:
Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"}'
Executing pre-exec command '{"shell":true,"value":"ifconfig lo up"}'
and stderr is:
I1006 20:23:28.817401 9330 executor.cpp:189] Version: 1.1.0
I1006 20:23:28.906349 9352 default_executor.cpp:123] Received SUBSCRIBED event
I1006 20:23:28.908797 9352 default_executor.cpp:127] Subscribed executor on 10.0.0.133
Given this short executor log, it seems possible that the agent received the kill request before any tasks had been delivered to the executor, and removed those tasks from its data structures without terminating the parent executor container.
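As a possible manual cleanup, and strictly a guess rather than something verified here, killing the orphaned default executor process may let the containerizer observe its termination and tear the container down:
# Unverified workaround: terminate the leaked mesos-default-executor
# (PID 9330 in the ps output above).
sudo kill 9330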
Issue Links
- relates to: MESOS-5380 Killing a queued task can cause the corresponding command executor to never terminate. (Resolved)