[MESOS-9162] Unkillable pod container stuck in ISOLATING - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.0, 1.7.0
Fix Version/s: None
Component/s: containerization
Labels: container-stuck
Sprint: Mesosphere Sprint 2018-27
Story Points: 5

Description

We have a simple test that launches a pod with two containers (one writes to a file and the other reads it). This test is flaky because one of the containers sometimes fails to start.
Marathon app definition:

{
  "id": "/simple-pod",
  "scaling": {
    "kind": "fixed",
    "instances": 1
  },
  "environment": {
    "PING": "PONG"
  },
  "containers": [
    {
      "name": "ct1",
      "resources": {
        "cpus": 0.1,
        "mem": 32
      },
      "image": {
        "kind": "DOCKER",
        "id": "busybox"
      },
      "exec": {
        "command": {
          "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        {
          "name": "v1",
          "mountPath": "test-v1"
        }
      ]
    },
    {
      "name": "ct2",
      "resources": {
        "cpus": 0.1,
        "mem": 32
      },
      "exec": {
        "command": {
          "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        {
          "name": "v1",
          "mountPath": "etc"
        },
        {
          "name": "v2",
          "mountPath": "docker"
        }
      ]
    }
  ],
  "networks": [
    {
      "mode": "host"
    }
  ],
  "volumes": [
    {
      "name": "v1"
    },
    {
      "name": "v2",
      "host": "/var/lib/docker"
    }
  ]
}

During the test, Marathon tries to launch the pod but never receives a TASK_RUNNING for the first container. After 2 minutes it decides to kill the pod, and the kill also fails.

The agent sandbox (attached to this ticket without the Docker layers, which are too big to attach) shows that one of the containers wasn't started properly. The last line in the agent log says:

Transitioning the state of container ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 from PREPARING to ISOLATING

Until then the log looks pretty unremarkable.
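
The state-transition message quoted above can be matched mechanically to spot containers whose last recorded transition was into ISOLATING. This is a minimal sketch, assuming the agent log contains lines with exactly that wording; the helper names `last_states` and `stuck_in` are hypothetical and not part of Mesos.

```python
import re

# Matches the agent's state-transition message as quoted in this ticket.
TRANSITION = re.compile(
    r"Transitioning the state of container (?P<id>\S+) "
    r"from (?P<old>\w+) to (?P<new>\w+)"
)


def last_states(lines):
    """Map each container id to the last state it transitioned into."""
    states = {}
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            states[m.group("id")] = m.group("new")
    return states


def stuck_in(lines, state="ISOLATING"):
    """Return ids of containers whose last logged transition was into `state`."""
    return sorted(cid for cid, s in last_states(lines).items() if s == state)
```

Feeding the agent log through `stuck_in` would list the nested container id from this ticket if no later transition was logged for it.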

Afterwards, Marathon repeatedly tries to kill the container but doesn't succeed. The executor receives the requests but doesn't send anything back:

I0816 22:52:53.111995     4 default_executor.cpp:204] Received SUBSCRIBED event
I0816 22:52:53.112520     4 default_executor.cpp:208] Subscribed executor on 10.10.0.222
I0816 22:52:53.112783     4 default_executor.cpp:204] Received LAUNCH_GROUP event
I0816 22:52:53.116516    11 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.10.0.222
I0816 22:52:53.169596     4 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:52:53.194416    10 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:54:50.559470     8 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559496     8 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
I0816 22:54:50.559737     4 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559751     4 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
...
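
To check the timing in the executor log above, the timestamps can be parsed and subtracted. The sketch below assumes the glog-style prefix seen in the excerpt (severity letter, `mmdd`, time with microseconds, thread id, `file:line]`); the log carries no year, so `year=2018` is an assumption taken from the ticket dates, and the function names are hypothetical.

```python
import re
from datetime import datetime

# glog-style prefix as it appears in the executor log excerpt.
GLOG = re.compile(
    r"^[IWEF](?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<src>\S+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def first_event_time(lines, event, year=2018):
    """Timestamp of the first 'Received <event> event' line, or None."""
    wanted = f"Received {event} event"
    for line in lines:
        m = GLOG.match(line)
        if m and m.group("msg") == wanted:
            stamp = f"{year}{m.group('date')} {m.group('time')}"
            return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")
    return None


def seconds_between(lines, first, second, year=2018):
    """Seconds from the first `first` event to the first `second` event."""
    start = first_event_time(lines, first, year)
    end = first_event_time(lines, second, year)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()
```

For the excerpt above, `seconds_between(lines, "LAUNCH_GROUP", "KILL")` comes to roughly 117 seconds, which matches the 2-minute timeout described earlier.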

Relevant IDs for grepping the logs:

Marathon app id: /simple-pod-bcc8f180b611494aa972520b8b650ca9
Mesos task id: simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1
Mesos container id: e9b05652-e779-46f8-9b76-b2e1ce7e5940
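
The IDs listed above can be used to filter the attached logs without unpacking them by hand. This is a rough sketch, assuming the `.gz` attachments are plain gzip-compressed text; `iter_log` and `matching_lines` are hypothetical helper names.

```python
import gzip

# IDs copied from this ticket.
IDS = (
    "simple-pod-bcc8f180b611494aa972520b8b650ca9",
    "e9b05652-e779-46f8-9b76-b2e1ce7e5940",
)


def iter_log(path):
    """Yield lines from a log file, transparently reading .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")


def matching_lines(lines, needles=IDS):
    """Keep only the lines that mention at least one of the given IDs."""
    return [line for line in lines if any(n in line for n in needles)]
```

For example, `matching_lines(iter_log("dcos-mesos-slave.service.gz"))` would return the agent log lines that mention the pod or the container. The task id does not need its own entry because it contains the app id as a prefix.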

Attachments

- sandbox_10_10_0_222_var_lib.tar.gz (332 kB, 17/Aug/18 13:21, A. Dukhovniy)
- diagnostics.zip (7.56 MB, 17/Aug/18 13:29, A. Dukhovniy)
- dcos-mesos-slave.service.gz (522 kB, 17/Aug/18 13:14, A. Dukhovniy)
- dcos-mesos-master.service.gz (667 kB, 17/Aug/18 13:17, A. Dukhovniy)
- dcos-marathon.service.log (3.39 MB, 17/Aug/18 13:13, A. Dukhovniy)

Issue Links

duplicates: MESOS-9151 Container stuck at ISOLATING due to FD leak (Resolved)

People

Assignee: Gilbert Song

Reporter: A. Dukhovniy

Shepherd: Qian Zhang

Votes: 0

Watchers: 3

Dates

Created: 17/Aug/18 13:11

Updated: 29/Aug/18 20:51

Resolved: 29/Aug/18 20:51