Details
Bug
Status: Resolved
Major
Resolution: Duplicate
1.6.0, 1.7.0
None
Mesosphere Sprint 2018-27
5
Description
We have a simple test that launches a pod with two containers (one writes in a file and the other reads it). This test is flaky because the container sometimes fails to start.
Marathon app definition:
{
  "id": "/simple-pod",
  "scaling": { "kind": "fixed", "instances": 1 },
  "environment": { "PING": "PONG" },
  "containers": [
    {
      "name": "ct1",
      "resources": { "cpus": 0.1, "mem": 32 },
      "image": { "kind": "DOCKER", "id": "busybox" },
      "exec": {
        "command": {
          "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        { "name": "v1", "mountPath": "test-v1" }
      ]
    },
    {
      "name": "ct2",
      "resources": { "cpus": 0.1, "mem": 32 },
      "exec": {
        "command": {
          "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        { "name": "v1", "mountPath": "etc" },
        { "name": "v2", "mountPath": "docker" }
      ]
    }
  ],
  "networks": [ { "mode": "host" } ],
  "volumes": [
    { "name": "v1" },
    { "name": "v2", "host": "/var/lib/docker" }
  ]
}
During the test, Marathon tries to launch the pod but never receives a TASK_RUNNING for the first container. After 2 minutes it decides to kill the pod, and the kill also fails.
The agent sandbox (attached to this ticket without the Docker layers, which are too big to attach) shows that one of the containers wasn't started properly. The last line in the agent log says:
Transitioning the state of container ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 from PREPARING to ISOLATING
Until then the log looks pretty unremarkable.
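To spot the same symptom in a longer agent log, one could track the last state transition reported for each container. The following is a minimal Python sketch, assuming the log is available one entry per line and that transitions use the message format quoted above. The function names are hypothetical helpers, not part of Mesos.

```python
import re

# Matches the agent log message quoted in this ticket, e.g.
# "Transitioning the state of container <id> from PREPARING to ISOLATING".
TRANSITION = re.compile(
    r"Transitioning the state of container (\S+) from (\w+) to (\w+)"
)


def last_container_states(lines):
    """Return {container_id: last state seen} for every container in the log."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            container_id, _previous, current = match.groups()
            states[container_id] = current
    return states


def stuck_containers(lines, stuck_state="ISOLATING"):
    """Return the sorted ids of containers whose last transition ended in stuck_state."""
    return sorted(
        container_id
        for container_id, state in last_container_states(lines).items()
        if state == stuck_state
    )
```

Calling `stuck_containers(open(path))` on an agent log (where `path` is wherever the log was saved) would list the containers that never moved past ISOLATING within that log.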
Afterwards, Marathon repeatedly tries to kill the container but doesn't succeed. The executor receives the requests but doesn't send anything back:
I0816 22:52:53.111995 4 default_executor.cpp:204] Received SUBSCRIBED event
I0816 22:52:53.112520 4 default_executor.cpp:208] Subscribed executor on 10.10.0.222
I0816 22:52:53.112783 4 default_executor.cpp:204] Received LAUNCH_GROUP event
I0816 22:52:53.116516 11 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.10.0.222
I0816 22:52:53.169596 4 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:52:53.194416 10 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:54:50.559470 8 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559496 8 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
I0816 22:54:50.559737 4 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559751 4 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
...
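To see how often each task was asked to die, the kill requests in an executor log like the excerpt above can be grouped per task. This is a rough Python sketch, assuming one log entry per line and the glog-style prefix visible in the excerpt (severity and date, time, thread id, `file:line]`). `kill_requests` is a hypothetical helper name, not a Mesos tool.

```python
import re
from collections import defaultdict

# glog-style prefix as seen in the excerpt:
# "I0816 22:54:50.559496 8 default_executor.cpp:1251] <message>"
LOG_LINE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+:\d+\] (?P<msg>.*)$"
)
# Message the default executor logs when it receives a kill for a task.
KILL = re.compile(r"Received kill for task '([^']+)'")


def kill_requests(lines):
    """Return {task_id: [timestamps]} for every kill request found in the log."""
    requests = defaultdict(list)
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if not parsed:
            continue  # skip truncation markers ("...") and non-glog lines
        kill = KILL.search(parsed.group("msg"))
        if kill:
            requests[kill.group(1)].append(parsed.group("time"))
    return dict(requests)
```

A task that accumulates many timestamps here while the log shows no other activity for it matches the behaviour described in this ticket: the executor keeps receiving kills and nothing follows.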
Relevant IDs for grepping the logs:
Marathon app id: /simple-pod-bcc8f180b611494aa972520b8b650ca9
Mesos task id: simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1
Mesos container id: e9b05652-e779-46f8-9b76-b2e1ce7e5940
Attachments
Issue Links
- duplicates MESOS-9151 Container stuck at ISOLATING due to FD leak (Resolved)