Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 1.7.3
- Fix Version/s: None
- Component/s: None
Description
Hello all,
The documentation about Agent Recovery shows two conditions for the recovery to be possible:
- The agent must have recovery enabled (default true?);
- The scheduler must register itself saying that it has checkpointing enabled.
In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon as a checkpoint-enabled scheduler:
$ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
{
  "name": "asgard-chronos",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
  "checkpoint": true,
  "active": true
}
{
  "name": "marathon",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
  "checkpoint": true,
  "active": true
}
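The same check can be scripted. Below is a minimal Python sketch of the jq filter above; the `state` payload is a trimmed stand-in for the real /state response, which carries many more fields per framework:

```python
import json

# Trimmed stand-in for the master's /state response (illustrative data).
state = json.loads("""
{
  "frameworks": [
    {"name": "asgard-chronos", "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
     "checkpoint": true, "active": true},
    {"name": "marathon", "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
     "checkpoint": true, "active": true}
  ]
}
""")

# Only frameworks that registered with checkpointing enabled are
# eligible for executor recovery after an agent restart.
checkpointed = [f["name"] for f in state["frameworks"] if f.get("checkpoint")]
print(checkpointed)
```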
Here is what I'm using:
- Mesos Master: 1.4.1
- Mesos Agent 1.7.3
- Using docker image mesos/mesos-centos:1.7.x
- Docker sock mounted from the host
- Docker binary also mounted from the host
- Marathon: 1.4.12
- Docker
Client: Docker Engine - Community
 Version: 19.03.5
 API version: 1.39 (downgraded from 1.40)
 Go version: go1.12.12
 Git commit: 633a0ea838
 Built: Wed Nov 13 07:22:05 2019
 OS/Arch: linux/amd64
 Experimental: false
Server: Docker Engine - Community
 Engine:
  Version: 18.09.2
  API version: 1.39 (minimum version 1.12)
  Go version: go1.10.6
  Git commit: 6247962
  Built: Sun Feb 10 03:42:13 2019
  OS/Arch: linux/amd64
  Experimental: false
The problem
Here is the Marathon test app, a simple sleep 99d based on the debian Docker image.
{
  "id": "/sleep",
  "cmd": "sleep 99d",
  "cpus": 0.1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "constraints": [],
  "acceptedResourceRoles": ["*"],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "debian",
      "network": "HOST",
      "privileged": false,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "labels": {},
  "portDefinitions": []
}
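For reference, an app definition like this can also be built and submitted programmatically against Marathon's /v2/apps endpoint. A minimal sketch; the Marathon URL is a placeholder and the actual POST is left commented out:

```python
import json
from urllib import request

# The test app from above, expressed as a Python dict (trimmed to the
# fields that matter for this report).
app = {
    "id": "/sleep",
    "cmd": "sleep 99d",
    "cpus": 0.1,
    "mem": 128,
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "debian", "network": "HOST", "forcePullImage": True},
    },
}

body = json.dumps(app).encode()

# Submitting would look like this (placeholder URL, not executed here):
# req = request.Request("http://marathon.example:8080/v2/apps", data=body,
#                       headers={"Content-Type": "application/json"})
# request.urlopen(req)

print(json.loads(body)["cmd"])
```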
This task runs fine and gets scheduled on the right agent, which is running Mesos agent 1.7.3 (using the docker image mesos/mesos-centos:1.7.x).
Here is a sample log:
mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to get resource statistics for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run '/usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1; stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:24.094985 19853 docker.cpp:792] Checkpointing pid 12435 to '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
mesos-slave_1 | I1205 13:24:24.343099 19848 slave.cpp:4839] Got registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources passed to update are identical to existing resources
mesos-slave_1 | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.363222 19853 task_status_update_manager.cpp:328] Received task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364115 19853 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding the update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
mesos-slave_1 | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending acknowledgement for status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.401759 19846 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.401926 19846 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.482340 19848 task_status_update_manager.cpp:328] Received task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.482550 19848 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding the update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
mesos-slave_1 | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending acknowledgement for status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.557307 19852 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.557467 19852 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
An important part of this log:
Sending acknowledgement for status update TASK_RUNNING for task sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
Here we have the address of the executor responsible for our newly created task: executor(1)@10.234.172.56:16653.
Looking at the OS process list, we see one instance of mesos-docker-executor:
$ ps aux | grep executor
root 12435 0.5 1.3 851796 54100 ? Ssl 10:24 0:03 mesos-docker-executor --cgroups_enable_cfs=true --container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --docker=/usr/bin/docker --docker_socket=/var/run/docker.sock --help=false --initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos --logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox --quiet=false --sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923 --stop_timeout=10secs
root 12456 0.0 0.8 152636 34548 ? Sl 10:24 0:00 /usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 --cpu-quota 10000 --memory 134217728 -e HOST=10.234.172.56 -e MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e MARATHON_APP_RESOURCE_CPUS=0.1 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=128.0 -e MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox --net host --entrypoint /bin/sh --name mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --label=hollowman.appname=/asgard/sleep --label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c sleep 99d
root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
Here we have three processes:
- The Mesos executor;
- The container that this executor started, running sleep 99d;
- A tail -f following this executor's stderr.
The stderr content is as follows:
$ tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor on 10.234.172.56
I1205 13:24:24.350291 12449 executor.cpp:186] Starting task sleep.8c187c41-1762-11ea-a2e5-02429217540f
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
While the Mesos agent is still running, it is possible to connect to this executor's port:
$ telnet 10.234.172.56 16653
Trying 10.234.172.56...
Connected to 10.234.172.56.
Escape character is '^]'.
^]
telnet> Connection closed.
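The telnet check above is just a TCP connect probe against the executor's libprocess port. The same probe can be sketched in Python; the demo below targets a throwaway local listener instead of the real executor address, so it is self-contained:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Stand-in for the executor: a throwaway local listener.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))     # pick a free ephemeral port
listener.listen(1)
port = listener.getsockname()[1]

print(port_open("127.0.0.1", port))  # listener up -> True

listener.close()
print(port_open("127.0.0.1", port))  # listener gone -> connection refused -> False
```

Against a real cluster you would call port_open("10.234.172.56", 16653) instead.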
As soon as the agent shuts down (a simple docker stop), this line appears in the executor's stderr log:
I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
And now, looking at the OS process list, the executor process is gone:
$ ps aux | grep executor
root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
and we can't telnet to that port anymore:
$ telnet 10.234.172.56 16653
Trying 10.234.172.56...
telnet: Unable to connect to remote host: Connection refused
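Besides probing the port, one can also check whether the executor pid that the agent checkpointed earlier (forked.pid, pid 12435 in the startup log) still exists. A rough, illustrative Python sketch of such a pid liveness probe (not the agent's actual logic; it demonstrates the idea with this script's own pid and a reaped child):

```python
import os
import subprocess

def pid_alive(pid: int) -> bool:
    """Probe a pid with signal 0: checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

# Our own process is certainly alive.
print(pid_alive(os.getpid()))  # True

# A child that has exited and been reaped no longer exists.
child = subprocess.Popen(["sleep", "0"])
child.wait()
print(pid_alive(child.pid))    # False
```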
When we run the Mesos Agent again, we see this:
mesos-slave_1 | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering Docker containers
mesos-slave_1 | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list of Docker containers
mesos-slave_1 | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to 10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor for container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
mesos-slave_1 | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
mesos-slave_1 | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished processing orphaned Docker containers
mesos-slave_1 | I1205 13:42:44.017102 18421 docker.cpp:2385] Running docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering executors
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
mesos-slave_1 | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
mesos-slave_1 | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.024247 18425 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:44.024363 18425 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up un-reregistered executors
mesos-slave_1 | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished recovery
A new container is then scheduled to run, and Marathon shows a TASK_FAILED status with the message "Container terminated" for the previous task.
Note in particular the part where the agent tries to reconnect to the executor's port:
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
Looking at the Mesos agent recovery documentation, the setup seems very straightforward. I'm using the default mesos-docker-executor to run all Docker tasks. Is there any detail I'm not seeing, or something misconfigured in my setup?
Can anyone confirm that the mesos-docker-executor is capable of doing task recovery?
Thank you very much and if you need any additional information, let me know.
Thanks,