Description
As of version 1.12, Docker changed its default mount propagation to "shared" to enable persistent volume plugins. However, Docker has a known issue (https://github.com/moby/moby/issues/25718) in which it sometimes leaks its mount namespace to other processes, which can cause Mesos agents to fail to remove Docker containers during recovery. The following log shows such a failure:
I0615 09:39:11.083787 4573 docker.cpp:1002] Skipping recovery of executor 'kafka__7e49099d-7ab4-4435-a94a-1e849b8f2b70' of framework 44cbe3e9-984d-4073-b523-0023b427f54d-0011 because its executor is not marked as docker and the docker container doesn't exist
Failed to perform recovery: Collect failed: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock rm -v 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d': exited with status 1; stderr='Error response from daemon: Driver overlay failed to remove root filesystem 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d: remove /var/lib/docker/overlay/221725ec545d60492b5431bb49380d868f7a949aaa3acff49f7ffb5bddeb3385/merged: device or resource busy '
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
        This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
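To see which processes still hold the leaked mount, one option is to take the busy path from the error message and look for it in each process's mount table. The following is a minimal sketch, not part of Mesos or Docker. It assumes a Linux host where /proc/<pid>/mountinfo is readable (typically as root), and that the leaked mount appears at the same path in the other process's mount namespace. The names busy_path_from_error and pids_holding_mount are hypothetical helpers introduced for illustration.

```python
import os
import re

# Matches the path in "remove <path>: device or resource busy",
# the form seen in the Docker error quoted in this ticket.
BUSY_PATH_RE = re.compile(r"remove (?P<path>/\S+): device or resource busy")


def busy_path_from_error(message):
    """Return the path Docker reported as busy, or None if nothing matches."""
    match = BUSY_PATH_RE.search(message)
    return match.group("path") if match else None


def pids_holding_mount(mount_path, proc_root="/proc"):
    """Return sorted PIDs whose mountinfo has a mount point at or below mount_path.

    proc_root is a parameter only so the function can be pointed at a
    test directory; on a real host it would stay at "/proc".
    """
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        info_file = os.path.join(proc_root, entry, "mountinfo")
        try:
            with open(info_file, errors="replace") as f:
                for line in f:
                    fields = line.split()
                    # The fifth field (index 4) of a mountinfo line is the mount point.
                    if len(fields) > 4 and (
                        fields[4] == mount_path
                        or fields[4].startswith(mount_path + "/")
                    ):
                        holders.append(int(entry))
                        break
        except OSError:
            continue  # process exited or its mountinfo is not readable
    return sorted(holders)
```

Passing the stderr text from the failed `docker rm` to busy_path_from_error and the result to pids_holding_mount gives a list of candidate PIDs to inspect. Whether stopping or restarting those processes is safe depends on the host and is not covered here.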
Issue Links
- is related to MESOS-7801 Retry logic for unsuccessful `docker rm` during agent recovery (Accepted)