Details
Bug
Status: Accepted
Major
Resolution: Unresolved
1.2.3, 1.4.2, 1.5.1, 1.6.0
None
None
Description
This issue occurs when the Docker daemon is very slow or unresponsive.
Observed behaviour of the Docker executor:
- Agent launches the Docker executor, which calls `docker run` to launch a container.
- `docker inspect` hangs each time it is called, so the Docker executor retries in a loop without success.
- After 5 minutes, the framework (Marathon) sends the first `killTask` message, which interrupts the `docker inspect` retry loop.
- `killTask()` then launches the very first `docker stop`, which hangs.
- The framework sends the second `killTask()` after 20 seconds, which interrupts the first `docker stop` command.
- The framework continues to send `killTask()` every 20 seconds, but `docker stop` now always returns immediately with an error: "Error response from daemon: No such container: mesos-some-UID".
Since `docker run` hangs, the `reaped()` callback is never called. Thus, the Docker executor gets stuck in an infinite `docker stop` loop.
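
The sequence above can be illustrated with a small toy model. This is only a sketch of the reported behaviour, not the Mesos Docker executor's code: `HungDaemon`, `ToyExecutor`, `simulate` and their fields are hypothetical names invented for illustration. It models only the steady state (every `docker stop` fails immediately) and assumes, as the report states, that the task ends only once `reaped()` fires.

```python
from dataclasses import dataclass, field


@dataclass
class HungDaemon:
    """Toy stand-in for an unresponsive Docker daemon (hypothetical)."""
    container_known: bool = False  # the daemon never registered the container

    def stop(self, name: str) -> str:
        # Mirrors the error message quoted in the report.
        if not self.container_known:
            return f"Error response from daemon: No such container: {name}"
        return "stopped"


@dataclass
class ToyExecutor:
    """Simplified model of the executor's kill path; not Mesos code."""
    daemon: HungDaemon
    container: str = "mesos-some-UID"
    run_exited: bool = False   # `docker run` is hung, so this stays False
    terminated: bool = False
    stop_errors: list = field(default_factory=list)

    def reaped(self) -> None:
        # Only reached once `docker run` exits; this is what ends the task.
        self.terminated = True

    def kill_task(self) -> None:
        result = self.daemon.stop(self.container)
        if result != "stopped":
            self.stop_errors.append(result)
        if self.run_exited:
            self.reaped()


def simulate(kill_count: int) -> ToyExecutor:
    executor = ToyExecutor(HungDaemon())
    for _ in range(kill_count):  # the framework resends killTask every 20 s
        executor.kill_task()
    return executor


if __name__ == "__main__":
    executor = simulate(kill_count=5)
    print(executor.terminated, len(executor.stop_errors))
```

However many `killTask` messages arrive, `terminated` stays false in this model, because nothing on the kill path can trigger `reaped()` while `docker run` is hung.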