Details
Description
We observed this during internal scale testing.
When launching 300+ Docker containers on a single agent box, the Docker containerizer actor can get backlogged. As a result, API calls like `GET_CONTAINERS` become unresponsive. The backlog also blocks the Mesos containerizer from launching containers if the agent is started with `--containerizers=docker,mesos`, because the composing containerizer invokes the Docker containerizer launch first (and that call gets queued).
Profiling results show that the bottleneck is `os::killtree`, which is invoked when the Docker commands are discarded (e.g., on client disconnect).
For this particular case, `os::killtree` is not really necessary because the docker command does not fork additional subprocesses. If we use the argv version of `subprocess` to launch docker commands, we can simply use `os::kill` instead. We confirmed that switching to `os::kill` makes the performance issue go away, and the agent can easily scale up to 300+ containers.
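The idea can be sketched with plain POSIX calls. This is not the Mesos or stout code. `launchArgv` and `killAndReap` are hypothetical helpers used only for illustration. The sketch assumes the command is started from an argv vector with no intermediate shell, so the returned pid is the command itself and a single `kill` is enough; no walk over descendants is needed.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <string>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: launch a command from an argv vector. No shell is
// involved, so the returned pid belongs to the command itself.
// Returns -1 if fork fails.
pid_t launchArgv(const std::vector<std::string>& args)
{
  assert(!args.empty());

  // Build the null-terminated argv array before forking.
  std::vector<char*> argv;
  for (const std::string& arg : args) {
    argv.push_back(const_cast<char*>(arg.c_str()));
  }
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid == 0) {
    execvp(argv[0], argv.data());
    _exit(127); // Only reached if exec fails.
  }

  return pid;
}

// Hypothetical helper: signal exactly one process and reap it.
// Returns true if the child was terminated by the given signal.
bool killAndReap(pid_t pid, int sig = SIGKILL)
{
  // Refuse non-positive pids: kill(-1, ...) would signal every process
  // the caller is allowed to signal.
  if (pid <= 0 || kill(pid, sig) != 0) {
    return false;
  }

  int status = 0;
  pid_t reaped;
  do {
    reaped = waitpid(pid, &status, 0);
  } while (reaped == -1 && errno == EINTR);

  return reaped == pid && WIFSIGNALED(status) && WTERMSIG(status) == sig;
}
```

For example, `killAndReap(launchArgv({"sleep", "30"}))` starts a long-running command and terminates it with one signal. If the command were launched through a shell instead, the pid would belong to the shell and the real command would be a child of it, which is the situation where a tree kill is needed.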
Attachments
Issue Links
- relates to
  - MESOS-9279 Docker Containerizer 'usage' call might be expensive if mount table is big. (Resolved)
  - MESOS-9268 Hitting agent's `/containers` endpoint might backlog Docker containerizer process. (Resolved)