Details

Type: Bug
Priority: Critical
Status: Resolved
Resolution: Fixed
Description
The Mesos agent may restart for many reasons (e.g., the disk is full). Always assuming that the provisioner 'rootfses' dir exists can block the agent from recovering:
Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses': No such file or directory
This issue may occur due to a race between removing the provisioner container dir and the agent restarting:
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 'mesos_executors.slice'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's forked pid 11577 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 'mesos_executors.slice'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's forked pid 11579 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running command: /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the sandbox directory
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz' to '/var/lib/mesos/slave/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011/executors/hello__91922a16-889e-4e94-9dab-9f6754f091de/
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: Failed to fetch 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz': Error downloading resource: Failed writing received data to disk/application
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: End fetcher log for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: E0505 02:14:32.213114 11440 fetcher.cpp:558] Failed to run mesos-fetcher: Failed to fetch all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' with exit status: 256
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: E0505 02:14:32.213351 11444 slave.cpp:4642] Container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' for executor 'hello__91922a16-889e-4e94-9dab-9f6754f091de' of framework 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011 failed to start: Failed to fetch all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' with exit status: 256
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.213614 11443 containerizer.cpp:2071] Destroying container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac in FETCHING state
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.213977 11443 linux_launcher.cpp:505] Asked to destroy container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.214757 11443 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.216047 11444 cgroups.cpp:2692] Freezing cgroup /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.218407 11443 cgroups.cpp:1405] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after 2.326016ms
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.220391 11445 cgroups.cpp:2710] Thawing cgroup /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.222124 11445 cgroups.cpp:1434] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after 1.693952ms
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: E0505 02:14:32.239018 11441 fetcher.cpp:558] Failed to run mesos-fetcher: Failed to create 'stdout' file: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: E0505 02:14:32.239162 11442 slave.cpp:4642] Container 'a30b74d5-53ac-4fbf-b8f3-5cfba58ea847' for executor 'node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05' of framework 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008 failed to start: Failed to create 'stdout' file: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.239284 11445 containerizer.cpp:2071] Destroying container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 in FETCHING state
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.239390 11444 linux_launcher.cpp:505] Asked to destroy container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.240103 11444 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.241353 11440 cgroups.cpp:2692] Freezing cgroup /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.243120 11444 cgroups.cpp:1405] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after 1.726976ms
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.245045 11440 cgroups.cpp:2710] Thawing cgroup /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.246800 11440 cgroups.cpp:1434] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after 1.715968ms
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.285477 11438 slave.cpp:1625] Got assigned task 'dse-1-agent__720d6f09-9d60-4667-b224-abcd495e0e58' for framework 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0009
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: F0505 02:14:32.296481 11438 slave.cpp:6381] CHECK_SOME(state::checkpoint(path, info)): Failed to create temporary file: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: *** Check failure stack trace: ***
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: @ 0x7f5856be857d google::LogMessage::Fail()
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: @ 0x7f5856bea3ad google::LogMessage::SendToLog()
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: @ 0x7f5856be816c google::LogMessage::Flush()
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: @ 0x7f5856beaca9 google::LogMessageFatal::~LogMessageFatal()
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: @ 0x7f5855e4b5e9 _CheckFatal::~_CheckFatal()
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.314082 11445 containerizer.cpp:2434] Container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac has exited
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.314826 11440 containerizer.cpp:2434] Container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 has exited
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316660 11439 container_assigner.cpp:101] Unregistering container_id[value: "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"].
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316761 11474 container_assigner_strategy.cpp:202] Closing ephemeral-port reader for container[value: "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"] at endpoint[198.51.100.1:34273].
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316804 11474 container_reader_impl.cpp:38] Triggering ContainerReader shutdown
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316833 11474 sync_util.hpp:39] Dispatching and waiting <=5s for ticket 7: ~ContainerReaderImpl:shutdown
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316769 11439 container_assigner.cpp:101] Unregistering container_id[value: "a30b74d5-53ac-4fbf-b8f3-5cfba58ea847"].
May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316864 11474 container_reader
In provisioner recovery, when listing the container rootfses, it is possible that the 'rootfses' dir does not exist, because of a possible race between the provisioner destroy and the agent restart: for instance, the agent restarts while the provisioner is destroying the container dir. Since os::rmdir() removes recursively by traversing the FTS tree, the 'rootfses' dir may already have been removed while the others (e.g., the scratch dir) have not.
Currently, we return an error if the 'rootfses' dir does not exist, which blocks the agent from recovering. Instead, we should skip the container's rootfses when the 'rootfses' dir does not exist; a sketch of the intended behavior follows.
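Below is a minimal, self-contained C++ sketch of that behavior. It is illustrative only and is not the actual Mesos provisioner code or its stout helpers: the listRootfses function and the example backend path are assumptions; the point is simply that a missing 'rootfses' directory yields an empty result instead of a recovery-blocking error.

// Illustrative sketch only -- NOT the actual Mesos provisioner code.
// 'listRootfses' and the example path in main() are hypothetical.
#include <dirent.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Lists the entries under <backendDir>/rootfses. If the 'rootfses' dir is
// missing (e.g., an interrupted provisioner destroy removed it before the
// agent restarted), return an empty list instead of an error so that
// recovery can proceed.
std::vector<std::string> listRootfses(const std::string& backendDir)
{
  const std::string rootfsesDir = backendDir + "/rootfses";

  DIR* dir = ::opendir(rootfsesDir.c_str());
  if (dir == nullptr) {
    if (errno == ENOENT) {
      return {};  // Skip: nothing to recover for this backend.
    }
    throw std::runtime_error(
        "Failed to opendir '" + rootfsesDir + "': " + std::strerror(errno));
  }

  std::vector<std::string> rootfses;
  for (struct dirent* entry = ::readdir(dir);
       entry != nullptr;
       entry = ::readdir(dir)) {
    const std::string name = entry->d_name;
    if (name != "." && name != "..") {
      rootfses.push_back(name);
    }
  }

  ::closedir(dir);
  return rootfses;
}

int main()
{
  // Hypothetical backend dir; with the directory absent this prints nothing
  // rather than failing, which mirrors the desired recovery behavior.
  const std::string backendDir =
    "/var/lib/mesos/slave/provisioner/containers/<container_id>/backends/overlay";

  for (const std::string& rootfs : listRootfses(backendDir)) {
    std::cout << rootfs << std::endl;
  }

  return 0;
}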