Details
Description
We had an issue in which custom Cassandra executors (which do not use a container image and thus run on the host filesystem) had their persistent volume data wiped out.
Upon revisiting the logs, we found the following suspicious lines:
{noformat}
I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max allowed age: 4.764742265646493days
I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429704593days
I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429587852days
I0424 02:06:11.717183 10994 gc.cpp:133] Deleting /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:11.727033 10994 gc.cpp:146] Deleted '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
I0424 02:06:11.727094 10994 gc.cpp:133] Deleting /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume: Device or resource busy
E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149: Directory not empty
E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs: Directory not empty
E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a: Directory not empty
E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors: Directory not empty
E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004: Directory not empty
E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks: Directory not empty
E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44: Directory not empty
{noformat}
(I can try to provide more logs, depending on how much the local archive retains after rotation.)
This happened on a 1.3.1 agent, although I suspect it is not specific to that version.
The path /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume is a bind mount to a persistent volume. The fact that the agent GC touched that path makes me believe this is what triggered the data loss: presumably the recursive delete descended into the bind mount and unlinked the files on the underlying volume, even though removing the mount point itself failed with "Device or resource busy".
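For what it's worth, one way to confirm that the volume was still bind-mounted inside the orphaned sandbox when GC ran is to scan /proc/self/mountinfo for mount points under the agent work directory. Below is a minimal diagnostic sketch on my side (not Mesos code; the /var/lib/mesos/slaves prefix and the helper name are just illustrative assumptions):
{code:cpp}
// Diagnostic sketch: list every mount point that lives under a given
// directory by scanning /proc/self/mountinfo on Linux.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Returns the mount points found under `prefix`. The mount point is the
// 5th whitespace-separated field of each /proc/self/mountinfo line.
std::vector<std::string> mountPointsUnder(const std::string& prefix)
{
  std::vector<std::string> result;
  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;
  while (std::getline(mountinfo, line)) {
    std::istringstream fields(line);
    std::string field, mountPoint;
    for (int i = 0; i < 5 && fields >> field; ++i) {
      mountPoint = field;
    }
    if (mountPoint.rfind(prefix, 0) == 0) {  // starts_with(prefix)
      result.push_back(mountPoint);
    }
  }
  return result;
}

int main()
{
  // Hypothetical: the agent work_dir subtree from the logs above.
  for (const std::string& m : mountPointsUnder("/var/lib/mesos/slaves")) {
    std::cout << "still mounted under the sandbox tree: " << m << std::endl;
  }
  return 0;
}
{code}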
We had some misconfigurations in our fleet, and I do not yet know whether the previous slave (with ID 70279b87-553a-4213-a85b-46fdc191849d-S4) was shut down cleanly.
My questions/suggestions:
1) If an executor is asked to shut down by a new agent (with a new ID), how much of the persistent volume cleanup code will be executed (especially if the new agent does not even know that this executor is supposed to be running)?
2) Should we figure out a way to better protect bind mounts to persistent volumes in slave/gc.cpp (for instance, skip any path that looks dangerous; see the sketch below), to prevent similar problems?
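To make suggestion (2) concrete, here is a rough sketch (not actual Mesos code) of the kind of guard a GC-style deleter could apply before a recursive delete: refuse to remove any directory that still has a mount point at or below it. In practice the mount set would come from /proc/self/mountinfo (as in the earlier sketch); the paths below are hypothetical stand-ins for the ones in the logs:
{code:cpp}
// Sketch of a "skip dangerous paths" guard for a GC-style recursive delete.
#include <iostream>
#include <set>
#include <string>

// True if `path` itself is a mount point or contains one.
bool containsMountPoint(const std::string& path,
                        const std::set<std::string>& mountPoints)
{
  for (const std::string& m : mountPoints) {
    if (m == path || m.rfind(path + "/", 0) == 0) {
      return true;
    }
  }
  return false;
}

int main()
{
  // Hypothetical mount table entry resembling the incident.
  const std::set<std::string> mounts = {
      "/var/lib/mesos/slaves/S44/frameworks/F/executors/E/runs/R/volume"};

  const std::string gcPath = "/var/lib/mesos/slaves/S44";

  if (containsMountPoint(gcPath, mounts)) {
    // A safer GC would log and skip (or unmount first) instead of
    // recursing into the bind-mounted persistent volume.
    std::cout << "refusing to delete " << gcPath
              << ": it still contains a mount point" << std::endl;
  } else {
    std::cout << "safe to delete " << gcPath << std::endl;
  }
  return 0;
}
{code}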
Issue Links
- is related to MESOS-2367 Improve slave resiliency in the face of orphan containers (Resolved)
- relates to MESOS-9352 Data in persistent volume deleted accidentally when using Docker container and Persistent volume (Resolved)