Details
Description
We had an issue in which custom Cassandra executors (which do not use a container image and thus run on the host filesystem) had their persistent volume data wiped out.
Upon revisiting the logs, we found the following suspicious lines:
{noformat}
I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max allowed age: 4.764742265646493days
I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429704593days
I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429587852days
I0424 02:06:11.717183 10994 gc.cpp:133] Deleting /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:11.727033 10994 gc.cpp:146] Deleted '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
I0424 02:06:11.727094 10994 gc.cpp:133] Deleting /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume: Device or resource busy
E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149: Directory not empty
E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs: Directory not empty
E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a: Directory not empty
E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors: Directory not empty
E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004: Directory not empty
E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks: Directory not empty
E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44: Directory not empty
{noformat}
(I can try to provide more logs, depending on how much the local archive retains after rotation.)
This happened on a 1.3.1 agent, although I suspect it is not specific to that version.
The path /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume is a bind mount to a persistent volume. The fact that the agent GC touched that path makes me believe this is what triggered the data loss: presumably the recursive delete descended into the bind mount and unlinked the files on the underlying volume, even though removing the mount point itself failed with "Device or resource busy".
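For what it's worth, one way to confirm that the volume was still bind-mounted inside the orphaned sandbox when GC ran is to scan /proc/self/mountinfo for mount points under the agent work directory. Below is a minimal diagnostic sketch on my side (not Mesos code; the /var/lib/mesos/slaves prefix and the helper name are just illustrative assumptions):
{code:cpp}
// Diagnostic sketch: list every mount point that lives under a given
// directory by scanning /proc/self/mountinfo on Linux.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Returns the mount points found under `prefix`. The mount point is the
// 5th whitespace-separated field of each /proc/self/mountinfo line.
std::vector<std::string> mountPointsUnder(const std::string& prefix)
{
  std::vector<std::string> result;
  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;
  while (std::getline(mountinfo, line)) {
    std::istringstream fields(line);
    std::string field, mountPoint;
    for (int i = 0; i < 5 && fields >> field; ++i) {
      mountPoint = field;
    }
    if (mountPoint.rfind(prefix, 0) == 0) {  // starts_with(prefix)
      result.push_back(mountPoint);
    }
  }
  return result;
}

int main()
{
  // Hypothetical: the agent work_dir subtree from the logs above.
  for (const std::string& m : mountPointsUnder("/var/lib/mesos/slaves")) {
    std::cout << "still mounted under the sandbox tree: " << m << std::endl;
  }
  return 0;
}
{code}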
We had some misconfigurations in our fleet, and I do not yet know whether the previous slave (with ID 70279b87-553a-4213-a85b-46fdc191849d-S4) was shut down cleanly.
My questions/suggestions:
1) If an executor is asked to shut down by a new agent (with a new ID), how much of the persistent volume cleanup code will be executed (especially if the new agent does not even know that this executor is supposed to be running)?
2) Should we figure out a way to better protect bind mounts to persistent volumes in slave/gc.cpp (for instance, skip any path that looks dangerous; see the sketch below), to prevent similar problems?
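To make suggestion (2) concrete, here is a rough sketch (not actual Mesos code) of the kind of guard a GC-style deleter could apply before a recursive delete: refuse to remove any directory that still has a mount point at or below it. In practice the mount set would come from /proc/self/mountinfo (as in the earlier sketch); the paths below are hypothetical stand-ins for the ones in the logs:
{code:cpp}
// Sketch of a "skip dangerous paths" guard for a GC-style recursive delete.
#include <iostream>
#include <set>
#include <string>

// True if `path` itself is a mount point or contains one.
bool containsMountPoint(const std::string& path,
                        const std::set<std::string>& mountPoints)
{
  for (const std::string& m : mountPoints) {
    if (m == path || m.rfind(path + "/", 0) == 0) {
      return true;
    }
  }
  return false;
}

int main()
{
  // Hypothetical mount table entry resembling the incident.
  const std::set<std::string> mounts = {
      "/var/lib/mesos/slaves/S44/frameworks/F/executors/E/runs/R/volume"};

  const std::string gcPath = "/var/lib/mesos/slaves/S44";

  if (containsMountPoint(gcPath, mounts)) {
    // A safer GC would log and skip (or unmount first) instead of
    // recursing into the bind-mounted persistent volume.
    std::cout << "refusing to delete " << gcPath
              << ": it still contains a mount point" << std::endl;
  } else {
    std::cout << "safe to delete " << gcPath << std::endl;
  }
  return 0;
}
{code}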
Issue Links
- is related to MESOS-2367 Improve slave resiliency in the face of orphan containers (Resolved)
- relates to MESOS-9352 Data in persistent volume deleted accidentally when using Docker container and Persistent volume (Resolved)