Details
- Type: Improvement
- Status: Reviewable
- Priority: Blocker
- Resolution: Unresolved
- None
- None
- None
- Storage: RI-18 54
- 5
Description
When a CSI plugin crashes, the container daemon in SLRP resets its corresponding csi::Client service future. However, if a CSI call races with a plugin crash, the call may be issued before the service future is reset, and that CSI call then fails. MESOS-9517 partly addresses this for CreateVolume and DeleteVolume calls, but a failed call in the SLRP recovery path (e.g., ListVolumes, GetCapacity, or Probe) could still make the SLRP unrecoverable.
There are two main issues:
1. For Probe, we should investigate whether a few retry attempts are needed. If the retries also fail, we should recover from the failed attempts (e.g., by killing the plugin container) and have the container daemon relaunch the plugin instead of failing the daemon.
2. For other calls in the recovery path, we should either retry the call or enable the local resource provider daemon to restart the SLRP after it fails.
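A rough sketch of the retry-then-recover idea from item 1, for discussion only. It is not the SLRP or csi::Client code: `CallResult`, `ProbeOutcome`, `probeWithRecovery`, and the `killPlugin` callback are hypothetical names, the calls are modeled as synchronous functions rather than futures, and there is no backoff between attempts.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result of a single CSI call attempt (not a Mesos type).
enum class CallResult { OK, PLUGIN_UNAVAILABLE };

// Hypothetical result of the whole probe-with-recovery procedure.
enum class ProbeOutcome { READY, RELAUNCH_REQUESTED };

// Calls `probe` up to `maxAttempts` times. If every attempt fails,
// invokes `killPlugin` so that a supervising daemon (not modeled here)
// can relaunch the plugin, instead of treating the failure as terminal.
inline ProbeOutcome probeWithRecovery(
    const std::function<CallResult()>& probe,
    const std::function<void()>& killPlugin,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (probe() == CallResult::OK) {
      return ProbeOutcome::READY;
    }
  }

  // All attempts failed: recover by killing the plugin container and
  // report that a relaunch is expected rather than failing outright.
  killPlugin();
  return ProbeOutcome::RELAUNCH_REQUESTED;
}
```

The same bounded-retry shape could presumably wrap the other recovery-path calls from item 2, with the fallback being a restart of the SLRP rather than a plugin kill.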
Issue Links
- blocks
  - MESOS-9130 Test `StorageLocalResourceProviderTest.ROOT_ContainerTerminationMetric` is flaky. (Resolved)
- is related to
  - MESOS-9223 Storage local provider does not sufficiently handle container launch failures or errors (Resolved)
- relates to
  - MESOS-9517 SLRP should treat gRPC timeouts as non-terminal errors, instead of reporting OPERATION_FAILED. (Resolved)