When a CSI plugin crashes, the container daemon in SLRP will reset its corresponding csi::Client service future. However, if a CSI call races with a plugin crash, the call may be issued before the service future is reset, resulting in a failure for that CSI call.
MESOS-9517 partly addresses this for CreateVolume and DeleteVolume calls, but a failed call in the SLRP recovery path (e.g., ListVolumes, GetCapacity, Probe) could still make the SLRP unrecoverable.
There are two main issues:
1. For Probe, we should investigate whether a few retry attempts are needed. If the attempts still fail, we should recover from the failure (e.g., kill the plugin container) and have the container daemon relaunch the plugin instead of failing the daemon.
2. For other calls in the recovery path, we should either retry the call or enable the local resource provider daemon to restart the SLRP after it fails.
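
The retry-then-recover idea in the two items above could look roughly like the following. This is a minimal Python sketch for illustration only, not Mesos code: `PluginUnavailable`, `call_with_retry`, and its parameters are hypothetical names, and the `recover` hook stands in for whatever action (such as killing the plugin container) makes the container daemon relaunch the plugin. It also assumes the relaunch has completed by the time `recover` returns, which the real asynchronous code would have to wait for.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PluginUnavailable(Exception):
    """Hypothetical error: the call reached a crashed or stale plugin endpoint."""


def call_with_retry(
    call: Callable[[], T],
    recover: Optional[Callable[[], None]] = None,
    max_attempts: int = 3,
    backoff_secs: float = 0.0,
) -> T:
    """Run `call`, retrying on PluginUnavailable up to `max_attempts` times.

    If all attempts fail and a `recover` hook is given, run it once and make
    one final attempt. Without a hook, the last error is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[PluginUnavailable] = None
    for attempt in range(max_attempts):
        try:
            return call()
        except PluginUnavailable as error:
            last_error = error
            # Back off exponentially between attempts, but not after the last.
            if attempt + 1 < max_attempts:
                time.sleep(backoff_secs * (2 ** attempt))

    if recover is None:
        assert last_error is not None
        raise last_error

    # Assumption: `recover` blocks until the plugin is serving again.
    recover()
    return call()
```

For item 1, `call` would wrap Probe and `recover` would kill the plugin container. For item 2, the same wrapper could be applied to the other recovery-path calls with `recover` left unset, so that a persistent failure still surfaces to the caller, which can then decide whether to restart the SLRP.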