MESOS-8400

Handle plugin crashes gracefully in SLRP recovery.

Details

    Description

      When a CSI plugin crashes, the container daemon in the storage local resource provider (SLRP) resets the corresponding csi::Client service future. However, if a CSI call races with the crash, the call may be issued before the service future is reset, and that call then fails. MESOS-9517 partly addresses this for the CreateVolume and DeleteVolume calls, but a failed call in the SLRP recovery path (e.g., ListVolumes, GetCapacity, Probe) could make the SLRP unrecoverable.
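      The race can be illustrated with a small standalone model. This is only a sketch and not Mesos code: `ServiceSlot`, `CallResult`, `callOnce`, and `callWithRetry` are hypothetical names, and a plain string stands in for the csi::Client service future. It shows that a call issued against the endpoint captured before the reset fails, while re-reading the endpoint on every attempt lets the call succeed once the reset has happened.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the service future that the container daemon
// resets after relaunching the plugin. Not a Mesos type.
struct ServiceSlot {
  std::string endpoint;  // Endpoint the caller currently believes is valid.
};

// Result of a simulated CSI call.
struct CallResult {
  bool ok;
  std::string value;
};

// Simulates one CSI call: it succeeds only if it was issued against the
// endpoint that the live plugin instance is serving.
CallResult callOnce(
    const std::string& usedEndpoint, const std::string& liveEndpoint) {
  if (usedEndpoint != liveEndpoint) {
    return CallResult{false, ""};  // Sent to a crashed plugin instance.
  }
  return CallResult{true, "ok via " + usedEndpoint};
}

// Re-reads the slot before every attempt so that a reset performed between
// attempts is picked up. `onFailure` models waiting for the daemon to reset
// the slot (assumption: the reset eventually happens).
CallResult callWithRetry(
    ServiceSlot& slot,
    const std::string& liveEndpoint,
    int maxAttempts,
    const std::function<void(ServiceSlot&)>& onFailure) {
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    CallResult result = callOnce(slot.endpoint, liveEndpoint);
    if (result.ok) {
      return result;
    }
    onFailure(slot);
  }
  return CallResult{false, ""};
}
```

      With a single attempt the call fails against the stale endpoint. With a second attempt made after the slot has been reset, the same call succeeds.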

      There are two main issues:
      1. For Probe, we should investigate whether a few retry attempts are needed. If the attempts still fail, we should recover from the failure (e.g., kill the plugin container) and have the container daemon relaunch the plugin instead of failing the daemon.

      2. For other calls in the recovery path, we should either retry the call or enable the local resource provider daemon to restart the SLRP after it fails.
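
      The first proposal could take roughly the following shape. This is a minimal sketch under stated assumptions, not the SLRP implementation: `probeOrRecover`, `ProbeOutcome`, and the `killPlugin` hook are hypothetical names, the callbacks are synchronous rather than futures, and the backoff between attempts is omitted for brevity.

```cpp
#include <functional>

// Outcome of the probing loop.
enum class ProbeOutcome {
  READY,      // A probe succeeded within the retry budget.
  RELAUNCHED  // Retries were exhausted and the recovery hook was invoked.
};

// Calls `probe` up to `maxAttempts` times. If every attempt fails, calls
// `killPlugin` (a hypothetical hook, e.g. killing the plugin container) so
// that the container daemon can relaunch the plugin, instead of treating
// the failure as terminal for the daemon.
ProbeOutcome probeOrRecover(
    const std::function<bool()>& probe,
    const std::function<void()>& killPlugin,
    int maxAttempts) {
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (probe()) {
      return ProbeOutcome::READY;
    }
  }
  killPlugin();
  return ProbeOutcome::RELAUNCHED;
}
```

      The retry budget keeps a transient failure from triggering an unnecessary relaunch, while the recovery hook ensures a plugin that never becomes ready is replaced rather than left to fail the daemon.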

            People

              Unassigned
              chhsia0 Chun-Hung Hsiao
              Chun-Hung Hsiao
              Votes: 0
              Watchers: 5
