Details
Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
None
Sprint: Mesosphere RI-6 Sprint 2018-31, Storage R8 Sprint 35, Storage R9 Sprint 36, Storage R9 Sprint 37, Storage R10 Sprint 38
3
Description
The storage local resource provider (SLRP), as currently implemented, does not handle launch failures or task errors of its standalone containers well enough. For example, if an RP container fails to come up during node start, a warning is logged, but an operator still has to notice the degraded functionality, manually check the state of the containers with GET_CONTAINERS, and decide whether the agent needs to be restarted; I suspect operators do not always have enough context to make this decision. It would be better if the provider either enforced a restart by failing over the whole agent, or retried the operation (optionally up to some maximum number of retries).
Issue Links
- is related to: MESOS-8380 Update WebUI to show local resource providers. (Resolved)
- relates to: MESOS-8400 Handle plugin crashes gracefully in SLRP recovery. (Reviewable)