[MESOS-9223] Storage local provider does not sufficiently handle container launch failures or errors

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.8.0
Component/s: agent, storage
Labels:

Epic Link: Resource Provider and CSI Tech Debt
Sprint: Mesosphere RI-6 Sprint 2018-31, Storage R8 Sprint 35, Storage R9 Sprint 36, Storage R9 Sprint 37, Storage R10 Sprint 38
Story Points: 3

Description

The storage local resource provider as currently implemented does not handle launch failures or task errors of its standalone containers well enough. If, for example, an RP container fails to come up during node start, a warning is logged, but an operator still needs to detect the degraded functionality, manually check the state of containers with GET_CONTAINERS, and decide whether the agent needs restarting; I suspect they do not always have enough context for this decision. It would be better if the provider either enforced a restart by failing over the whole agent or retried the operation (optionally up to some maximum number of retries).
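
The two remedies proposed above (bounded retries, then escalation to an agent failover) could be combined roughly as in the following sketch. It is illustrative only and is not the Mesos implementation: `launch_with_retries`, `Outcome`, and the `launch` callback are hypothetical names, and the actual failover mechanism is left to the caller.

```python
from enum import Enum
from typing import Callable, Tuple


class Outcome(Enum):
    """Hypothetical result of trying to bring up a provider container."""
    LAUNCHED = "launched"
    AGENT_FAILOVER = "agent_failover"


def launch_with_retries(
    launch: Callable[[], bool],
    max_retries: int = 3,
) -> Tuple[Outcome, int]:
    """Try to launch a container, retrying up to `max_retries` times.

    `launch` is an assumed callback that returns True on success and
    returns False or raises on failure. Returns the outcome together
    with the number of attempts made (initial attempt plus retries).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            if launch():
                return Outcome.LAUNCHED, attempts
        except Exception:
            # Treat an exception like any other launch failure and retry.
            pass
    # Retries exhausted: signal that the caller should fail over the
    # agent instead of only logging a warning.
    return Outcome.AGENT_FAILOVER, attempts
```

With this shape the operator no longer has to infer degraded functionality from a logged warning: the provider either comes up within the retry budget or reports explicitly that escalation is needed.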

Issue Links

is related to

MESOS-8380 Update WebUI to show local resource providers.

Resolved

relates to

MESOS-8400 Handle plugin crashes gracefully in SLRP recovery.

Reviewable

People

Assignee: Benjamin Bannier

Reporter: Benjamin Bannier

Shepherd: Chun-Hung Hsiao

Votes: 0

Watchers: 3

Dates

Created: 10/Sep/18 11:10

Updated: 24/Jan/19 09:40

Resolved: 24/Jan/19 09:40