  1. Mesos
  2. MESOS-9223

Storage local provider does not sufficiently handle container launch failures or errors

Details

    • Sprint: Mesosphere RI-6 Sprint 2018-31, Storage R8 Sprint 35, Storage R9 Sprint 36, Storage R9 Sprint 37, Storage R10 Sprint 38
    • 3

    Description

      The storage local resource provider, as currently implemented, does not adequately handle launch failures or task errors of its standalone containers. If, e.g., an RP container fails to come up during node start, a warning is logged, but an operator still needs to detect the degraded functionality, manually check the state of the containers with GET_CONTAINERS, and decide whether the agent needs restarting; I suspect they do not always have enough context for this decision. It would be better if the provider either enforced a restart by failing over the whole agent or retried the operation (optionally up to some maximum number of retries).
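
      The retry option proposed in the description could look roughly like the sketch below. It is written in Python for brevity rather than in Mesos's C++, and `launch_with_retries`, `LaunchFailed`, and all parameters are hypothetical names for illustration, not part of the provider. It assumes the launch step can be modeled as a callable that returns True on success. Once the retry budget is exhausted the function raises, so a caller could escalate, e.g. by failing over the agent.

      ```python
      import time
      from typing import Callable


      class LaunchFailed(Exception):
          """Raised when the launch did not succeed within the retry budget."""


      def launch_with_retries(
          launch: Callable[[], bool],
          max_retries: int = 3,
          backoff_seconds: float = 1.0,
          sleep: Callable[[float], None] = time.sleep,
      ) -> int:
          """Call `launch` until it succeeds or the retry budget is exhausted.

          Returns the number of attempts made. Raises LaunchFailed after
          1 + max_retries failed attempts so the caller can escalate.
          """
          attempts = 0
          delay = backoff_seconds
          while True:
              attempts += 1
              if launch():
                  return attempts
              if attempts > max_retries:
                  raise LaunchFailed(
                      f"container launch failed after {attempts} attempts"
                  )
              sleep(delay)
              delay *= 2  # exponential backoff between attempts
      ```

      Passing `sleep` as a parameter keeps the sketch testable without real delays. Whether a real implementation should back off exponentially or use a fixed interval is a design choice this issue leaves open.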

          People

            bbannier Benjamin Bannier
            bbannier Benjamin Bannier
            Chun-Hung Hsiao Chun-Hung Hsiao
            Votes: 0
            Watchers: 4

              Agile

                Completed Sprints:
                Mesosphere RI-6 Sprint 2018-31 ended 25/Oct/18
                Storage R8 Sprint 35 ended 19/Dec/18
                Storage R9 Sprint 36 ended 08/Jan/19
                Storage R9 Sprint 37 ended 16/Jan/19
                Storage R10 Sprint 38 ended 30/Jan/19