MESOS-8391

Mesos agent doesn't notice that a pod task exits or crashes after the agent restart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • None
    • 3

    Description

      (1) Agent doesn't detect that a pod task exits/crashes

      1. Create a Marathon pod with two containers, each of which just runs `sleep 10000`.
      2. Restart the Mesos agent on the node where the pod was launched.
      3. Kill one of the pod tasks.

      Expected result: The Mesos agent detects that one of the tasks got killed, and forwards a `TASK_FAILED` status update to Marathon.

      Actual result: The Mesos agent does nothing, and the Mesos master thinks that both tasks are running just fine. Marathon doesn't take any action because it doesn't receive any update from Mesos.
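
      Step 1 of the reproduction can be sketched as follows. This is a minimal illustration, not the reporter's actual pod definition: the pod id, container names, and resource values are hypothetical, and the field layout assumes Marathon's pod definition format. The script only builds and prints the JSON; it does not contact Marathon.

```python
import json


def build_pod_definition(pod_id="/sleep-pod", command="sleep 10000"):
    """Build a pod definition with two containers that only sleep.

    The pod id, container names, and resource values are illustrative
    assumptions; only the two-container layout and the `sleep 10000`
    command come from the reproduction steps.
    """
    containers = [
        {
            "name": f"sleep{i}",  # hypothetical container name
            "exec": {"command": {"shell": command}},
            "resources": {"cpus": 0.1, "mem": 32},  # assumed minimal resources
        }
        for i in (1, 2)
    ]
    return {
        "id": pod_id,
        "containers": containers,
        "networks": [{"mode": "host"}],
        "scaling": {"kind": "fixed", "instances": 1},
    }


if __name__ == "__main__":
    # Print the definition so it can be inspected or saved to a file.
    print(json.dumps(build_pod_definition(), indent=2))
```

      The printed JSON could then be submitted to Marathon's pods endpoint (`/v2/pods`), assuming a Marathon version with pod support is running.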

      (2) After another agent restart, the agent detects that the task crashed and forwards the correct status update, but the other task stays in `TASK_KILLING` state forever

      1. Perform the steps in (1).
      2. Restart the Mesos agent again.

      Expected result: The Mesos agent detects that one of the tasks crashed, forwards the corresponding status update, and kills the other task too.

      Actual result: The Mesos agent detects that one of the tasks crashed and forwards the corresponding status update, but the other task stays in `TASK_KILLING` state forever.

      Note that after yet another agent restart, the other task finally gets killed and the correct status updates are propagated all the way to Marathon.


          People

            abudnik Andrei Budnik
            ichernetsky Ivan Chernetsky
            Votes: 0
            Watchers: 5
