  1. Mesos
  2. MESOS-2367

Improve slave resiliency in the face of orphan containers


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: agent
    • Labels:
      None
    • Sprint:
      Twitter Mesos Q1 Sprint 5, Twitter Q2 Sprint 1 - 4/13
    • Story Points:
      5

      Description

      Right now there's a case where a misbehaving executor can cause a slave process to flap:

      Quote from Jie Yu:

      1) User tries to kill an instance
      2) Slave sends KillTaskMessage to executor
      3) Executor sends kill signals to task processes
      4) Executor sends TASK_KILLED to slave
      5) Slave updates container cpu limit to be 0.01 cpus
      6) A user-process is still processing the kill signal
      7) The task process cannot exit because it has too little CPU share and is throttled
      8) Executor itself terminates
      9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path.
      10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
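To make steps 5 through 7 concrete, here is a rough arithmetic sketch of what a 0.01 cpus limit could mean in cgroup terms. The mapping is an assumption for illustration and is not taken from the Mesos source: it uses the kernel's conventional 1024 cpu.shares per CPU and the default 100 ms CFS period, and the helper names are hypothetical. The wall-time estimate only applies if a hard CFS quota is enforced; with shares alone, throttling depends on contention.

```python
from typing import Tuple

# Assumed constants: kernel defaults, not values confirmed by this issue.
CPU_SHARES_PER_CPU = 1024   # conventional cpu.shares weight for one CPU
CFS_PERIOD_US = 100_000     # default cpu.cfs_period_us (100 ms)


def cgroup_cpu_limits(cpus: float) -> Tuple[int, int]:
    """Return (cpu.shares, cpu.cfs_quota_us) for a fractional CPU allocation."""
    shares = round(cpus * CPU_SHARES_PER_CPU)
    quota_us = round(cpus * CFS_PERIOD_US)
    return shares, quota_us


def wall_time_for(cpu_seconds: float, cpus: float) -> float:
    """Wall-clock seconds needed to consume cpu_seconds under a hard quota."""
    _, quota_us = cgroup_cpu_limits(cpus)
    return cpu_seconds * CFS_PERIOD_US / quota_us


shares, quota_us = cgroup_cpu_limits(0.01)
exit_path_seconds = wall_time_for(0.5, 0.01)
```

Under these assumptions the container gets about 1 ms of CPU per 100 ms period, so an exit path that needs half a second of CPU time would take roughly 50 seconds of wall-clock time.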

      The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).
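One way to picture the requested behavior is a recovery pass that retries each orphan a bounded number of times and then moves on, rather than failing the whole slave when one container cannot be destroyed. This is a minimal sketch under stated assumptions, not the fix that shipped in 0.23.0 (the issue text does not describe it); `cleanup_orphans`, `destroy`, and the retry parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def cleanup_orphans(
    orphans: Iterable[str],
    destroy: Callable[[str], bool],
    attempts: int = 3,
    backoff_s: float = 0.0,
) -> List[str]:
    """Try to destroy each orphan; return the ids that could not be destroyed.

    A stuck container is reported to the caller instead of raising, so one
    unkillable container does not abort the whole recovery.
    """
    stuck = []
    for container_id in orphans:
        for attempt in range(attempts):
            if destroy(container_id):
                break
            # Linear backoff before the next attempt.
            time.sleep(backoff_s * (attempt + 1))
        else:
            # Every attempt failed: record the container and keep going.
            stuck.append(container_id)
    return stuck


# Example: "c2" stands in for a container whose process is stuck exiting.
stuck = cleanup_orphans(["c1", "c2"], destroy=lambda cid: cid != "c2")
```

The caller can then log or periodically retry the returned ids while the slave continues to serve other containers.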

            People

            • Assignee:
              jieyu Jie Yu
            • Reporter:
              yasumoto Joe Smith
