MESOS-2367

Improve slave resiliency in the face of orphan containers

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: agent
    • Labels: None
    • Sprint: Twitter Mesos Q1 Sprint 5, Twitter Q2 Sprint 1 - 4/13
    • Story Points: 5

    Description

      Right now there's a case where a misbehaving executor can cause a slave process to flap:

      Quote from jieyu:

      1) A user tries to kill an instance.
      2) The slave sends a KillTaskMessage to the executor.
      3) The executor sends kill signals to the task processes.
      4) The executor sends TASK_KILLED to the slave.
      5) The slave updates the container's CPU limit to 0.01 cpus.
      6) A user process is still handling the kill signal.
      7) The task process cannot exit because it has too little CPU share and is throttled.
      8) The executor itself terminates.
      9) The slave tries to destroy the container but cannot, because the user process is stuck in the exit path.
      10) The slave restarts and flaps constantly because it cannot kill the orphan container.
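
To give a rough sense of how little CPU time steps 5 to 7 leave the exiting process, here is a small back-of-the-envelope calculation. It assumes the limit is enforced through Linux cgroup CPU shares and CFS bandwidth control, with the kernel default period of 100 ms and the conventional weight of 1024 shares per CPU. The helper name cpu_limits is hypothetical and is not taken from the Mesos source.

```python
# Illustrative arithmetic only; constants are assumptions, not Mesos code.
CFS_PERIOD_US = 100_000   # Linux default for cpu.cfs_period_us (100 ms)
SHARES_PER_CPU = 1024     # conventional cpu.shares weight for one full CPU
MIN_SHARES = 2            # smallest cpu.shares value the kernel accepts


def cpu_limits(cpus: float) -> dict:
    """Translate a fractional CPU limit into cgroup-style values."""
    return {
        # Relative weight when the CPU is contended.
        "cpu_shares": max(round(cpus * SHARES_PER_CPU), MIN_SHARES),
        # Hard cap: microseconds of CPU time allowed per period.
        "cfs_quota_us": round(cpus * CFS_PERIOD_US),
    }


limits = cpu_limits(0.01)
print(limits)
```

Under these assumptions, a container limited to 0.01 cpus may run for about 1 ms out of every 100 ms, which is why a process doing non-trivial work in its exit path can appear stuck.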

      The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).
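
One possible shape for more tolerant orphan handling is sketched below. This is not the fix that landed for this issue. It only illustrates the idea that a failure to destroy one orphan container is recorded and reported instead of aborting slave recovery. The names recover_orphans, DestroyError and fake_destroy are hypothetical.

```python
import logging

log = logging.getLogger("orphan-recovery")


class DestroyError(Exception):
    """Hypothetical error raised when a container cannot be destroyed."""


def recover_orphans(orphan_ids, destroy):
    """Try to destroy every orphan; return the ids that could not be destroyed.

    `destroy` is any callable that raises DestroyError on failure
    (for example after its own timeout expires).
    """
    stuck = []
    for container_id in orphan_ids:
        try:
            destroy(container_id)
        except DestroyError as err:
            # Keep going: one stuck container should not block recovery.
            log.warning("could not destroy orphan %s: %s", container_id, err)
            stuck.append(container_id)
    return stuck


def fake_destroy(container_id):
    # Stand-in for the real destroy path; "c2" simulates a throttled process.
    if container_id == "c2":
        raise DestroyError("process stuck in exit path")


stuck = recover_orphans(["c1", "c2", "c3"], fake_destroy)
print(stuck)
```

With this structure the caller can retry the stuck containers later or surface them to an operator, rather than restarting in a loop.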

            People

              Assignee: Jie Yu (jieyu)
              Reporter: Joe Smith (yasumoto)