  1. Mesos
  2. MESOS-1503

Improve slave health checking to prevent rapid widespread slave removals.


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master
    • Labels:

      Description

      Per discussions with Tobias Weingartner and Vinod Kone:

      Currently the master uses a SlaveObserver for each registered slave. Each SlaveObserver operates independently and makes decisions about whether the slave is healthy.

      Because the observers are independent, in very rare events (e.g. the masters are partitioned from 75% of the slaves) the master can very rapidly remove a large portion of the slaves in the cluster. Ideally, such an event would be recognized as dangerous and the removals throttled accordingly, based on a more intelligent notion of overall cluster health.

      It may be nice to have a single observer that is responsible for health checking all the slaves. This would allow us to make safer decisions about when to declare slaves unhealthy.
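
To make the proposal concrete, here is a minimal sketch of a single observer that sees every slave and refuses to remove any when too large a fraction looks unhealthy at once. It is not Mesos code: the class name `ClusterHealthObserver`, its methods, and the parameters `timeout`, `max_unhealthy_fraction` and `max_removals_per_check` are all hypothetical, and the thresholds are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterHealthObserver:
    """Hypothetical cluster-wide observer (illustrative, not the Mesos API)."""

    timeout: float = 75.0                # seconds of silence before a slave counts as unhealthy
    max_unhealthy_fraction: float = 0.5  # above this, assume a master-side problem
    max_removals_per_check: int = 1      # cap on removals per check() call
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, slave_id: str, now: float) -> None:
        """Record that a slave responded at time `now`."""
        self.last_seen[slave_id] = now

    def check(self, now: float) -> List[str]:
        """Return the slaves removed by this check, possibly none."""
        # Unhealthy slaves, stalest first.
        unhealthy = sorted(
            (s for s, seen in self.last_seen.items() if now - seen > self.timeout),
            key=lambda s: self.last_seen[s],
        )
        if not unhealthy:
            return []

        # Cluster-wide view: if most slaves look dead at once, the master is
        # more likely partitioned than the slaves are down, so remove nothing.
        if len(unhealthy) / len(self.last_seen) > self.max_unhealthy_fraction:
            return []

        # Otherwise remove at a bounded rate.
        removed = unhealthy[: self.max_removals_per_check]
        for slave_id in removed:
            del self.last_seen[slave_id]
        return removed
```

With four slaves and a 0.5 threshold, a check in which three of them (75%) have timed out removes nothing, whereas a check in which only one has timed out removes that one. Independent per-slave observers cannot make this distinction, because none of them sees the overall fraction.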

            People

            • Assignee: Unassigned
            • Reporter: Benjamin Mahler (bmahler)
            • Votes: 0
            • Watchers: 4