  1. Mesos
  2. MESOS-695

Introduce automated self-healing and coordinated repair to Mesos

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master
    • Labels: None

    Description

      One capability currently missing from Mesos is self-healing. Specifically, a master should be able to detect that something is amiss with a particular host and then attempt to heal that host through a set of automated corrective actions, such as:

      1) restarting processes on the suspect node
      2) rebooting the node
      3) reimaging the node
      4) blacklisting the node from future scheduled work
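
One way to picture the corrective actions listed above is as an escalation ladder, where each failed attempt moves to the next, more disruptive action. The sketch below is only an illustration: the ticket does not say the actions must be tried in this order, and `RepairAction` and `next_action` are hypothetical names, not part of Mesos.

```python
from enum import IntEnum
from typing import Optional


class RepairAction(IntEnum):
    """Corrective actions from the list above, least to most disruptive.

    The ordering is an assumption made for this sketch.
    """
    RESTART_PROCESS = 1
    REBOOT = 2
    REIMAGE = 3
    BLACKLIST = 4


def next_action(failed_attempts: int) -> Optional[RepairAction]:
    """Pick the action to try after `failed_attempts` earlier repairs failed.

    Returns None once every action, including blacklisting, has been used.
    """
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    ladder = list(RepairAction)  # members in definition order
    if failed_attempts >= len(ladder):
        return None
    return ladder[failed_attempts]
```

With this shape, a host that has never been repaired gets `next_action(0)`, which is a process restart, and a host that has already been blacklisted gets `None`.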

      By adding this capability and informing schedulers of the behavior of the hosts in the system, we believe Mesos can run in more of a 'lights out' mode, reducing the operational expense (OpEx) of running the system.

      Note that some coordination will be required to ensure that we don't 'repair' too many nodes at the same time. This logic will need to be centralized, with a single elected authority making these decisions.
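
The coordination requirement could be sketched as a single object that approves or defers repair requests against a fixed cap. This is a minimal illustration under stated assumptions: `RepairCoordinator`, `max_concurrent`, and the method names are hypothetical, and the election of the central authority is not modeled at all. The sketch simply assumes exactly one coordinator instance is active.

```python
from typing import Set


class RepairCoordinator:
    """Hypothetical central authority that caps simultaneous repairs."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self.max_concurrent = max_concurrent
        self.in_repair: Set[str] = set()  # hosts currently being repaired

    def request_repair(self, host: str) -> bool:
        """Return True if `host` may be repaired now, False if it must wait."""
        if host in self.in_repair:
            return True  # already approved; do not count it twice
        if len(self.in_repair) >= self.max_concurrent:
            return False  # cap reached; caller should retry later
        self.in_repair.add(host)
        return True

    def repair_finished(self, host: str) -> None:
        """Release the slot held by `host`, whether or not the repair worked."""
        self.in_repair.discard(host)
```

A fixed cap is the simplest policy. A real design might instead limit repairs per rack or as a fraction of cluster size, which the ticket leaves open.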

            People

              Assignee: Unassigned
              Reporter: Jeff Currier (jcurrier)