MESOS-7426: Support for agent lifecycle management.

Details

    • Type: Epic
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Component/s: agent
    • Epic Name: Mesos Agent Lifecycle

    Description

      This epic coordinates the work to introduce agent lifecycle management in Mesos, so that a framework can be notified of agent node failures. The existing Event::Failure is not enough for a framework to know that a given agent node is never coming back.

      The primary motivations for introducing such a feature would be:

      • Currently, when an agent running a task fails, an operator must intervene manually to remove the node through a configuration API exposed by the framework, e.g., `dcos cassandra node replace` for the Cassandra framework. This has to be done once for every stateful framework running on the cluster.
      • When an agent is marked as unhealthy, the removal rate is bounded if the `--agent_removal_rate_limit` option is set. This is especially problematic for operators relying on EC2 autoscaling groups or bursting workloads to another cloud.
      • When the fault domain associated with an agent changes (e.g., it is moved from an unallocated rack to an allocated rack), there is no feedback mechanism for the framework.

            People

              anandmazumdar Anand Mazumdar