Details
- Epic
- Status: Accepted
- Major
- Resolution: Unresolved
- Mesos Agent Lifecycle
Description
This epic coordinates the work for introducing agent lifecycle management in Mesos, so that a framework can be notified when an agent node fails. The existing Event::Failure does not tell frameworks whether the given agent node is ever coming back.
The primary motivations for introducing such a feature are:
- Currently, when an agent running a task fails, an operator must intervene manually to remove the node via a configuration API exposed by the framework, e.g., `dcos cassandra node replace` for the Cassandra framework. This has to be done once for every stateful framework running on the cluster.
- When an agent is marked as unhealthy, the removal rate is bounded if the `--agent_removal_rate_limit` option is set. This is especially problematic for operators relying on EC2 autoscaling groups or bursting workloads to another cloud.
- When the fault domain associated with an agent changes (e.g., it is moved from an unallocated rack to an allocated rack), there is no feedback mechanism for the framework.
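To illustrate why a bounded removal rate hurts autoscaling setups, here is a rough sketch that estimates how long the master would take to remove a batch of failed agents. It assumes the rate limit value has the form `<permits>/<duration><unit>` (for example `1/10mins`) and treats removal as a steady-state rate. The helper names `parse_rate_limit` and `seconds_to_remove` are hypothetical and not part of Mesos.

```python
import re

# Assumed unit suffixes for the rate limit value; not taken from the Mesos source.
_UNIT_SECONDS = {"secs": 1, "mins": 60, "hrs": 3600}


def parse_rate_limit(value):
    """Parse '<permits>/<duration><unit>' into (permits, window_in_seconds)."""
    match = re.fullmatch(r"(\d+)/(\d+)(secs|mins|hrs)", value)
    if match is None:
        raise ValueError(f"unrecognized rate limit: {value!r}")
    permits, duration, unit = match.groups()
    return int(permits), int(duration) * _UNIT_SECONDS[unit]


def seconds_to_remove(agent_count, value):
    """Steady-state estimate of the time needed to remove `agent_count` agents."""
    permits, window = parse_rate_limit(value)
    return agent_count * window / permits


if __name__ == "__main__":
    # Hypothetical scenario: an autoscaling group terminates 50 agents at once.
    hours = seconds_to_remove(50, "1/10mins") / 3600
    print(f"about {hours:.1f} hours to remove 50 agents at 1/10mins")
```

Under these assumptions, a scale-in event that terminates many agents at once leaves frameworks waiting for hours before every removal is reported.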
Issue Links
- is related to: MESOS-8518 Make lost agent notifications optional for frameworks. (Open)
- relates to: MESOS-5368 Consider introducing persistent agent ID (Open)