Description
In the event of a network partition or another systemic issue, we may see widespread slave removal. There are several approaches we can take to mitigate this, including but not limited to:
- rate-limit slave removal
- change health checking so that it does not rely on a single point of view
- work with frameworks to determine the SLAs of running services before removing a slave
- provide manual controls that allow operator intervention
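
To make the first approach concrete, here is a minimal sketch of rate-limited removal using a token bucket. It is illustrative only and is not Mesos code: `RemovalRateLimiter`, `drain_removals`, the `remove` callback, and the slave IDs are all hypothetical names, and the rate and burst values are arbitrary assumptions.

```python
import time
from collections import deque


class RemovalRateLimiter:
    """Token bucket (hypothetical helper, not a Mesos API).

    Allows up to `burst` removals back to back, then refills permits
    at `rate_per_sec`.
    """

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        if rate_per_sec <= 0 or burst < 1:
            raise ValueError("rate_per_sec must be > 0 and burst must be >= 1")
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self):
        """Consume one permit if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain_removals(pending, limiter, remove):
    """Remove queued slaves while permits last; return the IDs removed.

    Slaves left in `pending` stay registered until a later call, which
    gives a partition time to heal before they are removed.
    """
    removed = []
    while pending and limiter.try_acquire():
        slave_id = pending.popleft()
        remove(slave_id)
        removed.append(slave_id)
    return removed


if __name__ == "__main__":
    # Assumed policy for the demo: at most 2 removals at once, then 1 per second.
    limiter = RemovalRateLimiter(rate_per_sec=1.0, burst=2)
    pending = deque(["s1", "s2", "s3", "s4"])
    print(drain_removals(pending, limiter, remove=lambda sid: None))
    print(list(pending))
```

With a limiter like this, a partition that makes many slaves fail health checks at once only removes a bounded number per interval. The rest remain queued, which also leaves a window for the operator intervention mentioned in the last item.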
Issue Links
- is related to: MESOS-3703 Give frameworks more control when agents fail health checks (Open)