Details
- Improvement
- Status: Accepted
- Major
- Resolution: Unresolved
- 0.25.0
- None
- None
Description
The SlaveObserver triggers removal of an agent after flags.max_slave_ping_timeouts ping timeouts, each of duration flags.slave_ping_timeout. Removal can therefore be caused by transient network failures, e.g., a gray failure of a switch uplink that exhibits heavy or total packet loss. Some network architectures are designed to tolerate such gray failures by supporting multiple paths between hosts, for example with equal-cost multi-path routing (ECMP), where flows are hashed by their 5-tuple onto one of several possible uplinks. In such networks, re-establishing a TCP connection will almost certainly use a new source port, so the new flow will likely be hashed to a different uplink, avoiding the failed uplink and restoring connectivity with the agent.
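To illustrate why a new source port tends to move a flow to another uplink, here is a minimal sketch of 5-tuple hashing. The function name pick_uplink, the use of SHA-256, and the addresses and ports are assumptions made for illustration; real switches use vendor-specific hash functions.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, n_uplinks):
    """Map a flow's 5-tuple to one of n_uplinks (illustrative hash only)."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 8 bytes of the digest to an uplink index.
    return int.from_bytes(digest[:8], "big") % n_uplinks


# Same hosts and destination port; only the ephemeral source port changes
# when the TCP connection is re-established.
old_uplink = pick_uplink("10.0.0.1", 40000, "10.0.1.7", 5051, "tcp", 4)
new_uplink = pick_uplink("10.0.0.1", 40001, "10.0.1.7", 5051, "tcp", 4)
```

Under a roughly uniform hash over n uplinks, a reconnect with a fresh source port lands on a different uplink with probability of about (n - 1) / n, which is why several retries make it very likely that the failed uplink is avoided.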
After failing to receive pongs, the SlaveObserver should first try to re-establish the TCP connection (with exponential back-off) before declaring the agent lost. This avoids significant disruption in cases where large numbers of agents reached through a single failed link would otherwise be removed unnecessarily, while still ensuring that agents that are truly lost are recognized as such.
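The proposed behaviour could be sketched roughly as follows. This is not the Mesos or libprocess API: agent_is_lost, ping, relink, and all parameter names are hypothetical stand-ins, and the back-off constants are arbitrary.

```python
import time
from typing import Callable, List


def backoff_delays(base: float, attempts: int, cap: float) -> List[float]:
    """Exponential back-off delays: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def agent_is_lost(
    ping: Callable[[], bool],      # hypothetical: True if a pong arrived in time
    relink: Callable[[], bool],    # hypothetical: True if a new TCP connection was made
    max_ping_timeouts: int,
    max_relinks: int,
    base_backoff: float = 1.0,
    max_backoff: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    def pings_fail() -> bool:
        # Fails only if max_ping_timeouts pings in a row go unanswered.
        return not any(ping() for _ in range(max_ping_timeouts))

    if not pings_fail():
        return False  # agent answered, nothing to do

    # Before declaring the agent lost, retry over fresh connections.
    for delay in backoff_delays(base_backoff, max_relinks, max_backoff):
        sleep(delay)
        if relink() and not pings_fail():
            return False  # connectivity restored over a new connection

    return True  # still unreachable after every relink attempt
```

Injecting sleep as a parameter keeps the sketch testable without real delays. An agent that is truly gone is still declared lost, only later by the sum of the back-off delays plus the extra ping rounds.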
Issue Links
- relates to: MESOS-5740 Consider adding `relink` functionality to libprocess (Resolved)