Mesos / MESOS-4092

Try to re-establish connection on ping timeouts with agent before removing it

Details

    • Type: Improvement
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.25.0
    • Fix Version/s: None
    • Component/s: master
    • Labels: None

    Description

      The SlaveObserver triggers removal of an agent after flags.max_slave_ping_timeouts consecutive ping timeouts, each of duration flags.slave_ping_timeout. This can happen because of transient network failures, e.g., gray failures of a switch uplink that exhibits heavy or total packet loss. Some network architectures are designed to tolerate such gray failures by providing multiple paths between hosts. One way to implement this is equal-cost multi-path routing (ECMP), where flows are hashed by their 5-tuple onto one of several possible uplinks. In such networks, re-establishing a TCP connection will almost certainly use a new source port, so the new flow will likely be hashed to a different uplink, avoiding the failed uplink and restoring connectivity with the agent.
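
      To illustrate why a new source port tends to change the path, here is a minimal Python sketch of 5-tuple hashing. It is not how any particular switch works: real ECMP hashing is vendor-specific and done in hardware, SHA-256 is only a stand-in, and pick_uplink, uplinks_for_ports, the addresses and the port numbers are all hypothetical names and values chosen for the example.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, num_uplinks):
    """Hypothetical ECMP selector: hash the 5-tuple, reduce modulo the uplink count."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    # SHA-256 stands in for the switch's (unknown) hardware hash function.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def uplinks_for_ports(src_ports, num_uplinks=4):
    """Map each candidate source port to the uplink its flow would be hashed to."""
    # Only the source port varies; the other four tuple fields are example values.
    return {
        port: pick_uplink("10.0.0.1", port, "10.0.1.7", 5051, "tcp", num_uplinks)
        for port in src_ports
    }
```

      An existing connection keeps its 5-tuple and therefore stays on the same (possibly failed) uplink, whereas each reconnect draws a new ephemeral source port and so gets another chance at a healthy uplink.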

      After failing to receive pongs, the SlaveObserver should first try to re-establish a TCP connection (with exponential back-off) before declaring the agent lost. This avoids the significant disruption of unnecessarily removing large numbers of agents that are reached through a single failed link, while still ensuring that agents that are truly lost are recognized as such.
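
      The proposed behaviour could look roughly like the following Python sketch. This is only an illustration of the idea, not Mesos code (the master is written in C++ on libprocess), and backoff_delays, try_reconnect and every default value below are assumptions made for the example.

```python
import socket
import time


def backoff_delays(initial=1.0, factor=2.0, cap=30.0, attempts=5):
    """Yield exponentially growing delays in seconds, each capped at `cap`."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def try_reconnect(host, port, connect=socket.create_connection,
                  sleep=time.sleep, connect_timeout=5.0, **backoff):
    """Return a new connection to the agent, or None if every attempt fails."""
    for delay in backoff_delays(**backoff):
        try:
            # A fresh connection normally gets a new ephemeral source port,
            # so the flow may be hashed onto a different uplink.
            return connect((host, port), timeout=connect_timeout)
        except OSError:
            # Wait before the next attempt (this also waits after the last one).
            sleep(delay)
    # Only now would the caller declare the agent lost.
    return None
```

      The connect and sleep parameters exist only so the sketch can be exercised without a network. The attempt count and cap bound how long a truly lost agent goes undetected, which is the trade-off against removing agents too eagerly.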

            People

              Assignee: Unassigned
              Reporter: Ian Downes (idownes)