[MESOS-5330] Agent should backoff before connecting to the master - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.28.3, 1.0.0
Component/s: None
Labels: None

Description

When an agent starts, it launches a background task (a libprocess process?) to detect the leading master. When the leading master is detected (or changes), the SocketManager's link() method is called and a TCP connection to the master is established. The agent then backs off before sending a ReRegisterSlave message over the newly established connection. The agent needs to back off before attempting to establish the TCP connection to the master, not before sending the first message over it.
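The ordering requested above can be illustrated with a minimal Python sketch. It is not the Mesos implementation (which is C++ on libprocess); the function names, the uniform random delay, and the `max_backoff_secs` parameter are all assumptions made for illustration. The only point is that the delay runs before the connection attempt, so the SYN itself is deferred.

```python
import random
import socket
import time


def backoff_delay(max_backoff_secs, rng=random):
    """Pick a uniformly random delay in [0, max_backoff_secs).

    Hypothetical helper: the distribution and the parameter name are
    assumptions, not taken from the Mesos code base.
    """
    return rng.random() * max_backoff_secs


def connect_after_backoff(address, max_backoff_secs,
                          sleep=time.sleep,
                          connect=socket.create_connection,
                          rng=random):
    """Wait a random delay, then open the TCP connection.

    The sleep happens before connect(), so agents that detect a new
    leader at the same instant do not all send their SYN at once.
    `sleep`, `connect`, and `rng` are injectable so the ordering can be
    checked without a real network.
    """
    delay = backoff_delay(max_backoff_secs, rng)
    sleep(delay)
    return connect(address)
```

Called as `connect_after_backoff(("master.example", 5050), 2.0)`, this would sleep for up to two seconds and only then attempt the connection; the host name and port here are placeholders.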

During scale tests at Twitter we discovered that agents can SYN flood the master upon leader changes. The problem described in MESOS-5200, where ephemeral connections are used, can then occur and exacerbate the flood. The end result is a lot of hosts setting up and tearing down TCP connections every slave_ping_timeout seconds (15 by default), connections failing to be established, and hosts being marked as unhealthy and shut down. We observed ~800 passive TCP connections per second on the leading master during scale tests.
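To get a rough feel for why spreading connection attempts helps, the following Python sketch simulates agents that each pick a random delay and counts the busiest one-second bucket on the master. The agent count and backoff window are hypothetical numbers chosen for illustration, not measurements from the scale tests, and the uniform delay is an assumption.

```python
import random
from collections import Counter


def peak_connections_per_second(num_agents, max_backoff_secs, seed=0):
    """Return the largest number of connection attempts in any 1s bucket.

    Each simulated agent picks a uniform delay in [0, max_backoff_secs)
    before connecting. With max_backoff_secs == 0 every agent connects
    in the same second, which models the thundering herd.
    """
    rng = random.Random(seed)
    arrivals = Counter(
        int(rng.random() * max_backoff_secs) for _ in range(num_agents)
    )
    return max(arrivals.values())


# Hypothetical cluster size and window, for illustration only.
no_backoff = peak_connections_per_second(6000, 0)
with_backoff = peak_connections_per_second(6000, 60)
print(no_backoff, with_backoff)
```

Without a backoff the peak equals the number of agents; with a 60 second window the same agents are spread across roughly 60 buckets, so the peak per second drops to a small fraction of that.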

The problem can be somewhat mitigated by tuning the kernel to handle a thundering herd of TCP connections, but ideally there would not be a thundering herd to begin with.

Issue Links

relates to

MESOS-5359 The scheduler library should have a delay before initiating a connection with master.

Resolved

People

Assignee: Daniel Robinson

Reporter: Daniel Robinson

Shepherd: Benjamin Mahler

Votes: 0

Watchers: 2

Dates

Created: 05/May/16 00:56

Updated: 22/Mar/19 16:40

Resolved: 13/May/16 04:08