[MESOS-10005] Only broadcast framework update to agents associated with framework - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Accepted
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.9.0
Fix Version/s: None
Component/s: master
Labels:
None
Environment:

Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.

Description

We have at any given time ~100 frameworks connected to our Mesos Master with agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been encounter a crash (which I'll document separately) and when that happens, the new Mesos Master will sometimes (but not always) struggle to catch up, and eventually crash again. Usually the third or fourth crash will end with a stable master (not ideal, but at least we can get to stable).

Looking over the logs, I'm seeing hundreds of attempts to contact dead agents each second (and presumably many contacts with healthy agents that don't throw an error):

W1003 21:39:39.299998  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused W1003 21:39:39.300143  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused W1003 21:39:39.300285  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused W1003 21:39:39.302122  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused

I gave bmahler a perf trace of the master on Slack at this point, and it looks like the master at is spending a significant amount of time doing framework update broadcasting. I'll attach the perf dump to the ticket, as well as the log of what the master did while it was alive.

It sounds like currently, every framework update (100 total frameworks in our case) results in a broadcast to all 6000-11000 agents (depending on how busy the cluster is). Also, since our health checks rely on the UI currently, we usually end up killing the master because it fails a health check for long periods of time while overwhelmed by doing these broadcasts.

Could optimizations to be made to either throttle these broadcasts or to only target nodes which need those framework updates?

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

mesos-master.stacks - 5 - new healthy master.gz
04/Oct/19 00:04
13.19 MB
Terra Field
mesos-master.stacks - 4 - framework update - 1.9.0.gz
04/Oct/19 00:09
12.49 MB
Terra Field
mesos-master.stacks - 3 - 1.9.0.gz
04/Oct/19 00:04
12.53 MB
Terra Field
mesos-master.stacks - 2 - 1.9.0.gz
04/Oct/19 00:03
11.01 MB
Terra Field
mesos-master.log.gz
04/Oct/19 00:02
2.41 MB
Terra Field
0001-Send-framework-updates-in-parallel.patch
18/Oct/19 19:07
2 kB
Benjamin Mahler

Issue Links

is blocked by

MESOS-1961 Ensure executor state is correctly reconciled between master and slave.

Accepted

is related to

MESOS-5874 Only send ShutdownFrameworkMessage to agents associated with framework.

Open

relates to

MESOS-10006 Crash in Sorter: "Check failed: resources.contains(slaveId)"

Open

Activity

People

Assignee:: Unassigned

Reporter:: Terra Field

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 04/Oct/19 00:08

Updated:: 18/Oct/19 19:07