Details
- Type: Improvement
- Status: Accepted
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.9.0
- Fix Version/s: None
- Component/s: None
- Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
Description
We have, at any given time, ~100 frameworks connected to our Mesos Master, with agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been encountering a crash (which I'll document separately), and when that happens, the newly elected Mesos Master will sometimes (but not always) struggle to catch up and eventually crash again. Usually the third or fourth crash ends with a stable master (not ideal, but at least we can get back to stable).
Looking over the logs, I'm seeing hundreds of attempts to contact dead agents each second (and presumably many contacts with healthy agents that don't throw an error):
W1003 21:39:39.299998 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
W1003 21:39:39.300143 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
W1003 21:39:39.300285 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
W1003 21:39:39.302122 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused
At this point I gave bmahler a perf trace of the master on Slack, and it looks like the master is spending a significant amount of time broadcasting framework updates. I'll attach the perf dump to the ticket, as well as the master's log from while it was alive.
It sounds like, currently, every framework update (we have ~100 frameworks in total) results in a broadcast to all 6,000-11,000 agents (depending on how busy the cluster is). Also, since our health checks currently rely on the UI, we usually end up killing the master because it fails health checks for long stretches while it is overwhelmed by these broadcasts.
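For rough scale (my own estimate, assuming every framework re-registers after a failover and each re-registration triggers a full broadcast): ~100 frameworks x ~10,000 agents is on the order of 1,000,000 UpdateFrameworkMessage sends, many of which go to dead agents and each burn a failed connection attempt like the ones above.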
Could optimizations be made to either throttle these broadcasts or to target only the agents that actually need those framework updates?
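To illustrate what I mean by targeting, here is a rough sketch in plain C++ (this is not actual Mesos code; Agent/FrameworkInfo/sendUpdate and the agent bookkeeping are made up for the example). The idea is to send UpdateFrameworkMessage only to the agents known to be running tasks or executors for that framework, rather than to every registered agent:

// Hypothetical sketch, not Mesos internals: target UpdateFrameworkMessage at
// the agents associated with a framework instead of broadcasting to all agents.
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct FrameworkInfo {
  std::string id;
  // Agents currently running tasks or executors for this framework.
  std::unordered_set<std::string> agentIds;
};

// Stand-in for the master's registry of registered agents: agent id -> agent pid.
using AgentRegistry = std::unordered_map<std::string, std::string>;

// Stand-in for the actual message send; here it just logs.
void sendUpdate(const std::string& agentPid, const std::string& frameworkId) {
  std::cout << "UpdateFrameworkMessage(" << frameworkId << ") -> " << agentPid << "\n";
}

// What appears to happen today: one send per registered agent, per framework update.
void broadcastToAll(const AgentRegistry& agents, const FrameworkInfo& framework) {
  for (const auto& entry : agents) {
    sendUpdate(entry.second, framework.id);
  }
}

// Proposed: only contact agents associated with the framework, skipping any
// that have already been removed from the registry.
void sendToFrameworkAgents(const AgentRegistry& agents, const FrameworkInfo& framework) {
  for (const auto& agentId : framework.agentIds) {
    auto it = agents.find(agentId);
    if (it != agents.end()) {
      sendUpdate(it->second, framework.id);
    }
  }
}

int main() {
  AgentRegistry agents = {
    {"agent-1", "100.82.103.99:5051"},
    {"agent-2", "100.85.122.190:5051"},
  };
  FrameworkInfo spark{"spark-framework", {"agent-1"}};

  broadcastToAll(agents, spark);         // today: ~100 frameworks x ~10,000 agents of these
  sendToFrameworkAgents(agents, spark);  // proposed: only agents running the framework
  return 0;
}

Even with targeting, a short-lived rate limit on these sends right after a failover might help keep the master responsive enough to pass its health checks while it catches up.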
Attachments
Issue Links
- is blocked by: MESOS-1961 Ensure executor state is correctly reconciled between master and slave. (Accepted)
- is related to: MESOS-5874 Only send ShutdownFrameworkMessage to agents associated with framework. (Open)
- relates to: MESOS-10006 Crash in Sorter: "Check failed: resources.contains(slaveId)" (Open)