MESOS-10005

Only broadcast framework update to agents associated with framework


Details

    • Type: Improvement
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Target Version/s: master
    • Labels: None
    • Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.

    Description

      We have, at any given time, ~100 frameworks connected to our Mesos master, with agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been encountering a crash (which I'll document separately), and when that happens, the new Mesos master will sometimes (but not always) struggle to catch up and eventually crash again. Usually the third or fourth crash ends with a stable master (not ideal, but at least we can get back to stable).

      Looking over the logs, I'm seeing hundreds of attempts to contact dead agents each second (and presumably many contacts with healthy agents that don't throw an error):

      W1003 21:39:39.299998  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
      W1003 21:39:39.300143  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
      W1003 21:39:39.300285  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
      W1003 21:39:39.302122  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused

      I gave bmahler a perf trace of the master on Slack at this point, and it looks like the master is spending a significant amount of time doing framework update broadcasting. I'll attach the perf dump to the ticket, as well as the log of what the master did while it was alive.

      It sounds like, currently, every framework update (100 total frameworks in our case) results in a broadcast to all 6,000-11,000 agents (depending on how busy the cluster is). Also, since our health checks currently rely on the UI, we usually end up killing the master because it fails health checks for long periods while it is overwhelmed by these broadcasts.

      Could optimizations be made to either throttle these broadcasts or target only the agents that need those framework updates?
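      The targeted-broadcast idea above could be sketched roughly as follows. This is a minimal illustration only, not Mesos's actual internal API: it assumes a hypothetical `Master` class that maintains a framework-to-agents index and consults it before sending `UpdateFrameworkMessage`, rather than iterating over every registered agent. All names (`trackFrameworkOnAgent`, `sendFrameworkUpdate`, `agentsByFramework`) are invented for the sketch.

      ```cpp
      #include <map>
      #include <set>
      #include <string>
      #include <vector>

      // Hypothetical sketch of a targeted framework-update broadcast.
      // Instead of sending UpdateFrameworkMessage to all 6,000-11,000 agents,
      // the master tracks which agents are running tasks/executors for each
      // framework and contacts only those.
      class Master {
      public:
        // Record that an agent is running work for the given framework
        // (e.g. called when a task is launched on that agent).
        void trackFrameworkOnAgent(const std::string& frameworkId,
                                   const std::string& agentPid) {
          agentsByFramework[frameworkId].insert(agentPid);
        }

        // Send the framework update only to associated agents; returns the
        // list of agents actually contacted.
        std::vector<std::string> sendFrameworkUpdate(
            const std::string& frameworkId) {
          std::vector<std::string> contacted;
          auto it = agentsByFramework.find(frameworkId);
          if (it == agentsByFramework.end()) {
            return contacted;  // No agents are associated; nothing to send.
          }
          for (const std::string& agentPid : it->second) {
            // send(agentPid, UpdateFrameworkMessage{...});  // elided
            contacted.push_back(agentPid);
          }
          return contacted;
        }

      private:
        // frameworkId -> set of agent endpoints running that framework's work.
        std::map<std::string, std::set<std::string>> agentsByFramework;
      };
      ```

      With ~100 frameworks, this would reduce each update from a cluster-wide broadcast to a message fan-out proportional to the framework's actual footprint, and it naturally skips agents that are dead or were never involved.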

      Attachments

        1. 0001-Send-framework-updates-in-parallel.patch
          2 kB
          Benjamin Mahler
        2. mesos-master.log.gz
          2.41 MB
          Terra Field
        3. mesos-master.stacks - 2 - 1.9.0.gz
          11.01 MB
          Terra Field
        4. mesos-master.stacks - 3 - 1.9.0.gz
          12.53 MB
          Terra Field
        5. mesos-master.stacks - 4 - framework update - 1.9.0.gz
          12.49 MB
          Terra Field
        6. mesos-master.stacks - 5 - new healthy master.gz
          13.19 MB
          Terra Field


            People

              Assignee: Unassigned
              Reporter: Terra Field (tfield)
              Votes: 0
              Watchers: 2
