Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-5200

agent->master messages use temporary TCP connections

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      Background info: When an agent is started it starts a background task (libprocess process?) to detect the leading master. When the leading master is detected (or changes) the SocketManager's link() method is called and a TCP connection to the master is established. The connection is used by the agent to send messages to the master, and the master, upon receiving a RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP connection back to the agent. Each TCP connection is uni-directional, the agent writes messages on one connection and reads messages from the other, and the master reads/writes from the opposite ends of the connections.

      If the initial TCP connection to the master fails to be established then temporary connections are used for all agent->master messages; each send() causes a new TCP connection to be setup, the message sent, then the connection torn down. If link() succeeds a persistent TCP connection is used instead.

      If agents do not use ZK to detect the master then the master detector "detects" the master immediately and attempts to connect immediately. The master may not be listening for connections at the time, or it could be overwhelmed w/ TCP connection attempts, therefore the initial TCP connection attempt fails. The agent does not attempt to establish a new persistent connection as link() is only called when a new master is detected, which only occurs once unless ZK is used.

      It's possible for agents to overwhelm a master w/ TCP connections such that agents cannot establish connections. When this occurs pong messages may not be received by the master so the master shuts down agents thus killing any tasks they were running. We have witnessed this scenario during scale/load tests at Twitter.

      The problem is trivial to reproduce: configure an agent to use a certain master (--master=10.20.30.40:5050), start the agent, wait several minutes then start the master. All the agent->master messages will occur over temporary connections.

      The problem occurs less frequently in production because ZK is typically used for master detection and a master only registers in ZK after it has started listening on its socket. However, the scenario described above can also occur when ZK is used – a thundering herd of 10,000+ slaves establishing TCP connections to the master can result in some connection attempts failing and agents using temporary connections.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              drobinson Daniel Robinson
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: