Mesos / MESOS-5200

agent->master messages use temporary TCP connections

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Background info: When an agent starts, it launches a background task (a libprocess process?) to detect the leading master. When the leading master is detected (or changes), the SocketManager's link() method is called and a TCP connection to the master is established. The agent uses this connection to send messages to the master, and the master, upon receiving a RegisterSlaveMessage or ReregisterSlaveMessage, establishes a second TCP connection back to the agent. Each TCP connection is unidirectional: the agent writes messages on one connection and reads messages from the other, while the master reads from and writes to the opposite ends.

      If the initial TCP connection to the master cannot be established, temporary connections are used for all agent->master messages: each send() sets up a new TCP connection, sends the message, then tears the connection down. If link() succeeds, a persistent TCP connection is used instead.
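
      The behavior described above can be illustrated with a toy model. This is only a sketch, not libprocess code: the class ToySender, the can_connect callable, and the "dropped" outcome are hypothetical names and simplifications introduced for illustration. It shows that when link() fails once, every later send() opens its own connection.

```python
class ToySender:
    """Toy model of the agent's send path; not libprocess code."""

    def __init__(self, can_connect):
        # can_connect: hypothetical callable, True if the master accepts a
        # TCP connection right now.
        self.can_connect = can_connect
        self.linked = False          # True once link() has succeeded
        self.connections_opened = 0  # TCP connections established so far

    def link(self):
        # Called once per detected master: try to set up the persistent
        # connection. There is no retry if this attempt fails.
        if self.can_connect():
            self.linked = True
            self.connections_opened += 1
        return self.linked

    def send(self, message):
        if self.linked:
            return "persistent"       # reuse the linked connection
        if not self.can_connect():
            return "dropped"          # simplification: nothing was sent
        self.connections_opened += 1  # set up, send, tear down
        return "temporary"


master = {"listening": False}
sender = ToySender(lambda: master["listening"])
sender.link()                         # fails: the master is not up yet
master["listening"] = True            # the master starts later
print([sender.send(m) for m in ("a", "b", "c")])  # ['temporary', 'temporary', 'temporary']
print(sender.connections_opened)                  # 3
```

      In this model the connection count grows by one per message for an unlinked agent, whereas a linked agent stays at a single connection.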

      If agents do not use ZK to detect the master, the master detector "detects" the master immediately and the agent attempts to connect right away. The master may not be listening for connections at that time, or it may be overwhelmed with TCP connection attempts, so the initial connection attempt can fail. The agent does not then attempt to establish a new persistent connection, because link() is called only when a new master is detected, which happens only once unless ZK is used.
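
      The timing issue can be sketched as a small simulation. This is a simplified model under stated assumptions, not Mesos code: the function count_temporary_sends and its parameters are hypothetical, times are arbitrary units, and link() is assumed to succeed exactly when the master is already listening at the moment of detection.

```python
def count_temporary_sends(detections, master_up_at, sends):
    """Count sends that would use a temporary connection (toy model).

    detections:   times at which the detector reports a master; link() is
                  attempted only at these times.
    master_up_at: time from which the master accepts connections.
    sends:        times at which the agent sends a message.
    """
    DETECT, SEND = 0, 1
    # Order events by time; a detection at the same time as a send goes first.
    events = sorted([(t, DETECT) for t in detections] +
                    [(t, SEND) for t in sends])
    linked = False
    temporary = 0
    for t, kind in events:
        if kind == DETECT:
            # link() succeeds only if the master is already listening.
            linked = t >= master_up_at
        elif not linked and t >= master_up_at:
            # No persistent link, but the master is reachable now.
            temporary += 1
    return temporary


# Static --master flag: detection at t=0, master starts at t=300.
print(count_temporary_sends([0], 300, [310, 320, 330]))    # 3
# ZK-style detection: the master is reported once it is listening.
print(count_temporary_sends([300], 300, [310, 320, 330]))  # 0
```

      With a single detection event that precedes master startup, no later event ever triggers another link(), so every subsequent message falls back to a temporary connection.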

      It's possible for agents to overwhelm a master with TCP connections to the point that agents cannot establish connections. When this occurs, pong messages may not be received by the master, so the master shuts down the agents, killing any tasks they were running. We have witnessed this scenario during scale/load tests at Twitter.

      The problem is trivial to reproduce: configure an agent to use a specific master (--master=10.20.30.40:5050), start the agent, wait several minutes, then start the master. All agent->master messages will then be sent over temporary connections.

      The problem occurs less frequently in production because ZK is typically used for master detection, and a master registers in ZK only after it has started listening on its socket. However, the scenario described above can also occur when ZK is used: a thundering herd of 10,000+ agents establishing TCP connections to the master can cause some connection attempts to fail, leaving those agents on temporary connections.

          People

            Assignee: Unassigned
            Reporter: Daniel Robinson (drobinson)