  1. Mesos
  2. MESOS-7569

Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.3, 1.2.2, 1.3.1, 1.4.0
    • Component/s: agent
    • Labels:
      None

      Description

      Users whose clusters contain executors without the fix for MESOS-7057 may see these executors destroyed whenever the agent restarts (or is upgraded).

      This occurs when these old executors have connections that have been idle for more than 5 days (the default conntrack TCP timeout). At that point, the connection has timed out and is no longer tracked by conntrack. From what we've seen, if the agent stays up, packets still flow between the executor and the agent. However, once the agent restarts, in some cases (presence of a DROP rule, or some flavors of NAT), the executor does not receive the RST/FIN from the kernel and holds a half-open TCP connection. When the executor then responds to the reconnect message from the restarted agent, its half-open TCP connection closes, and the executor is destroyed by the agent.

      To allow users to preserve the tasks running in these "old" executors (i.e. those without the MESOS-7057 fix), we can add optional retrying of the reconnect message in the agent. This allows the old executor to correctly establish a link to the agent when the second reconnect message is handled.
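
The effect of retrying can be illustrated with a toy model. This is only a sketch of the behavior described above, not Mesos code: the `OldExecutor` class, its fields, and the `recover` function are hypothetical names invented for the illustration, and the model assumes that the first reply on the half-open link is lost and merely closes the stale socket.

```python
from dataclasses import dataclass


@dataclass
class OldExecutor:
    """Toy model of an executor without the MESOS-7057 fix (hypothetical)."""

    link_half_open: bool = True  # stale link left over from the previous agent
    linked: bool = False         # True once a working link to the new agent exists

    def handle_reconnect(self) -> bool:
        """Return True if the executor's reply reaches the restarted agent."""
        if self.link_half_open:
            # Assumption: replying on the stale socket fails, the socket is
            # closed, and the reply is lost. The next reply uses a new link.
            self.link_half_open = False
            return False
        self.linked = True
        return True


def recover(executor: OldExecutor, max_reconnects: int) -> bool:
    """Agent side: send up to max_reconnects reconnect messages."""
    for _ in range(max_reconnects):
        if executor.handle_reconnect():
            return True
    # No reply arrived; in this model the agent would destroy the executor.
    return False


if __name__ == "__main__":
    print("single reconnect:", recover(OldExecutor(), max_reconnects=1))
    print("with one retry:  ", recover(OldExecutor(), max_reconnects=2))
```

In this model a single reconnect message is never enough for an executor holding a half-open link, while one retry is, which matches the "second reconnect message" behavior described in this issue.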

      Longer term, heartbeating or TCP keepalives will prevent the connections from reaching the conntrack timeout (see MESOS-7568).
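
As a rough illustration of the keepalive approach (not the change tracked in MESOS-7568), the sketch below enables TCP keepalives on a socket using Python's standard `socket` module. The helper name `enable_keepalive` and the chosen timing values are assumptions made for the example; the only figure taken from this issue is the 5-day conntrack timeout. The per-socket timing options are platform-specific, so each is applied only if the platform exposes it.

```python
import socket

# 5 days, the default conntrack TCP timeout mentioned in this issue.
CONNTRACK_TIMEOUT_S = 5 * 24 * 3600


def enable_keepalive(sock: socket.socket,
                     idle_s: int = 600,
                     interval_s: int = 60,
                     probes: int = 5) -> None:
    """Enable TCP keepalives so an idle link keeps sending traffic.

    The defaults are illustrative assumptions, chosen only to stay far
    below the conntrack timeout.
    """
    # Worst-case time before a dead peer is detected must stay well under
    # the conntrack timeout, otherwise keepalives do not help.
    assert idle_s + interval_s * probes < CONNTRACK_TIMEOUT_S

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Timing knobs are not available on every platform, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With probes sent every few minutes on an otherwise idle connection, the connection should not sit idle long enough to reach the conntrack timeout.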


            People

            • Assignee: Benjamin Mahler (bmahler)
            • Reporter: Benjamin Mahler (bmahler)
