[MESOS-7569] Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.3, 1.2.2, 1.3.1, 1.4.0
Component/s: agent
Labels: None
Target Version/s: 1.1.3, 1.2.2, 1.3.1, 1.4.0

Description

Users whose clusters run executors without the fix for ~~MESOS-7057~~ may see those executors destroyed whenever the agent restarts (or is upgraded).

This occurs when these old executors have connections that have been idle for more than 5 days (the default conntrack TCP timeout). At that point, the connection has timed out and is no longer tracked by conntrack. From what we've seen, if the agent stays up, packets still flow between the executor and the agent. However, once the agent restarts, in some cases (presence of a DROP rule, or some flavors of NATing) the executor does not receive the RST/FIN from the kernel and holds a half-open TCP connection. When the executor then responds to the reconnect message from the restarted agent, its half-open TCP connection closes, and the executor will be destroyed by the agent.

To allow users to preserve the tasks running in these "old" executors (i.e. those without the ~~MESOS-7057~~ fix), we can add optional retrying of the reconnect message in the agent. This allows the old executor to correctly establish a link to the agent when the second reconnect message is handled.
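The effect of the retry can be illustrated with a toy model. This is only a sketch, not Mesos code: `OldExecutor`, `handle_reconnect`, and `reconnect_with_retries` are hypothetical names, and the model assumes that the first reply is lost on the stale socket and that the executor's next send goes out on a fresh connection.

```python
from dataclasses import dataclass


@dataclass
class OldExecutor:
    """Toy model of an executor without the MESOS-7057 fix (hypothetical)."""

    # True while the executor still holds the stale, half-open connection
    # left over from before the agent restart.
    link_half_open: bool = True
    reregistered: bool = False

    def handle_reconnect(self) -> bool:
        """Handle one reconnect message; return True if the reply reached the agent."""
        if self.link_half_open:
            # Assumption: the reply is sent on the stale socket and is lost.
            # The socket closes, so the next send opens a new connection.
            self.link_half_open = False
            return False
        self.reregistered = True
        return True


def reconnect_with_retries(executor: OldExecutor, max_attempts: int) -> int:
    """Send the reconnect message up to max_attempts times.

    Returns the attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, max_attempts + 1):
        if executor.handle_reconnect():
            return attempt
    return 0
```

In this model a single reconnect message never succeeds against a half-open link, while a second message does, which matches the behavior described above. An executor whose link is healthy re-registers on the first attempt, so the retry costs it nothing beyond a duplicate message.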

Longer term, heartbeating or TCP keepalives will prevent the connections from reaching the conntrack timeout (see MESOS-7568).
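As a rough illustration of the keepalive idea, the sketch below enables TCP keepalives on a plain socket with an idle time well below the 5-day conntrack timeout (432000 seconds). It is not how libprocess would do it; `enable_keepalive` and its default values are assumptions chosen for the example, and the per-socket tuning options are only applied where the platform exposes them.

```python
import socket

# 5 days, the default conntrack timeout for established TCP connections.
CONNTRACK_ESTABLISHED_TIMEOUT_SECS = 5 * 24 * 60 * 60


def enable_keepalive(sock, idle_secs=3600, interval_secs=60, probes=5):
    """Turn on TCP keepalives so an idle link is probed before conntrack forgets it.

    The defaults are illustrative values, not ones taken from Mesos.
    """
    if idle_secs >= CONNTRACK_ESTABLISHED_TIMEOUT_SECS:
        raise ValueError("keepalive idle time must be below the conntrack timeout")

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is platform specific; apply each option only if present.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Because each probe is ordinary traffic on the connection, a link that would otherwise sit idle keeps refreshing its conntrack entry. An application-level heartbeat achieves the same thing without depending on socket options.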

Issue Links

is related to

MESOS-7540 Add an agent flag for executor re-registration timeout. (Resolved)

relates to

MESOS-5332 TASK_LOST on slave restart potentially due to executor race condition (Resolved)

MESOS-7057 Consider using the relink functionality of libprocess in the executor driver. (Resolved)

MESOS-7568 Introduce a heartbeat mechanism for v0 executor <-> agent links. (Accepted)

MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)

People

Assignee: Benjamin Mahler

Reporter: Benjamin Mahler

Votes: 0

Watchers: 4

Dates

Created: 26/May/17 00:56

Updated: 27/May/17 02:24

Resolved: 26/May/17 23:59