Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.0.2, 1.1.0
- None
- Mesosphere Sprint 51
- 2
Description
As outlined in the root cause analysis for MESOS-5332, an iptables firewall can terminate an idle connection after a timeout (the default is 5 days). When this happens, the executor driver is not notified of the disconnection and continues to assume that it is still connected to the agent.
When the agent process is restarted, the executor still tries to reuse the old, broken connection to send the re-register message to the agent. Only then does it eventually discover that the connection is broken (due to the nature of TCP); it calls the exited callback and terminates itself 15 minutes later, when the recovery timeout expires.
To mitigate this, an executor should always relink when it receives a reconnect request from the agent.
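The difference between reusing the old link and relinking can be illustrated with a small simulation. This is only a sketch and not Mesos or libprocess code: `Link`, `Executor`, `on_reconnect_request`, and the `relink_on_reconnect` switch are hypothetical names invented for illustration, and the model reports the broken connection immediately, whereas real TCP may take a while to surface the error.

```python
class Link:
    """Hypothetical stand-in for the persistent executor-to-agent connection."""

    def __init__(self):
        # Set to True to model a firewall dropping the idle connection
        # without notifying either endpoint.
        self.silently_dropped = False

    def send(self, message):
        # The sender only learns the link is dead when it tries to use it.
        if self.silently_dropped:
            raise ConnectionError("connection was dropped while idle")
        return message


class Executor:
    """Hypothetical executor model; not the real Mesos executor driver."""

    def __init__(self, relink_on_reconnect):
        self.relink_on_reconnect = relink_on_reconnect
        self.link = Link()
        self.exited = False

    def on_reconnect_request(self):
        """Handle a reconnect request from the restarted agent."""
        if self.relink_on_reconnect:
            # Proposed behaviour: discard the possibly stale connection
            # and create a fresh one before re-registering.
            self.link = Link()
        try:
            self.link.send("re-register")
            return True
        except ConnectionError:
            # Old behaviour: the stale link fails and the exited
            # callback fires, which leads to executor shutdown.
            self.exited = True
            return False


def reregisters_after_firewall_drop(relink_on_reconnect):
    """Return True if the executor survives an idle-connection drop."""
    executor = Executor(relink_on_reconnect)
    executor.link.silently_dropped = True  # firewall timeout elapses
    return executor.on_reconnect_request()


if __name__ == "__main__":
    print("reuse old link:", reregisters_after_firewall_drop(False))
    print("always relink: ", reregisters_after_firewall_drop(True))
```

In this model the executor that reuses its old link fails to re-register, while the one that relinks succeeds, because the fresh connection does not depend on the state the firewall discarded.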
Issue Links
- is related to
  - MESOS-5332 TASK_LOST on slave restart potentially due to executor race condition (Resolved)
  - MESOS-7569 Allow "old" executors with half-open connections to be preserved during agent upgrade / restart. (Resolved)