As outlined in the root cause analysis for MESOS-5332, it is possible for an iptables firewall to terminate an idle connection after a timeout (the default is 5 days). When this happens, the executor driver is not notified of the disconnection and continues to assume that it is still connected to the agent.
When the agent process is restarted, the executor still tries to reuse the old, broken connection to send the re-register message to the agent. Only at this point does it realize that the connection is broken (with TCP, a silently dropped connection is detected only when a send is attempted). It then calls the exited callback and terminates itself 15 minutes later, when the recovery timeout expires.
To offset this, an executor should always relink when it receives a reconnect request from the agent.
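The difference between reusing the stale connection and relinking can be illustrated with a toy model. This is only a sketch: the classes Connection, Agent and Executor and the function simulate are hypothetical names invented for illustration, and none of this is the Mesos or libprocess API. The firewall is modeled by flipping a flag on the connection without telling the executor.

```python
class Connection:
    """Toy stand-in for a TCP connection (hypothetical, not a real socket)."""

    def __init__(self):
        self.alive = True


class Agent:
    """Toy agent that records the messages it receives."""

    def __init__(self):
        self.received = []


class Executor:
    """Toy executor holding one long-lived connection to the agent."""

    def __init__(self, agent):
        self.agent = agent
        self.conn = Connection()  # initial link to the agent

    def link(self):
        # Relinking discards the old connection and opens a fresh one.
        self.conn = Connection()

    def send(self, message):
        # A dead connection is only noticed when a send is attempted.
        if not self.conn.alive:
            return False
        self.agent.received.append(message)
        return True

    def on_reconnect(self, relink):
        # Called when the agent asks the executor to reconnect.
        if relink:
            self.link()
        return self.send("re-register")


def simulate(relink):
    """Return True if the re-register message reaches the agent."""
    agent = Agent()
    executor = Executor(agent)
    # The firewall drops the idle connection; the executor is not notified.
    executor.conn.alive = False
    return executor.on_reconnect(relink=relink)
```

In this model, simulate(relink=False) fails to deliver the re-register message because the executor reuses the dropped connection, while simulate(relink=True) delivers it over a fresh one.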