Mesos / MESOS-7057

Consider using the relink functionality of libprocess in the executor driver.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0
    • Fix Version/s: 1.1.2, 1.2.0
    • Component/s: None
    • Labels:
    • Target Version/s:
    • Sprint:
      Mesosphere Sprint 51
    • Story Points:
      2

      Description

      As outlined in the root cause analysis for MESOS-5332, an iptables firewall can terminate an idle connection after a timeout (the default is 5 days). When this happens, the executor driver is not notified of the disconnection and continues to assume that it is still connected to the agent.

      When the agent process is restarted, the executor still tries to reuse the old, broken connection to send the re-register message to the agent. Only at this point does it discover that the connection is broken, since a silently dropped TCP connection is typically detected only when a send fails. The executor then invokes the exited callback and terminates itself 15 minutes later, when the recovery timeout expires.

      To offset this, an executor should always relink when it receives a reconnect request from the agent.
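      The failure and the proposed fix can be illustrated with a small self-contained C++ model. This is only a sketch and does not use libprocess: `Connection`, `Executor`, `LinkMode`, `onReconnect` and `reregister` are hypothetical names invented for illustration. The `REUSE` and `RECONNECT` modes are meant to mirror, as far as I recall, the `RemoteConnection` argument of libprocess's `link()`, but the real signature should be checked against `process.hpp`.

```cpp
#include <memory>
#include <string>

// Toy stand-in for a TCP socket. `dropped` models a firewall silently
// discarding the idle connection: neither endpoint is notified.
struct Connection {
  bool dropped = false;

  // A send is the first point at which the breakage becomes visible.
  bool send(const std::string&) { return !dropped; }
};

// Hypothetical link modes for this model.
enum class LinkMode { REUSE, RECONNECT };

class Executor {
public:
  Executor() : connection(std::make_shared<Connection>()) {}

  // REUSE keeps the existing socket if there is one;
  // RECONNECT always replaces it with a fresh one.
  void link(LinkMode mode) {
    if (mode == LinkMode::RECONNECT || !connection) {
      connection = std::make_shared<Connection>();
    }
  }

  // Handles the agent's reconnect request. Returns true if the
  // re-register message was delivered.
  bool onReconnect(LinkMode mode) {
    link(mode);
    if (!connection->send("re-register")) {
      // In the real driver this is where the exited callback would fire
      // and the recovery timeout would start counting down.
      connection.reset();
      return false;
    }
    return true;
  }

  std::shared_ptr<Connection> connection;
};

// Runs one scenario: optionally drop the connection behind the executor's
// back, then deliver a reconnect request using the given link mode.
bool reregister(bool firewallDroppedConnection, LinkMode mode) {
  Executor executor;
  executor.connection->dropped = firewallDroppedConnection;
  return executor.onReconnect(mode);
}
```

      In this model, `reregister(true, LinkMode::REUSE)` fails because the stale socket is reused, while `reregister(true, LinkMode::RECONNECT)` succeeds because the executor opens a fresh connection before sending.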

              People

              • Assignee:
                Anand Mazumdar (anandmazumdar)
              • Reporter:
                Anand Mazumdar (anandmazumdar)
              • Shepherd:
                Vinod Kone
