Mesos / MESOS-1668

Handle a temporary one-way master --> slave socket closure.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: agent, master
    • Labels:
    • Epic Link:
    • Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6
    • Story Points: 2

      Description

      In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:

      → Master and Slave connected operating normally.
      → Temporary one-way network failure, master→slave link breaks.
      → Master marks slave as disconnected.
      → Network is restored and health checking continues normally, so the slave is not removed. The slave does not attempt to re-register, since it is receiving pings once again.
      → Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!
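
The sequence above can be sketched as a toy state model. This is only an illustration, not Mesos code: the `Master` and `Slave` classes, their fields, and the method names are all hypothetical, and the only assumption carried over from the description is that the slave re-registers solely when it stops receiving pings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Slave:
    """Hypothetical slave-side view: it re-registers only when pings stop."""
    receiving_pings: bool = True

    def wants_to_reregister(self) -> bool:
        return not self.receiving_pings


@dataclass
class Master:
    """Hypothetical master-side view of a single slave."""
    slave_connected: bool = True

    def on_socket_closed(self) -> None:
        # The master -> slave link broke, so the master marks the slave
        # as disconnected.
        self.slave_connected = False

    def on_reregister(self) -> None:
        self.slave_connected = True


def simulate_one_way_failure() -> Tuple[bool, bool]:
    master, slave = Master(), Slave()

    # Temporary one-way failure: only the master notices the closure.
    master.on_socket_closed()

    # The network recovers before the slave misses enough pings to
    # consider itself disconnected, so from its side nothing happened.
    slave.receiving_pings = True

    if slave.wants_to_reregister():
        master.on_reregister()

    # (master's view of the slave, whether the slave will re-register)
    return master.slave_connected, slave.wants_to_reregister()
```

In this model `simulate_one_way_failure()` ends with the master still holding the slave as disconnected while the slave has no reason to re-register, which is the stuck state described above.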

      We were originally thinking of using a failover timeout in the master to remove slaves that don't re-register. However, this can be dangerous when ZooKeeper issues are preventing slaves from re-registering with the master; we do not want to remove a ton of slaves in this situation.

      Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.
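
A minimal sketch of the proposal, under stated assumptions: the message name `REREGISTER_REQUEST`, the timeout value, and every class and method name below are hypothetical and are not taken from the Mesos code base. The sketch only shows the two triggers mentioned above: a health-check reply arriving after the timeout, and any other message arriving from a slave the master considers disconnected.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REREGISTER_TIMEOUT_SECS = 10.0  # hypothetical value, for illustration only
REREGISTER_REQUEST = "REREGISTER_REQUEST"  # hypothetical message name


@dataclass
class Master:
    """Hypothetical master-side bookkeeping for a single slave."""
    disconnected_since: Optional[float] = None
    outbox: List[str] = field(default_factory=list)

    def on_socket_closed(self, now: float) -> None:
        if self.disconnected_since is None:
            self.disconnected_since = now

    def on_health_check_reply(self, now: float) -> None:
        # The slave is healthy. If it has been disconnected for longer
        # than the timeout, tell it that it must re-register.
        if self.disconnected_since is None:
            return
        if now - self.disconnected_since >= REREGISTER_TIMEOUT_SECS:
            self.outbox.append(REREGISTER_REQUEST)
            # Restart the timer so the request is not sent on every reply.
            self.disconnected_since = now

    def on_other_message(self) -> None:
        # A status update (or other message) from a slave that is
        # disconnected in the master also triggers the request.
        if self.disconnected_since is not None:
            self.outbox.append(REREGISTER_REQUEST)

    def on_reregister(self) -> None:
        self.disconnected_since = None
```

Unlike a failover timeout, nothing here removes the slave: a slave that cannot reach the master because of ZooKeeper issues is simply asked again later.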

              People

              • Assignee: Vinod Kone (vinodkone)
              • Reporter: Benjamin Mahler (bmahler)
              • Shepherd: Benjamin Mahler
