Mesos / MESOS-1668

Handle a temporary one-way master --> slave socket closure.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • None
    • Fix Version/s: 0.21.0
    • Component/s: agent, master
    • Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6
    • 2

    Description

      In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:

      → Master and slave are connected and operating normally.
      → A temporary one-way network failure breaks the master→slave link.
      → Master marks the slave as disconnected.
      → Network is restored and health checking continues normally, so the slave is not removed. The slave does not attempt to re-register because it is receiving pings once again.
      → The slave remains disconnected according to the master and never tries to re-register. Bad!
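
The sequence above can be reproduced with a small state model. This is only an illustrative sketch, not Mesos code: the `Master` and `Slave` classes, their method names, and the `max_missed_pings` threshold are all hypothetical, and the model assumes the slave's only trigger for re-registering is a run of missed pings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Master:
    """Hypothetical master-side view of a single slave."""
    slave_connected: bool = True

    def on_socket_closed(self) -> None:
        # The master treats a closed master->slave socket as a disconnection.
        self.slave_connected = False

    def on_reregister(self) -> None:
        # Only a re-registration flips the slave back to connected.
        self.slave_connected = True


@dataclass
class Slave:
    """Hypothetical slave that re-registers only after missing enough pings."""
    missed_pings: int = 0
    max_missed_pings: int = 3  # illustrative threshold, not a Mesos default

    def on_ping(self) -> None:
        self.missed_pings = 0

    def on_ping_timeout(self) -> None:
        self.missed_pings += 1

    def wants_to_reregister(self) -> bool:
        return self.missed_pings >= self.max_missed_pings


def simulate() -> Tuple[bool, bool]:
    """Replay the scenario; return (master sees slave connected, slave wants to re-register)."""
    master, slave = Master(), Slave()
    master.on_socket_closed()  # one-way failure: master->slave link breaks
    slave.on_ping_timeout()    # a ping is lost during the short outage
    slave.on_ping()            # network restored, pings arrive again
    if slave.wants_to_reregister():
        master.on_reregister()
    return master.slave_connected, slave.wants_to_reregister()
```

Under these assumptions `simulate()` ends with the master still considering the slave disconnected while the slave sees no reason to re-register, which is the stuck state described in the last step.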

      We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.

      Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.
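
A minimal sketch of this proposal, under stated assumptions: the `Master` class, the `REREGISTER_REQUEST` message name, the `outbox` list standing in for the network, and the 10-second timeout are all hypothetical and are not taken from the Mesos code base. The master never removes the slave here; it only asks the slave to re-register.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

REREGISTER_REQUEST = "REREGISTER_REQUEST"  # hypothetical message name


@dataclass
class SlaveState:
    connected: bool = True
    disconnected_at: Optional[float] = None  # time the master marked it disconnected


@dataclass
class Master:
    reregister_timeout: float = 10.0  # seconds; illustrative value
    slaves: Dict[str, SlaveState] = field(default_factory=dict)
    outbox: List[Tuple[str, str]] = field(default_factory=list)  # stands in for the network

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = SlaveState()

    def disconnect(self, slave_id: str, now: float) -> None:
        state = self.slaves[slave_id]
        state.connected = False
        state.disconnected_at = now

    def on_reregister(self, slave_id: str) -> None:
        state = self.slaves[slave_id]
        state.connected = True
        state.disconnected_at = None

    def on_pong(self, slave_id: str, now: float) -> None:
        # The slave is passing health checks. If it has stayed disconnected
        # past the timeout, tell it to re-register instead of removing it.
        state = self.slaves[slave_id]
        if state.connected or state.disconnected_at is None:
            return
        if now - state.disconnected_at >= self.reregister_timeout:
            self._request_reregistration(slave_id, now)

    def on_status_update(self, slave_id: str, now: float) -> None:
        # Any message from a slave the master considers disconnected can
        # trigger the same request immediately.
        if not self.slaves[slave_id].connected:
            self._request_reregistration(slave_id, now)

    def _request_reregistration(self, slave_id: str, now: float) -> None:
        self.outbox.append((slave_id, REREGISTER_REQUEST))
        # Restart the timer so the request is not resent on every pong.
        self.slaves[slave_id].disconnected_at = now
```

In this sketch a slave disconnected at time 0 gets no request for a pong at time 5, gets one for a pong at time 10, and is marked connected again once `on_reregister` is called. Because the master only sends a request, a ZooKeeper problem that delays re-registration does not cause slaves to be removed.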

            People

              vinodkone Vinod Kone
              bmahler Benjamin Mahler
              Benjamin Mahler Benjamin Mahler
