Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- None
- Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6
- 2
Description
In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:
→ Master and slave are connected and operating normally.
→ Temporary one-way network failure, master→slave link breaks.
→ Master marks slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is not removed. The slave does not attempt to re-register, since it is receiving pings once again.
→ Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!
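The sequence above can be reproduced with a small simulation. This is only a sketch to make the stuck state concrete, not the Mesos implementation: the `Master` and `Slave` classes, the tick-based timing, and the `MAX_MISSED` threshold are all assumptions made for illustration.

```python
MAX_MISSED = 5  # assumed number of consecutive misses before either side acts


class Master:
    def __init__(self):
        self.connected = True   # the master's view of the slave
        self.removed = False
        self.missed_pongs = 0

    def on_link_broken(self):
        # The master -> slave link broke: mark the slave disconnected.
        self.connected = False

    def on_pong(self):
        # Health check passed; note that `connected` is NOT restored here.
        self.missed_pongs = 0

    def on_pong_timeout(self):
        self.missed_pongs += 1
        if self.missed_pongs >= MAX_MISSED:
            self.removed = True


class Slave:
    def __init__(self):
        self.missed_pings = 0
        self.reregister_attempts = 0

    def on_ping(self):
        self.missed_pings = 0

    def on_ping_timeout(self):
        # The slave only re-registers after missing enough pings in a row.
        self.missed_pings += 1
        if self.missed_pings >= MAX_MISSED:
            self.reregister_attempts += 1


def simulate(outage_ticks, total_ticks=20):
    """Break the link for `outage_ticks` ticks starting at tick 1."""
    master, slave = Master(), Slave()
    for tick in range(total_ticks):
        if 1 <= tick <= outage_ticks:
            if tick == 1:
                master.on_link_broken()
            master.on_pong_timeout()
            slave.on_ping_timeout()
        else:
            slave.on_ping()
            master.on_pong()
    return master, slave


master, slave = simulate(outage_ticks=2)
print(master.connected, master.removed, slave.reregister_attempts)
# prints: False False 0
```

With an outage shorter than the health-check threshold, the master still considers the slave disconnected at the end of the run, the slave was never removed, and the slave never tried to re-register.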
We originally considered using a failover timeout in the master to remove slaves that don't re-register. However, this can be dangerous when ZooKeeper issues are preventing slaves from re-registering with the master; we do not want to remove a ton of slaves in this situation.
Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.
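A minimal sketch of this proposal, under stated assumptions: the `REREGISTER_REQUEST` message name, the `REREGISTER_TIMEOUT` value, the tick-based clock, and the `Master`/`Slave` classes are hypothetical and only illustrate the idea. They are not the message or code that Mesos actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REREGISTER_TIMEOUT = 3                     # hypothetical timeout, in ticks
REREGISTER_REQUEST = "REREGISTER_REQUEST"  # hypothetical message name


@dataclass
class Master:
    connected: bool = True
    disconnected_at: Optional[int] = None
    outbox: List[str] = field(default_factory=list)

    def on_link_broken(self, now):
        self.connected = False
        self.disconnected_at = now

    def on_pong(self, now):
        # Slave is healthy but still disconnected past the timeout:
        # tell it to re-register instead of removing it.
        if not self.connected and now - self.disconnected_at >= REREGISTER_TIMEOUT:
            self.outbox.append(REREGISTER_REQUEST)
            self.disconnected_at = now  # restart the timer to avoid spamming

    def on_status_update(self, now):
        # Any message from a disconnected slave can trigger the request.
        if not self.connected:
            self.outbox.append(REREGISTER_REQUEST)
            self.disconnected_at = now

    def on_reregister(self):
        self.connected = True
        self.disconnected_at = None

    def drain(self):
        messages, self.outbox = self.outbox, []
        return messages


@dataclass
class Slave:
    reregistrations: int = 0

    def on_message(self, message, master):
        if message == REREGISTER_REQUEST:
            self.reregistrations += 1
            master.on_reregister()


def run(ticks=10, outage=(1, 2)):
    master, slave = Master(), Slave()
    for now in range(ticks):
        if outage[0] <= now <= outage[1]:
            if master.connected:
                master.on_link_broken(now)
            continue  # pings are lost while the link is down
        master.on_pong(now)
        for message in master.drain():
            slave.on_message(message, master)
    return master, slave


master, slave = run()
print(master.connected, slave.reregistrations)
# prints: True 1
```

In this sketch the master never removes the slave on its own, so a ZooKeeper problem that blocks re-registration cannot cause mass removal; the worst case is that the request is re-sent each time the timeout elapses.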
Issue Links
- is blocked by
  - MESOS-1811 Reconcile disconnected/deactivated semantics in the master code (Resolved)
- is related to
  - MESOS-1879 Handle a temporary one-way slave --> master socket closure. (Accepted)
- relates to
  - MESOS-1529 Handle a network partition between Master and Slave (Resolved)