Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
The Quota failover logic introduced with MESOS-3865 changes the master failover recovery significantly if at least one quota is set.
Now, if upon recovery any previously set quota has been detected, the allocator enters recovery mode, during which the allocator does not issue offers. The recovery mode — and therefore offer suspension — ends when either:
- a certain amount of agents reregisters (by default 80% of agents known before the failover),
- a timeout expires (by default 10 minutes).
We could also safely exit the recovery mode, once all quotas have been satisfied (i.e. all agents participating in satisfying quota have reconnected). For small clusters a large percentage of quota'ed resources this will not make too much difference compared to the existing rules. But for larger clusters this condition could be fulfilled much faster than the 80% condition.
We should at least consider whether such condition is worth the added complexity.
Attachments
Issue Links
- relates to
-
MESOS-4443 Refactor allocator recovery.
- Open