To persist quotas across failovers, the Master should save them in the registry. To support this, we shall:
- Introduce a Quota state variable in registry.proto;
- Extend the Operation interface so that it supports a ‘Quota’ accumulator (see src/master/registrar.hpp);
- Introduce AddQuota / RemoveQuota operations;
- Recover quotas from the registry on failover to the Master’s internal::master::Role struct;
- Extend RegistrarTest with quota-specific tests.
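As a rough illustration of the steps above, the following self-contained sketch shows how AddQuota / RemoveQuota operations could mutate a registry while maintaining a quota accumulator. It is not the actual Mesos code: `QuotaEntry`, the reduced `Registry` struct, the `std::set` accumulator, and the plain `bool` return value are all assumptions made for illustration, and the real Operation interface in src/master/registrar.hpp differs in its signature and types.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a quota record stored in the registry.
struct QuotaEntry {
  std::string role;
  double cpus;
  double memMB;
};

// Hypothetical stand-in for the Registry message, reduced to quota state.
struct Registry {
  std::map<std::string, QuotaEntry> quotas;  // Keyed by role name.
};

// Simplified Operation interface. The 'quota' accumulator is modelled as
// the set of roles that currently have a quota (an assumption).
class Operation {
public:
  virtual ~Operation() = default;

  // Applies the operation; returns true if the registry was mutated.
  bool operator()(Registry* registry, std::set<std::string>* quotaRoles) {
    assert(registry != nullptr && quotaRoles != nullptr);
    return perform(registry, quotaRoles);
  }

protected:
  virtual bool perform(
      Registry* registry, std::set<std::string>* quotaRoles) = 0;
};

class AddQuota : public Operation {
public:
  explicit AddQuota(QuotaEntry entry) : entry_(std::move(entry)) {}

protected:
  bool perform(
      Registry* registry, std::set<std::string>* quotaRoles) override {
    // Refuse to overwrite an existing quota for the same role.
    if (quotaRoles->count(entry_.role) > 0) {
      return false;
    }
    registry->quotas.emplace(entry_.role, entry_);
    quotaRoles->insert(entry_.role);
    return true;
  }

private:
  QuotaEntry entry_;
};

class RemoveQuota : public Operation {
public:
  explicit RemoveQuota(std::string role) : role_(std::move(role)) {}

protected:
  bool perform(
      Registry* registry, std::set<std::string>* quotaRoles) override {
    // Nothing to do if the role has no quota.
    if (quotaRoles->erase(role_) == 0) {
      return false;
    }
    registry->quotas.erase(role_);
    return true;
  }

private:
  std::string role_;
};

// On failover, the accumulator can be rebuilt from the recovered registry.
std::set<std::string> recoverQuotaRoles(const Registry& registry) {
  std::set<std::string> roles;
  for (const auto& item : registry.quotas) {
    roles.insert(item.first);
  }
  return roles;
}
```

Keeping the accumulator next to the registry lets an operation reject a duplicate or missing quota without scanning the whole registry, which is the same reason the existing operations carry an accumulator.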
NOTE: The registry variable can be rather large in production clusters (see MESOS-2075). While adding quota information to the registry should be fine for the MVP, we should consider storing quotas separately, since they do not need to be kept in sync with slave updates. However, the registrar does not currently support adding more variables.
While the Agents are reregistering (note they may fail to do so), the information about what part of the quota is allocated is only partially available to the Master. In other words, the state of the quota allocation is reconstructed as Agents reregister. During this period, some roles may be under quota from the perspective of the newly elected Master.
The same problem exists on the allocator side: it may think the cluster is under quota and eagerly try to satisfy quotas before enough Agents have reregistered, which may result in resources being allocated to frameworks beyond their quota. To address this issue, and to avoid panicking and generating under-quota alerts, the Master should allow a certain amount of time for the majority (e.g. 80%) of the Agents to reregister before reporting any quota status and notifying the allocator about granted quotas.
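One possible reading of this grace period is sketched below: the Master holds back quota reporting until either the chosen share of previously known Agents has reregistered or a timeout expires. The `ReregistrationGate` name, its fields, and the rule that the timeout also opens the gate are assumptions for illustration; the text above does not specify them.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper deciding when the Master may start reporting quota
// status and notifying the allocator about granted quotas after a failover.
struct ReregistrationGate {
  std::size_t expectedAgents;    // Agents recovered from the registry.
  std::size_t thresholdPercent;  // E.g. 80 for "80% of the Agents".
  std::chrono::seconds timeout;  // Upper bound on the waiting time.

  bool ready(std::size_t reregistered, std::chrono::seconds elapsed) const {
    // Assumption: once the timeout expires, stop waiting regardless of
    // how many Agents have come back.
    if (elapsed >= timeout) {
      return true;
    }
    // Integer arithmetic avoids floating point rounding at the boundary.
    // With no expected Agents the condition holds immediately.
    return reregistered * 100 >= expectedAgents * thresholdPercent;
  }
};
```

For example, with 10 expected Agents and an 80% threshold, the gate stays closed while 7 Agents have reregistered and opens at 8, or when the timeout elapses, whichever comes first.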