  1. Mesos
  2. MESOS-4315

Improve quota failover logic.


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      The Quota failover logic introduced with MESOS-3865 changes the master failover recovery significantly if at least one quota is set.

      Now, if any previously set quota is detected upon recovery, the allocator enters recovery mode, during which it does not issue offers. Recovery mode, and therefore offer suspension, ends when either:

      • a certain percentage of agents reregister (by default 80% of the agents known before the failover), or
      • a timeout expires (by default 10 minutes).
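
      The two exit conditions above can be sketched as a single predicate. This is only an illustration under stated assumptions, not the allocator's actual code: the function name `recovery_finished` and its parameters are hypothetical, and only the two defaults (80% and 10 minutes) come from the description.

```python
def recovery_finished(
    reregistered_agents: int,
    known_agents: int,
    elapsed_seconds: float,
    agent_fraction: float = 0.8,        # default from the description: 80%
    timeout_seconds: float = 10 * 60,   # default from the description: 10 minutes
) -> bool:
    """Return True once recovery mode may end and offers may resume.

    Hypothetical helper: either condition alone is enough to leave recovery.
    """
    enough_agents = reregistered_agents >= agent_fraction * known_agents
    timed_out = elapsed_seconds >= timeout_seconds
    return enough_agents or timed_out
```

      With 100 agents known before the failover, this sketch keeps offers suspended until the 80th agent reregisters or 600 seconds have passed, whichever happens first.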

      We could also safely exit recovery mode once all quotas have been satisfied (i.e. all agents participating in satisfying quota have reconnected). For small clusters with a large percentage of quota'ed resources, this will not make much difference compared to the existing rules. But for larger clusters, this condition could be fulfilled much faster than the 80% condition.
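
      One possible reading of the proposed condition is sketched below. It assumes a deliberately simplified model in which quotas and agent resources are plain name-to-amount mappings and a quota counts as satisfied once the reregistered agents together hold at least the sum of all guarantees. The function `quotas_satisfiable` and this data layout are hypothetical, and the sketch ignores which agents actually held quota'ed allocations before the failover.

```python
from collections import Counter


def quotas_satisfiable(quotas, reregistered_agents):
    """Check whether reregistered agents cover every quota guarantee.

    quotas: mapping of role -> {resource name: guaranteed amount}
    reregistered_agents: iterable of {resource name: amount} per agent
    (Both shapes are assumptions made for this sketch.)
    """
    # Sum the guarantees of all roles per resource.
    required = Counter()
    for guarantee in quotas.values():
        required.update(guarantee)

    # Sum the resources of every agent that has reregistered so far.
    available = Counter()
    for resources in reregistered_agents:
        available.update(resources)

    # A resource missing from `available` counts as 0.
    return all(available[name] >= amount for name, amount in required.items())
```

      Under this model the check would run each time an agent reregisters, alongside the existing percentage and timeout rules, so recovery mode could end as soon as any one of the three holds.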

      We should at least consider whether such a condition is worth the added complexity.


            People

              Assignee: Unassigned
              Reporter: Jörg Schad (js84)