Mesos / MESOS-7639

Oversubscription could crash the master due to CHECK failure in the allocator

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: allocation
    • Labels:
      None

      Description

      As I described in MESOS-7566, the following scenario is possible when the agent sends updated oversubscribed resources to the master:

      • The agent's UpdateSlaveMessage reduces the oversubscribed resources.
      • Upon receiving the update, Master::updateSlave first calls HierarchicalAllocatorProcess::updateSlave and then allocator->recoverResources.
      • HierarchicalAllocatorProcess::updateSlave updates roleSorter.total_ to reduce the total, so the total could drop below the allocation.
      • In the subsequent allocator->recoverResources call, removing the outstanding allocation may fail to bring it below the total, because some of the allocation may not be in outstanding offers: it could be in offered resources pending between Master::accept and Master::_accept. The end result could still be total < allocation.
      • When Master::_accept is then executed, it calls allocator->updateAllocation, where the total < allocation condition could trigger the CHECK failure and crash the master.

      The gist is that there are resources that are neither in the master's offers nor tracked in the allocator when Master::updateSlave is called.
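
The sequence above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the actual Mesos code: whole CPU counts stand in for Resources, and ToyAllocator, Outcome, and simulateShrink are hypothetical names invented for this illustration. It shows how the allocation can stay above the reduced total when part of it is in flight between Master::accept and Master::_accept.

```cpp
#include <cassert>

// Toy model of one agent's accounting. Whole CPU counts stand in for
// Mesos Resources; every name in this block is hypothetical.
struct ToyAllocator {
  int total = 0;      // What the sorter believes the agent offers in total.
  int allocated = 0;  // What the allocator believes is currently handed out.

  // Models the effect of an oversubscription update that shrinks the total.
  void updateTotal(int newTotal) { total = newTotal; }

  // Models recovering resources from rescinded outstanding offers.
  void recover(int amount) {
    assert(amount <= allocated);
    allocated -= amount;
  }

  // The condition that, per the report, the allocator later checks.
  bool invariantHolds() const { return allocated <= total; }
};

struct Outcome {
  int total;
  int allocated;
  bool invariantHolds;
};

// `outstanding` is the part of the allocation still in outstanding offers;
// `inFlight` is the part already accepted by a framework but not yet
// processed, so it cannot be recovered by rescinding offers.
Outcome simulateShrink(int total, int outstanding, int inFlight, int newTotal) {
  ToyAllocator allocator;
  allocator.total = total;
  allocator.allocated = outstanding + inFlight;

  allocator.updateTotal(newTotal);  // Step 1: the total shrinks first.
  allocator.recover(outstanding);   // Step 2: only outstanding offers recover.

  return {allocator.total, allocator.allocated, allocator.invariantHolds()};
}
```

With these assumptions, simulateShrink(10, 3, 7, 5) leaves 7 CPUs allocated against a total of 5, so the invariant no longer holds. When nothing is in flight, as in simulateShrink(10, 10, 0, 5), the recovery step brings the allocation back under the total.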

              People

              • Assignee:
                Unassigned
                Reporter:
                xujyan Yan Xu