Details
Bug
Status: Resolved
Major
Resolution: Fixed
None
Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3, Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1, Mesosphere Q1 Sprint 9 - 5/15
2
Description
Currently the DRFSorter aggregates total and allocated resources across multiple slaves, which only works for scalar resources. We need to distinguish resources from different slaves.
Suppose we have 2 slaves and 1 framework. The framework is allocated all resources from both slaves.
Resources slaveResources = Resources::parse("cpus:2;mem:512;ports:[31000-32000]").get();

DRFSorter sorter;
sorter.add(slaveResources); // Add slave1 resources
sorter.add(slaveResources); // Add slave2 resources

// Total resources in sorter at this point is
// cpus(*):4; mem(*):1024; ports(*):[31000-32000].
// The scalar resources get aggregated correctly but ports do not.
sorter.add("F");

// The 2 calls to allocated only works because we simply do:
//   allocation[name] += resources;
// without checking that the 'resources' is available in the total.
sorter.allocated("F", slaveResources);
sorter.allocated("F", slaveResources);

// At this point, sorter.allocation("F") is:
// cpus(*):4; mem(*):1024; ports(*):[31000-32000].
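The aggregation problem can be reproduced without Mesos in a toy model. The sketch below is only an illustration: `ToyResources`, `makeSlaveResources`, and `aggregate` are hypothetical names, and ports are modeled as a plain set of integers on the assumption that adding range resources behaves like a set union. Under that assumption, scalars from two identical slaves double while the port set stays the same size.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for mesos::Resources (hypothetical, not the real class):
// scalars are summed, ports are kept as a set of individual port numbers.
struct ToyResources {
  std::map<std::string, double> scalars;
  std::set<int> ports;

  // Mimics 'total += resources'. Scalars add up, but a set union cannot
  // represent the same port number existing on two different slaves.
  ToyResources& operator+=(const ToyResources& that) {
    for (const auto& entry : that.scalars) {
      scalars[entry.first] += entry.second;
    }
    ports.insert(that.ports.begin(), that.ports.end());
    return *this;
  }
};

// Builds the per-slave resources used in the description:
// cpus:2; mem:512; ports:[31000-32000].
ToyResources makeSlaveResources() {
  ToyResources resources;
  resources.scalars["cpus"] = 2;
  resources.scalars["mem"] = 512;
  for (int port = 31000; port <= 32000; ++port) {
    resources.ports.insert(port);
  }
  return resources;
}

// Aggregates 'count' identical slaves into one total, the way a sorter
// that does not distinguish slaves would.
ToyResources aggregate(int count) {
  assert(count >= 0);
  ToyResources total;
  for (int i = 0; i < count; ++i) {
    total += makeSlaveResources();
  }
  return total;
}
```

Calling `aggregate(2)` yields 4 cpus and 1024 mem, but still only the 1001 ports of a single slave, which is the information loss the description refers to.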
To provide some context, this issue came up while trying to reserve all unreserved resources from every offer.
for (const Offer& offer : offers) {
  Resources unreserved = offer.resources().unreserved();
  Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK);

  Offer::Operation reserve;
  reserve.set_type(Offer::Operation::RESERVE);
  reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved);

  driver->acceptOffers({offer.id()}, {reserve});
}
Suppose the slave resources are the same as above:
Slave1: cpus(*):2; mem(*):512; ports(*):[31000-32000]
Slave2: cpus(*):2; mem(*):512; ports(*):[31000-32000]
The initial (incorrect) total resources in the DRFSorter are:
cpus(*):4; mem(*):1024; ports(*):[31000-32000]
We receive 2 offers, 1 from each slave:
Offer1: cpus(*):2; mem(*):512; ports(*):[31000-32000]
Offer2: cpus(*):2; mem(*):512; ports(*):[31000-32000]
At this point, the resources allocated to the framework are:
cpus(*):4; mem(*):1024; ports(*):[31000-32000]
After the first RESERVE operation with Offer1, the allocated resources for the framework become:
cpus(*):2; mem(*):512; cpus(role):2; mem(role):512; ports(role):[31000-32000]
During the second RESERVE operation with Offer2:
// ...
FrameworkSorter* frameworkSorter = frameworkSorters[frameworks[frameworkId].role];
Resources allocation = frameworkSorter->allocation(frameworkId.value());

// Update the allocated resources.
Try<Resources> updatedAllocation = allocation.apply(operations);
CHECK_SOME(updatedAllocation);
// ...
allocation in the above code is:
cpus(*):2; mem(*):512; cpus(role):2; mem(role):512; ports(role):[31000-32000]
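We try to apply the second RESERVE operation, but the allocation no longer contains ports(*):[31000-32000] because the first RESERVE already converted them to ports(role). The apply fails, which trips the CHECK at CHECK_SOME(updatedAllocation).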
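The walkthrough above can be sketched in a self-contained toy model. Everything here is hypothetical and is not the Mesos API: `ToyAllocation`, `ToyReserve`, `applyReserve`, and `PerSlaveAllocation` are illustrative names, only cpus and ports are modeled, and a `bool` return stands in for the `Try<Resources>` result. The last overload shows one possible direction, assuming allocations are tracked per slave, and is not necessarily the fix that was adopted.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical helper: expands [first, last] into individual port numbers.
std::set<int> portRange(int first, int last) {
  assert(first <= last);
  std::set<int> ports;
  for (int port = first; port <= last; ++port) {
    ports.insert(port);
  }
  return ports;
}

// Toy stand-in for a framework's allocation (not mesos::Resources).
struct ToyAllocation {
  double unreservedCpus = 0;
  double reservedCpus = 0;
  std::set<int> unreservedPorts;
  std::set<int> reservedPorts;
};

// Toy stand-in for a RESERVE operation built from one offer.
struct ToyReserve {
  double cpus;
  std::set<int> ports;
};

// Moves resources from unreserved to reserved. Returns false and leaves
// 'allocation' untouched if the unreserved resources are not all present,
// which corresponds to the failing CHECK_SOME in the description.
bool applyReserve(ToyAllocation& allocation, const ToyReserve& op) {
  const bool hasPorts = std::includes(
      allocation.unreservedPorts.begin(), allocation.unreservedPorts.end(),
      op.ports.begin(), op.ports.end());

  if (allocation.unreservedCpus < op.cpus || !hasPorts) {
    return false;
  }

  allocation.unreservedCpus -= op.cpus;
  allocation.reservedCpus += op.cpus;
  for (int port : op.ports) {
    allocation.unreservedPorts.erase(port);
    allocation.reservedPorts.insert(port);
  }
  return true;
}

// Assumed alternative: one allocation per slave, keyed by a slave ID string.
typedef std::map<std::string, ToyAllocation> PerSlaveAllocation;

// Applies the operation only to the named slave's allocation. Returns false
// for an unknown slave or when that slave lacks the unreserved resources.
bool applyReserve(
    PerSlaveAllocation& allocations,
    const std::string& slaveId,
    const ToyReserve& op) {
  auto it = allocations.find(slaveId);
  return it != allocations.end() && applyReserve(it->second, op);
}
```

With a single aggregated `ToyAllocation` holding 4 cpus and one copy of the port range, the first `applyReserve` call succeeds and the second returns false. With a `PerSlaveAllocation` holding 2 cpus and the port range under each of two slave keys, both calls succeed because each slave still has its own unreserved ports.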
Issue Links
relates to
MESOS-2891 Performance regression in hierarchical allocator. (Resolved)
MESOS-2623 Report correct states of (used|offered)Resources of master::Framework for state.json (Accepted)