Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.0-beta1
-
None
-
Reviewed
Description
In FSAppAttempt#canContainerBePreempted, we make sure that preempting the given container would not put the app below its fair share:
// Check if the app's allocation will be over its fairshare even // after preempting this container Resource usageAfterPreemption = Resources.clone(getResourceUsage()); // Subtract resources of containers already queued for preemption synchronized (preemptionVariablesLock) { Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); } // Subtract this container's allocation to compute usage after preemption Resources.subtractFrom( usageAfterPreemption, container.getAllocatedResource()); return !isUsageBelowShare(usageAfterPreemption, getFairShare());
However, this only considers one container in isolation, and fails to consider containers for the same app that we already added to preemptableContainers in FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a case where we preempt multiple containers from the same app, none of which by itself puts the app below fair share, but which cumulatively do so.
I've attached a patch with a test to show this behavior. The flow is:
1. Initially greedyApp runs in root.preemptable.child-1 and is allocated all the resources (8g and 8vcores)
2. Then starvingApp runs in root.preemptable.child-2 and requests 2 containers, each of which is 3g and 3vcores in size. At this point both greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
3. For the first container requested by starvedApp, we (correctly) preempt 3 containers from greedyApp, each of which is 1g and 1vcore.
4. For the second container requested by starvedApp, we again (this time incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below its fair share, but happens anyway because all six times that we call return !isUsageBelowShare(usageAfterPreemption, getFairShare());, the value of usageAfterPreemption is 7g and 7vcores (confirmed using debugger).
So in addition to accounting for resourcesToBePreempted, we also need to account for containers that we're already planning on preempting in FSPreemptionThread#identifyContainersToPreemptOnNode.