[YARN-569] CapacityScheduler: support for preemption (using a capacity monitor) - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1.0-beta
Component/s: capacityscheduler
Labels: None
Target Version/s: 2.1.0-beta
Hadoop Flags: Reviewed

Description

There is a tension between the fast-paced, reactive role of the CapacityScheduler, which needs to respond quickly to
applications' resource requests and node updates, and the more introspective, time-based considerations
needed to observe and correct capacity imbalance. For this reason, instead of hacking the delicate
mechanisms of the CapacityScheduler directly, we opted to add support for preemption by means of a "Capacity Monitor",
which can optionally be run as a separate service (much like the NMLivelinessMonitor).

The capacity monitor (similarly to the equivalent functionality in the FairScheduler) runs at fixed intervals
(e.g., every 3 seconds), observes how the CapacityScheduler has assigned resources to queues,
performs an off-line computation to determine whether preemption is needed and how best to "edit" the current schedule to
improve capacity balance, and generates events that produce four possible actions:

Container de-reservations
Resource-based preemptions
Container-based preemptions
Container killing

The actions listed above are progressively more costly, and it is up to the policy to use them as desired to achieve the rebalancing goals.
Note that, due to the "lag" in the effect of these actions, the policy should operate at the macroscopic level (e.g., preempt tens of containers
from a queue) and should not try to tightly and consistently micromanage container allocations.
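
The separation between the monitor and the policy described above can be sketched roughly as follows. This is only an illustration, not code from the patch: the names `Action`, `CapacityMonitor`, `run_once`, and the `policy`/`dispatch` callables are hypothetical, and the 3-second default simply mirrors the example interval mentioned above.

```python
import threading
from enum import IntEnum


class Action(IntEnum):
    """The four schedule edits, numbered from cheapest to most costly."""
    DERESERVE_CONTAINER = 1
    PREEMPT_RESOURCE = 2
    PREEMPT_CONTAINER = 3
    KILL_CONTAINER = 4


class CapacityMonitor:
    """Runs a pluggable policy at a fixed interval (hypothetical API)."""

    def __init__(self, policy, dispatch, interval_s=3.0):
        self.policy = policy        # callable: () -> list of (Action, target)
        self.dispatch = dispatch    # callable: (Action, target) -> None
        self.interval_s = interval_s
        self._stop = threading.Event()

    def run_once(self):
        # The policy decides which edits to make; the monitor only
        # forwards them as events, in the order the policy chose.
        events = list(self.policy())
        for action, target in events:
            self.dispatch(action, target)
        return len(events)

    def run_forever(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval_s):
            self.run_once()

    def stop(self):
        self._stop.set()
```

Keeping `run_once` separate from the timing loop makes a single policy round easy to exercise in isolation, which fits the idea of the monitor being an optional service next to the scheduler.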

------------- Preemption policy (ProportionalCapacityPreemptionPolicy): -------------

Preemption policies are pluggable by design. In the following we present an initial policy (ProportionalCapacityPreemptionPolicy) that we have been experimenting with. It behaves as follows:

1. It gathers from the scheduler the state of the queues, in particular their current capacity, guaranteed capacity, and pending requests.
2. If there are pending requests from queues that are under capacity, it computes a new ideal balanced state (**).
3. It computes the set of preemptions needed to repair the current schedule and achieve capacity balance (accounting for natural completion rates, and respecting bounds on the amount of preemption we allow for each round).
4. It selects which applications to preempt from each over-capacity queue (the last one in FIFO order).
5. It removes reservations from the most recently assigned app until the amount of resource to reclaim is obtained, or until no more reservations exist.
6. (If not enough) it issues preemptions for containers from the same application (in reverse chronological order, last assigned container first), again until enough has been reclaimed or until no containers except the AM container are left.
7. (If not enough) it moves on to unreserve and preempt from the next application.
8. Containers that have been asked to preempt are tracked across executions. If a container is among those to be preempted for more than a certain time, it is moved to the list of containers to be forcibly killed.
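
The per-queue selection order described above (last application in FIFO order first, reservations before running containers, most recently assigned container first, never the AM container) might look like the sketch below. It is a simplified illustration under stated assumptions, not the patch's code: `Container`, `App`, and `select_for_queue` are hypothetical names, and resources are modeled as a single integer rather than a multi-dimensional resource.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    cid: str
    size: int             # resource amount; a single scalar is an assumption
    is_am: bool = False   # True for the ApplicationMaster container


@dataclass
class App:
    name: str
    reservations: list = field(default_factory=list)  # oldest first
    containers: list = field(default_factory=list)    # oldest first


def select_for_queue(apps_fifo, to_reclaim):
    """Return (dereserve, preempt) container ids for one over-capacity queue."""
    dereserve, preempt = [], []
    for app in reversed(apps_fifo):             # last app in FIFO order first
        for c in reversed(app.reservations):    # drop reservations first
            if to_reclaim <= 0:
                return dereserve, preempt
            dereserve.append(c.cid)
            to_reclaim -= c.size
        for c in reversed(app.containers):      # last assigned container first
            if to_reclaim <= 0:
                return dereserve, preempt
            if c.is_am:
                continue                        # the AM container is never picked
            preempt.append(c.cid)
            to_reclaim -= c.size
    return dereserve, preempt
```

The function stops as soon as the requested amount is covered, so a small `to_reclaim` may be satisfied by de-reservations alone, which is the cheapest of the four actions.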

Notes:
At the moment, in order to avoid double-counting of requests, we only look at the "ANY" part of pending resource requests, which means we might not preempt on behalf of AMs that ask only for specific locations but not ANY.
(**) The ideal balanced state is one in which each queue has at least its guaranteed capacity, and the spare capacity is distributed among the queues that want some as a weighted fair share, where the weighting is based on the guaranteed capacity of each queue and the function runs to a fixed point.
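
One way to read the (**) note is as an iterative weighted water-filling computation. The sketch below is an interpretation, not the patch's algorithm: `ideal_allocation` is a hypothetical name, resources are plain numbers, `demand` is assumed to mean current usage plus pending requests, and a queue is assumed to claim its guarantee only up to its demand, so that unclaimed guarantee becomes spare capacity.

```python
def ideal_allocation(total, guaranteed, demand, eps=1e-9):
    """Distribute `total` among queues; spare goes out weighted by guarantee.

    guaranteed, demand: dicts keyed by queue name (demand = used + pending,
    which is an assumption of this sketch).
    """
    # Each queue first receives its guarantee, capped by what it can use.
    ideal = {q: min(guaranteed[q], demand[q]) for q in guaranteed}
    spare = total - sum(ideal.values())
    while spare > eps:
        # Only queues that still want more take part in this round.
        wanting = [q for q in ideal if demand[q] - ideal[q] > eps]
        weight = sum(guaranteed[q] for q in wanting)
        if not wanting or weight <= 0:
            break
        handed_out = 0.0
        for q in wanting:
            offer = spare * guaranteed[q] / weight      # weighted fair share
            grant = min(offer, demand[q] - ideal[q])    # never exceed demand
            ideal[q] += grant
            handed_out += grant
        # Whatever satisfied queues could not absorb is redistributed in the
        # next round; the loop ends at the fixed point.
        spare -= handed_out
    return ideal
```

For instance, with a total of 100, guarantees of 50/30/20 for queues A/B/C and demands of 20/100/25, A keeps 20, C is capped at 25 in the first round, and the remainder flows to B in the second round, giving B 55.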

Tunables of the ProportionalCapacityPreemptionPolicy:

observe-only mode (i.e., log the actions it would take, but behave as read-only)
how frequently to run the policy
how long to wait between preemption and kill of a container
what fraction of the resources we would like to reclaim should be preempted in each round (this accounts for the natural rate at which containers are returned)
deadzone size, i.e., what % of over-capacity to ignore (if we are off perfect balance by some small %, we take no action)
overall amount of preemption we can afford for each run of the policy (in terms of total cluster capacity)
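
To illustrate how three of these tunables could shape a single round, here is a rough sketch. The field names in `Tunables`, the default values, and the function `preemption_targets` are all hypothetical and are not the configuration keys or code from the patch; the deadzone is assumed to be measured relative to each queue's ideal share, and the per-round cap is assumed to scale all requests down proportionally.

```python
from dataclasses import dataclass


@dataclass
class Tunables:
    # All names and defaults here are illustrative placeholders.
    observe_only: bool = False        # log actions only, change nothing
    interval_s: float = 3.0           # how often the policy runs
    wait_before_kill_s: float = 15.0  # grace period between preempt and kill
    reclaim_fraction: float = 0.2     # share of the imbalance to preempt per round
    deadzone: float = 0.1             # over-capacity (as a fraction) to ignore
    max_per_round: float = 0.1        # cap, as a fraction of total cluster capacity


def preemption_targets(total, current, ideal, t):
    """Return {queue: amount to preempt this round} (only three tunables used)."""
    wanted = {}
    for q in current:
        over = current[q] - ideal[q]
        # Ignore queues that are within the deadzone around their ideal share.
        if over > 0 and current[q] > ideal[q] * (1 + t.deadzone):
            # Ask for only a fraction; natural completions cover the rest.
            wanted[q] = over * t.reclaim_fraction
    budget = total * t.max_per_round
    asked = sum(wanted.values())
    if asked > budget:
        # Scale every request down so the round stays within its budget.
        scale = budget / asked
        wanted = {q: w * scale for q, w in wanted.items()}
    return wanted
```

With a small `reclaim_fraction` the imbalance is corrected over several rounds rather than all at once, which matches the recommendation to act at the macroscopic level and let natural container completions do part of the work.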

In our current experiments this set of tunables seems to be a good start for shaping the preemption action properly. More sophisticated preemption policies could take into account the different types of applications running, job priorities, the cost of preemption, and the integral of capacity imbalance. This is very much a control-theory kind of problem, and some of the lessons on designing and tuning controllers are likely to apply.

Generality:
The monitor-based schedule edits and the preemption mechanisms introduced here are designed to be more general than enforcing capacity/fairness. In fact, we are considering other monitors that leverage the same idea of "schedule edits" to target different global properties (e.g., allocating enough resources to guarantee deadlines for important jobs, data-locality optimizations, IO-balancing among nodes, etc.).

Note that by default the preemption policy we describe is disabled in the patch.

Depends on YARN-45 and YARN-567; is related to YARN-568.

Attachments

YARN-569.patch                    23/Apr/13 05:37   78 kB   Carlo Curino
YARN-569.patch                    27/Apr/13 02:13   77 kB   Carlo Curino
YARN-569.9.patch                  19/Jun/13 20:47   96 kB   Christopher Douglas
YARN-569.8.patch                  19/Jun/13 01:53   94 kB   Christopher Douglas
YARN-569.6.patch                  13/Jun/13 04:22   92 kB   Christopher Douglas
YARN-569.5.patch                  13/Jun/13 02:11   92 kB   Christopher Douglas
YARN-569.4.patch                  04/Jun/13 06:29   92 kB   Christopher Douglas
YARN-569.3.patch                  31/May/13 23:40   92 kB   Christopher Douglas
YARN-569.2.patch                  18/May/13 00:16   92 kB   Carlo Curino
YARN-569.11.patch                 11/Jul/13 00:27   98 kB   Christopher Douglas
YARN-569.10.patch                 24/Jun/13 23:07   97 kB   Christopher Douglas
YARN-569.1.patch                  09/May/13 00:42   79 kB   Carlo Curino
preemption.2.patch                04/May/13 18:36   51 kB   Bikas Saha
CapScheduler_with_preemption.pdf  11/Apr/13 13:48   108 kB  Carlo Curino
3queues.pdf                       11/Apr/13 13:48   123 kB  Carlo Curino

Issue Links

breaks: YARN-1398 Deadlock in capacity scheduler leaf queue and parent queue for getQueueInfo and completedContainer call (Closed)
duplicates: MAPREDUCE-533 Support task preemption in Capacity Scheduler (Resolved)
is blocked by: YARN-567 RM changes to support preemption for FairScheduler and CapacityScheduler (Closed)
is contained by: MAPREDUCE-4584 Umbrella: Preemption and restart of MapReduce tasks (Open)
relates to: MAPREDUCE-5189 Basic AM changes to support preemption requests (per YARN-45) (Resolved)
relates to: MAPREDUCE-5176 Preemptable annotations (to support preemption in MR) (Closed)
relates to: YARN-568 FairScheduler: support for work-preserving preemption (Closed)

