SPARK-25889: Dynamic allocation load-aware ramp up


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core, YARN

    Description

      The time-based exponential ramp-up behavior for dynamic allocation is naive and destructive, making it very difficult to run very large jobs.

      On a large (64,000-core) YARN cluster with a high number of input partitions (200,000+), the default dynamic allocation approach of requesting containers in waves, doubling exponentially once per second, results in roughly 50% of the entire cluster being requested in the final one-second wave. This is inherent to doubling: each wave is as large as all previous waves combined.

      A wave that size can easily overwhelm RPC processing, or cause expensive executor startup steps to break the systems they depend on. And with the interval so short, many more containers may be requested than are actually needed; those executors then complete very little work before sitting idle, waiting to be deallocated.

      Lengthening the interval between these fixed doublings has only limited impact: setting the doubling interval to once per minute would make the ramp-up very slow, and at the end of it we would still face large, potentially crippling waves of executor startup.
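
      To make the wave arithmetic concrete, here is a minimal Scala sketch of the doubling behavior. It loosely mirrors the numExecutorsToAdd doubling in ExecutorAllocationManager, but it is an illustration only, not the actual allocation code, and it ignores the cap at the number of executors actually needed:

        // Each interval the allocator doubles its request wave. Note how
        // the final wave alone matches all earlier waves combined.
        object ExponentialRampDemo {
          def main(args: Array[String]): Unit = {
            val target = 50000  // executors needed to cover the backlog
            var toAdd = 1       // doubles once per interval (1s by default)
            var requested = 0
            var wave = 0
            while (requested < target) {
              wave += 1
              requested += toAdd
              println(f"wave $wave%2d: +$toAdd%6d (total: $requested%6d)")
              toAdd *= 2
            }
          }
        }

      For a 50,000-executor backlog this prints 16 waves, the last of which requests 32,768 executors in a single second, half of everything requested so far.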

      An alternative approach to spooling up large jobs appears to be needed: one that remains relatively simple but is more adaptable to different cluster sizes and to varying cluster and job performance.

      I would like to propose a few different approaches built around the same general idea: controlling outstanding requests for new containers based on the number of executors that are currently running, for some definition of "running".

      One example might be to limit requests to one new executor for every existing executor that currently has an active task, or to some ratio of that, to allow for more or less aggressive spool-up. A lower ratio would let us approximate something like Fibonacci ramp-up; a higher ratio of, say, 2x would spool up quickly while staying aligned with the rate at which broadcast blocks can be distributed to new members.
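
      As a sketch of how such a ratio-based cap might be computed (a hypothetical helper; loadAwareHeadroom, rampRatio and the parameter names are mine, not existing Spark API):

        // Cap in-flight new-executor requests at rampRatio times the number
        // of executors that currently hold an active task. rampRatio = 1.0
        // approximates Fibonacci-like growth; 2.0 spools up roughly twice
        // as fast while staying proportional to live, working capacity.
        def loadAwareHeadroom(
            executorsWithActiveTasks: Int,
            pendingRequests: Int,
            backlogDemand: Int,
            rampRatio: Double = 1.0): Int = {
          // Floor of 1 so a job can still ramp from zero running executors.
          val maxInFlight = math.max(1, (executorsWithActiveTasks * rampRatio).toInt)
          val headroom = math.max(0, maxInFlight - pendingRequests)
          math.min(headroom, backlogDemand)
        }

      For example, with 100 executors running tasks, 60 requests already outstanding and rampRatio = 1.0, this allows only 40 further requests no matter how large the backlog is.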

      An alternative would be to limit the escalation rate of new executor requests based on the number of outstanding executors that have been requested but have not yet fully completed startup and become available for tasks. To protect against a potentially suboptimal very early ramp, a minimum concurrent executor startup threshold might allow an initial burst of, say, 10 executors, after which the more gradual ramp math would apply.
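
      And a rough sketch of that startup-aware throttle (again hypothetical, not existing Spark code), with minBurst standing in for the minimum concurrent executor startup threshold:

        // Throttle escalation by the number of executors still starting up.
        // minBurst allows an initial burst (say 10) before the more gradual
        // ramp math applies, so the very first wave is not starved.
        def startupAwareHeadroom(
            startingUp: Int,       // requested or registered but not yet ready
            readyExecutors: Int,   // executors currently available for tasks
            backlogDemand: Int,
            minBurst: Int = 10,
            startupRatio: Double = 0.5): Int = {
          val allowedStartups = math.max(minBurst, (readyExecutors * startupRatio).toInt)
          val headroom = math.max(0, allowedStartups - startingUp)
          math.min(headroom, backlogDemand)
        }

      Either variant ties the request rate to observed progress rather than to wall-clock time, which is the common thread of this proposal.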

          People

            Assignee: Unassigned
            Reporter: Adam Kennedy (adamkennedy77)
            Shepherd: DB Tsai
            Votes: 0
            Watchers: 8
