[SPARK-24615] SPIP: Accelerator-aware task scheduling for Spark - ASF JIRA

Details

Type: Epic
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: Spark Core
Labels:
- Hydrogen
- SPIP

Epic Name:
GPU-aware Scheduling

Description

(The JIRA received a major update on 2019/02/28. Some comments were based on an earlier version. Please ignore them. New comments start at comment-16778026.)

Background and Motivation

GPUs and other accelerators are widely used to accelerate specialized workloads, e.g., deep learning and signal processing. While users from the AI community use GPUs heavily, they often need Apache Spark to load and process large datasets and to handle complex data scenarios like streaming. YARN and Kubernetes already support GPUs in their recent releases. Although Spark supports those two cluster managers, Spark itself is not aware of the GPUs they expose, and hence it cannot properly request GPUs or schedule them for users. This leaves a critical gap in unifying big data and AI workloads and making life simpler for end users.

To make Spark aware of GPUs, we need two major changes at a high level:

- At the cluster manager level, we update or upgrade cluster managers to include GPU support, then expose user interfaces for Spark to request GPUs from them.
- Within Spark, we update the scheduler to understand the GPUs allocated to executors and the GPU requests of user tasks, and to assign GPUs to tasks properly.
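
To make the second change concrete, here is a minimal sketch of the bookkeeping involved. It is a toy model for discussion, not Spark's scheduler: the names `Executor`, `TaskRequest`, and `schedule` are hypothetical, and the greedy first-fit policy is an assumption chosen for brevity. Each executor holds a pool of whole GPU addresses, and a task is placed only where enough free cores and GPUs remain.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical model of an executor that owns whole GPU cards."""
    executor_id: str
    free_gpus: list = field(default_factory=list)  # GPU addresses, e.g. ["0", "1"]
    free_cores: int = 0


@dataclass
class TaskRequest:
    """Hypothetical per-task requirement: whole cores and whole GPUs."""
    task_id: int
    cores: int = 1
    gpus: int = 1


def schedule(tasks, executors):
    """Greedy first-fit placement (an assumption, not Spark's policy).

    Returns {task_id: (executor_id, [gpu addresses])}. Tasks that fit
    nowhere are left out of the result and would wait for resources.
    """
    placements = {}
    for task in tasks:
        for ex in executors:
            if ex.free_cores >= task.cores and len(ex.free_gpus) >= task.gpus:
                # Hand out whole GPU addresses; a card is never split.
                assigned = ex.free_gpus[:task.gpus]
                ex.free_gpus = ex.free_gpus[task.gpus:]
                ex.free_cores -= task.cores
                placements[task.task_id] = (ex.executor_id, assigned)
                break
    return placements


executors = [
    Executor("exec-1", free_gpus=["0", "1"], free_cores=4),
    Executor("exec-2", free_gpus=["0"], free_cores=2),
]
tasks = [TaskRequest(task_id=i, cores=1, gpus=1) for i in range(4)]
placements = schedule(tasks, executors)
```

With three GPUs in the cluster and four single-GPU tasks, three tasks are placed and the fourth stays pending, even though free cores remain on both executors. The scheduler has to track GPUs as a resource alongside cores for this to work.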

Based on the work done in YARN and Kubernetes to support GPUs and on some offline prototypes, we could have the necessary features implemented in the next major release of Spark. You can find a detailed scoping doc here, where we listed user stories and their priorities.

Goals

- Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
- No regression on scheduler performance for normal jobs.

Non-goals

- Fine-grained scheduling within one GPU card.
  - We treat one GPU card and its memory together as a non-divisible unit.
- Support TPU.
- Support Mesos.
- Support Windows.

Target Personas

- Admins who need to configure clusters to run Spark with GPU nodes.
- Data scientists who need to build DL applications on Spark.
- Developers who need to integrate DL features on Spark.

Attachments

- SPIP_ Accelerator-aware scheduling.pdf (26/Feb/19 22:45, 70 kB, Xiangrui Meng)
- Accelerator-aware scheduling in Apache Spark 3.0.pdf (26/Feb/19 22:45, 128 kB, Xiangrui Meng)

Issue Links

is related to

- SPARK-29780 The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads (Open)
- SPARK-30446 Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases (Resolved)
- SPARK-30448 accelerator aware scheduling enforce cores as limiting resource (Resolved)
- SPARK-27363 Mesos support for GPU-aware scheduling (Open)
- SPARK-27372 Standalone executor process-level isolation to support GPU scheduling (Open)
- SPARK-29762 GPU Scheduling - default task resource amount to 1 (Open)
- SPARK-27005 Design sketch for SPIP discussion: Accelerator-aware scheduling (Resolved)
- SPARK-27365 Spark Jenkins supports testing GPU-aware scheduling features (Resolved)
- SPARK-29151 Support fraction resources for task resource scheduling (Resolved)

relates to

- SPARK-27024 Executor interface for cluster managers to support GPU resources (Resolved)

links to

Product Doc

SPIP


People

Assignee: Thomas Graves
Reporter: Saisai Shao
Shepherd: Xiangrui Meng
Votes: 13
Watchers: 74

Dates

Created: 21/Jun/18 08:29
Updated: 12/Dec/22 18:10
Resolved: 29/Jun/20 14:16