Details
- Epic
- Status: Resolved
- Major
- Resolution: Fixed
- 2.4.0
- GPU-aware Scheduling
Description
(The JIRA received a major update on 2019/02/28. Some comments were based on an earlier version. Please ignore them. New comments start at comment-16778026.)
Background and Motivation
GPUs and other accelerators are widely used to accelerate special workloads, e.g., deep learning and signal processing. While users from the AI community use GPUs heavily, they often need Apache Spark to load and process large datasets and to handle complex data scenarios like streaming. YARN and Kubernetes already support GPUs in their recent releases. Although Spark supports those two cluster managers, Spark itself is not aware of the GPUs they expose, so it cannot properly request GPUs and schedule them for users. This leaves a critical gap in unifying big data and AI workloads and in making life simpler for end users.
To make Spark aware of GPUs, we need two major changes at a high level:
- At the cluster manager level, we update or upgrade cluster managers to include GPU support. Then we expose user interfaces for Spark to request GPUs from them.
- Within Spark, we update its scheduler to understand the GPUs allocated to executors and the GPU requests of user tasks, and to assign GPUs to tasks properly.
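As a rough illustration of the second change, the sketch below models a scheduler that knows which GPU addresses each executor holds and hands out whole GPUs to tasks. It is a toy model, not Spark's scheduler: `Executor`, `assign_gpus`, and `release_gpus` are hypothetical names invented for this example, and the first-fit policy is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical executor model: an id plus the GPU addresses it still has free."""
    executor_id: str
    free_gpus: list = field(default_factory=list)


def assign_gpus(executors, gpus_per_task):
    """Give one task `gpus_per_task` whole GPUs from the first executor that has enough.

    Returns (executor_id, addresses), or None if no executor can satisfy the request.
    """
    for ex in executors:
        if len(ex.free_gpus) >= gpus_per_task:
            taken = ex.free_gpus[:gpus_per_task]
            ex.free_gpus = ex.free_gpus[gpus_per_task:]
            return ex.executor_id, taken
    return None


def release_gpus(executors, executor_id, addresses):
    """Return GPU addresses to an executor once its task has finished."""
    for ex in executors:
        if ex.executor_id == executor_id:
            ex.free_gpus.extend(addresses)
            return
    raise KeyError(executor_id)
```

A task that cannot be placed simply gets `None` here; a real scheduler would keep it pending until GPUs are released.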
Based on the work done in YARN and Kubernetes to support GPUs and on some offline prototypes, the necessary features could be implemented in the next major release of Spark. You can find a detailed scoping doc here, which lists user stories and their priorities.
Goals
- Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
- No regression on scheduler performance for normal jobs.
Non-goals
- Fine-grained scheduling within one GPU card. We treat one GPU card and its memory together as a non-divisible unit.
- Support TPU.
- Support Mesos.
- Support Windows.
Target Personas
- Admins who need to configure clusters to run Spark with GPU nodes.
- Data scientists who need to build DL applications on Spark.
- Developers who need to integrate DL features on Spark.
Issue Links
- is related to
  - SPARK-29780 The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads (Open)
  - SPARK-30446 Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases (Resolved)
  - SPARK-30448 accelerator aware scheduling enforce cores as limiting resource (Resolved)
  - SPARK-27363 Mesos support for GPU-aware scheduling (Open)
  - SPARK-27372 Standalone executor process-level isolation to support GPU scheduling (Open)
  - SPARK-29762 GPU Scheduling - default task resource amount to 1 (Open)
  - SPARK-27005 Design sketch for SPIP discussion: Accelerator-aware scheduling (Resolved)
  - SPARK-27365 Spark Jenkins supports testing GPU-aware scheduling features (Resolved)
  - SPARK-29151 Support fraction resources for task resource scheduling (Resolved)
- relates to
  - SPARK-27024 Executor interface for cluster managers to support GPU resources (Resolved)
Maybe I'm missing it, but how does this work with the resource manager?
rdd.map { xxxx }
.reduceByKey { xxx }
.mapPartition { accelerator required tasks }
.enableAccelerator()
.collect()
If I ask for 200 executors up front, Spark has to know about the accelerator requirement at the time it asks for them.