Details
Description
(See details in the linked/attached SPIP doc.)
The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks failed in the middle, where Spark would abort all tasks and restart the stage.
Attachments
Attachments
Issue Links
- relates to
-
MAPREDUCE-2911 Hamster: Hadoop And Mpi on the same cluSTER
- Resolved
-
SPARK-1485 Implement AllReduce
- Resolved
-
SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
- Resolved
- links to
Issues in epic
|
SPARK-24375 | Design sketch: support barrier scheduling in Apache Spark | Resolved | Xingbo Jiang | ||
SPARK-24580 | List scenarios to be handled by barrier execution mode properly | Open | Xingbo Jiang | |||
|
SPARK-24581 | Design: BarrierTaskContext.barrier() | Resolved | Xingbo Jiang | ||
|
SPARK-24582 | Design: Barrier execution mode | Resolved | Xingbo Jiang | ||
SPARK-24723 | Discuss necessary info and access in barrier mode + YARN | Open | Saisai Shao | |||
SPARK-24724 | Discuss necessary info and access in barrier mode + Kubernetes | Open | Yinan Li | |||
SPARK-24725 | Discuss necessary info and access in barrier mode + Mesos | Open | Unassigned | |||
|
SPARK-24726 | Discuss necessary info and access in barrier mode + Standalone | Resolved | Xiangrui Meng | ||
|
SPARK-24795 | Implement barrier execution mode | Resolved | Xingbo Jiang | ||
|
SPARK-24817 | Implement BarrierTaskContext.barrier() | Resolved | Xingbo Jiang | ||
|
SPARK-24818 | Ensure all the barrier tasks in the same stage are launched together | Resolved | wuyi | ||
|
SPARK-24819 | Fail fast when no enough slots to launch the barrier stage on job submitted | Resolved | Xingbo Jiang | ||
|
SPARK-24820 | Fail fast when submitted job contains PartitionPruningRDD in a barrier stage | Resolved | Xingbo Jiang | ||
|
SPARK-24821 | Fail fast when submitted job compute on a subset of all the partitions for a barrier stage | Resolved | Xingbo Jiang | ||
|
SPARK-24822 | Python support for barrier execution mode | Resolved | Xingbo Jiang | ||
SPARK-24823 | Cancel a job that contains barrier stage(s) if the barrier tasks don't get launched within a configured time | Open | Unassigned | |||
SPARK-24824 | Make Spark task speculation a per-stage config | Open | Unassigned | |||
SPARK-24874 | Allow hybrid of both barrier tasks and regular tasks in a stage | Open | Unassigned | |||
SPARK-24877 | Ignore the task completion event from a zombie barrier task | Open | Unassigned | |||
SPARK-24941 | Add RDDBarrier.coalesce() function | Open | Unassigned | |||
SPARK-24942 | Improve cluster resource management with jobs containing barrier stage | Open | Unassigned | |||
|
SPARK-24954 | Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled | Resolved | Xingbo Jiang | ||
|
SPARK-25017 | Add test suite for ContextBarrierState | Resolved | Unassigned | ||
|
SPARK-25045 | Make `RDDBarrier.mapParititions` similar to `RDD.mapPartitions` | Resolved | Xingbo Jiang | ||
|
SPARK-25095 | Python support for BarrierTaskContext | Resolved | Xingbo Jiang | ||
|
SPARK-25161 | Fix several bugs in failure handling of barrier execution mode | Resolved | Xingbo Jiang | ||
|
SPARK-25247 | Make RDDBarrier configurable | Resolved | Unassigned | ||
|
SPARK-25248 | Audit barrier APIs for Spark 2.4 | Resolved | Xiangrui Meng | ||
|
SPARK-25265 | Fix memory leak in Barrier Execution Mode | Resolved | Kousuke Saruta | ||
|
SPARK-25266 | Fix memory leak in Barrier Execution Mode | Resolved | Kousuke Saruta | ||
|
SPARK-25376 | Scenarios we should handle but missed in 2.4 for barrier execution mode | Resolved | Unassigned | ||
|
SPARK-28561 | DAG viz for barrier-execution mode | Resolved | Kousuke Saruta |