[SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.2.0
Component/s: Shuffle, Spark Core
Labels: release-notes
Target Version/s: 3.2.0

Description

In large deployments of Spark compute infrastructure, shuffle is becoming a potential scaling bottleneck and a source of cluster inefficiency. When running Spark on YARN at large scale, operators usually enable the Spark external shuffle service and store the intermediate shuffle files on HDDs. The number of blocks generated for a particular shuffle grows quadratically with the size of the shuffled data: the numbers of mappers and reducers each grow linearly with the data size, but the number of blocks is # mappers * # reducers. As a result, the general trend we have observed is that the more data a Spark application processes, the smaller its shuffle blocks become. In a few production clusters we have seen, the average shuffle block size is only tens of KB. Because random reads of small amounts of data on HDDs are inefficient, the overall efficiency of the external shuffle services serving these blocks degrades as an increasing number of Spark applications process an increasing amount of data. In addition, because the external shuffle service is a shared service in a multi-tenant cluster, inefficiency in one Spark application can propagate to other applications.
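The quadratic growth described above can be illustrated with a small back-of-the-envelope calculation. This is only a sketch: the function name `shuffle_block_stats` and the assumption that each mapper and each reducer handles roughly 1 GiB of shuffle data are hypothetical and not taken from the ticket.

```python
import math

GIB = 1024 ** 3


def shuffle_block_stats(shuffle_bytes, bytes_per_task):
    """Return (mappers, reducers, blocks, avg_block_bytes) for one shuffle.

    Assumption (hypothetical): mappers and reducers are both sized so that
    each task handles roughly bytes_per_task of shuffle data, so both counts
    grow linearly with the shuffle size.
    """
    mappers = max(1, math.ceil(shuffle_bytes / bytes_per_task))
    reducers = max(1, math.ceil(shuffle_bytes / bytes_per_task))
    blocks = mappers * reducers  # one block per (mapper, reducer) pair
    avg_block_bytes = shuffle_bytes / blocks
    return mappers, reducers, blocks, avg_block_bytes


for size_gib in (100, 1000, 10000):
    m, r, b, avg = shuffle_block_stats(size_gib * GIB, bytes_per_task=GIB)
    print(f"{size_gib:>6} GiB: {m} mappers x {r} reducers = {b} blocks, "
          f"avg block {avg / 1024:.1f} KiB")
```

Under these assumptions, processing 10x more data produces 100x more blocks, and the average block is 10x smaller.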

In this ticket, we propose push-based shuffle as a solution to improve Spark shuffle efficiency in the environments described above. With push-based shuffle, shuffle is performed at the end of the map tasks: blocks are pre-merged and moved towards the reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., we extend Spark’s existing Netty-based shuffle protocol and the behavior of Spark mappers, reducers, and drivers. This way, we can bring the benefits of more efficient shuffle to Spark without incurring the dependency or overhead of either a specialized storage layer or external infrastructure pieces.
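The pre-merge idea can be sketched as a toy in-memory model. This is not Spark's implementation: the helper names `map_outputs` and `push_and_merge` are hypothetical, blocks are plain byte strings, and pushes are processed in sorted order with no failures, whereas a real system has to cope with out-of-order or failed pushes.

```python
from collections import defaultdict


def map_outputs(num_mappers, num_reducers):
    """Hypothetical map stage: each mapper emits one small block per reducer."""
    return {
        (m, r): f"m{m}r{r};".encode()
        for m in range(num_mappers)
        for r in range(num_reducers)
    }


def push_and_merge(blocks):
    """Append every pushed block to the merged buffer of its reducer partition."""
    merged = defaultdict(bytearray)
    for (mapper_id, reducer_id), data in sorted(blocks.items()):
        merged[reducer_id] += data
    return {r: bytes(buf) for r, buf in merged.items()}


blocks = map_outputs(num_mappers=4, num_reducers=3)
merged = push_and_merge(blocks)

# Without merging, the reducers issue len(blocks) small reads in total;
# with merging, each reducer reads one larger contiguous chunk.
print(len(blocks), "original blocks ->", len(merged), "merged partitions")
print(merged[0])
```

In this model, each reducer reads one merged chunk instead of one small block per mapper, which is the access pattern that favors HDDs.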

Link to dev mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html

Attachments

vldb_magnet_final.pdf	23/Aug/20 04:36	728 kB	Min Shen
Screen Shot 2020-06-23 at 11.31.22 AM.jpg	24/Jun/20 16:59	253 kB	Min Shen

Issue Links

relates to

SPARK-36389	Revert the change that accepts negative mapId in ShuffleBlockId	Resolved
SPARK-33235	Push-based Shuffle Improvement Tasks	Open

requires

SPARK-36273	Comparison of identical values	Resolved

links to

[Github] Pull Request #29808 (otterc)
SPIP: Spark push based shuffle

Sub-Tasks

1.	Push-based shuffle documentation	Resolved	Venkata krishnan Sowrirajan
2.	RPC implementation to support pushing and merging shuffle blocks	Resolved	Min Shen
3.	Add support for external shuffle service in YARN deployment mode to leverage push-based shuffle	Resolved	Chandni Singh
4.	Add support for executors to push shuffle blocks after successful map task completion	Resolved	Chandni Singh
5.	RPC implementation to support control plane coordination for push-based shuffle	Resolved	Ye Zhou
6.	Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions	Resolved	Venkata krishnan Sowrirajan
7.	Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage	Resolved	Venkata krishnan Sowrirajan
8.	Extend MapOutputTracker to support tracking and serving the metadata about each merged shuffle partitions for a given shuffle in push-based shuffle scenario	Resolved	Venkata krishnan Sowrirajan
9.	Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures	Resolved	Chandni Singh
10.	Add support to properly handle different type of stage retries	Resolved	Venkata krishnan Sowrirajan
11.	Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data	Resolved	Ye Zhou
12.	Fix cases of corruption in merged shuffle blocks that are pushed	Resolved	Chandni Singh
13.	Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way	Resolved	Ye Zhou
14.	Add Support in the ESS to serve merged shuffle block meta and data to executors	Resolved	Chandni Singh
15.	Disable push-based shuffle until the feature is complete	Resolved	Unassigned
16.	FileNotFoundException from the shuffle push can cause the executor to terminate	Resolved	Chandni Singh
17.	Rename classes in shuffle RPC used for block push operations	Resolved	Min Shen
18.	Avoid finalizing when there's no push at all in a shuffle	Resolved	Unassigned
19.	Stage has all tasks finished but with ongoing finalization can cause job hang	Resolved	Unassigned
20.	Disable push based shuffle when IO encryption is enabled or serializer is not relocatable	Resolved	Minchu Yang
21.	Handle fallback when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to true	Resolved	Aravind Patnam

People

Assignee: Min Shen

Reporter: Min Shen

Shepherd: Mridul Muralidharan

Votes: 19

Watchers: 100

Dates

Created: 22/Jan/20 01:15

Updated: 14/Nov/23 20:50

Resolved: 02/Aug/21 05:16