[SPARK-2044] Pluggable interface for shuffles - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Shuffle, Spark Core
Labels: None
Target Version/s: 1.2.0

Description

Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I am aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are:

Push-based shuffle where data moves directly from mappers to reducers
Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles)
External spilling within a key
Changing the level of parallelism or even algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged in core)
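
To make the "fewer files" point concrete, here is a rough back-of-the-envelope sketch. It assumes the simplest hash-based scheme, in which every map task writes one file per reduce partition, and a sort-based scheme in which every map task writes a single sorted data file (any index files are ignored). The object and method names are hypothetical and exist only for this illustration.

```scala
// Hypothetical helper, not part of Spark. It only models the file counts
// under the assumptions stated above.
object ShuffleFileCount {
  // Assumed hash-based layout: one file per (map task, reduce partition) pair.
  def hashBasedFiles(numMapTasks: Long, numReducePartitions: Long): Long =
    numMapTasks * numReducePartitions

  // Assumed sort-based layout: one sorted data file per map task.
  def sortBasedFiles(numMapTasks: Long): Long =
    numMapTasks
}
```

Under these assumptions, a shuffle with 1,000 map tasks and 1,000 reduce partitions would go from 1,000,000 files to 1,000, which is where the savings in file handles come from.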

I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner.
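
As a rough illustration of how narrow such an interface can be, the sketch below pairs a writer trait with an iterator-returning reader trait and plugs in a toy in-memory, hash-partitioned implementation. The trait names, method signatures, and the `InMemoryHashShuffleManager` class are assumptions made for this example. They are not taken from the attached design doc and may differ from what was eventually merged.

```scala
import scala.collection.mutable

// Hypothetical write side: a map task hands over its output records.
trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
}

// Hypothetical read side: a reduce task gets an iterator over its partition.
trait ShuffleReader[K, V] {
  def read(): Iterator[(K, V)]
}

// Hypothetical plug-in point: each shuffle implementation supplies both sides.
trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int, numPartitions: Int): ShuffleWriter[K, V]
  def getReader[K, V](shuffleId: Int, partition: Int): ShuffleReader[K, V]
}

// Toy implementation for illustration only: keeps everything in memory and
// assigns records to reduce partitions by key hash. mapId is unused here.
class InMemoryHashShuffleManager extends ShuffleManager {
  // (shuffleId, reduce partition) -> records written by all map tasks
  private val buckets =
    mutable.Map.empty[(Int, Int), mutable.ArrayBuffer[(Any, Any)]]

  override def getWriter[K, V](shuffleId: Int, mapId: Int, numPartitions: Int): ShuffleWriter[K, V] =
    new ShuffleWriter[K, V] {
      override def write(records: Iterator[(K, V)]): Unit = buckets.synchronized {
        records.foreach { case (k, v) =>
          // Non-negative modulo of the key's hash picks the reduce partition.
          val p = Math.floorMod(k.hashCode, numPartitions)
          buckets.getOrElseUpdate((shuffleId, p), mutable.ArrayBuffer.empty[(Any, Any)]) += ((k, v))
        }
      }
    }

  override def getReader[K, V](shuffleId: Int, partition: Int): ShuffleReader[K, V] =
    new ShuffleReader[K, V] {
      override def read(): Iterator[(K, V)] = buckets.synchronized {
        buckets.get((shuffleId, partition)) match {
          // Copy to an immutable list so the iterator is safe outside the lock.
          case Some(buf) =>
            buf.toList.iterator.map { case (k, v) => (k.asInstanceOf[K], v.asInstanceOf[V]) }
          case None => Iterator.empty
        }
      }
    }
}
```

Because callers only see `ShuffleManager`, a sort-based or push-based variant could in principle be swapped in without touching the code that produces or consumes the records.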

If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out.

Attachments

Pluggableshuffleproposal.pdf (96 kB, 05/Jun/14 23:08, Matei Alexandru Zaharia)

Issue Links

is depended upon by
SPARK-2045 Sort-based shuffle implementation (Resolved)

is related to
SPARK-1733 Pluggable storage support for BlockManager (Resolved)

relates to
SPARK-2114 groupByKey and joins on raw data (Resolved)
SPARK-2275 More general Storage Interface for Shuffle / Spill etc. (Closed)

Sub-Tasks

1.	Basic pluggable interface for shuffle	Resolved	Matei Alexandru Zaharia
2.	Move aggregation into ShuffleManager implementations	Resolved	Saisai Shao
3.	Add sorting flag to ShuffleManager, and implement it in HashShuffleManager	Resolved	Saisai Shao
4.	Move MapOutputTracker behind ShuffleManager interface	Resolved	Nan Zhu

People

Assignee: Matei Alexandru Zaharia
Reporter: Matei Alexandru Zaharia
Votes: 1
Watchers: 33

Dates

Created: 05/Jun/14 23:07
Updated: 21/Apr/15 06:32
Resolved: 21/Apr/15 06:32