[SPARK-2045] Sort-based shuffle implementation - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Shuffle, Spark Core
Labels: None
Target Version/s: 1.1.0

Description

Building on the pluggability in SPARK-2044, a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don't have one) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can consume a lot of memory for compression buffers and cause inefficient I/O. A sort-based shuffle would avoid both of those issues.
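
To illustrate the idea, here is a minimal in-memory sketch in Python. It is not Spark's implementation: the names `partition_of`, `shuffle_write`, and `shuffle_read` are hypothetical, keys are assumed to be mutually orderable, and a list stands in for the single output file. Each map task sorts its records by (partition, key) and records one offset per reduce partition, so a reducer can fetch its contiguous slice without the map task keeping one open file per reducer.

```python
import zlib
from typing import Any, List, Tuple

Record = Tuple[Any, Any]


def partition_of(key: Any, num_partitions: int) -> int:
    """Hypothetical deterministic partitioner (an assumption, not Spark's)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def shuffle_write(records: List[Record], num_partitions: int) -> Tuple[List[Record], List[int]]:
    """Sort one map task's output by (partition, key) and build an offset index.

    Returns the sorted records (standing in for a single output file) and
    offsets such that partition p occupies data[offsets[p]:offsets[p + 1]].
    """
    tagged = sorted(
        ((partition_of(k, num_partitions), k, v) for k, v in records),
        key=lambda t: (t[0], t[1]),  # compare partition id first, then key
    )
    data = [(k, v) for _, k, v in tagged]

    # Count the records per partition, then prefix-sum the counts into offsets.
    counts = [0] * num_partitions
    for p, _, _ in tagged:
        counts[p] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return data, offsets


def shuffle_read(data: List[Record], offsets: List[int], partition: int) -> List[Record]:
    """Fetch the contiguous slice belonging to one reduce partition."""
    return data[offsets[partition]:offsets[partition + 1]]
```

In this sketch the only per-reducer state on the map side is one integer offset, whereas the hash-based approach described above needs one open file and one compression buffer per reduce task.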

Attachments

Sort-basedshuffledesign.pdf (86 kB, uploaded 15/Jul/14 21:52 by Matei Alexandru Zaharia)

Issue Links

depends upon

SPARK-2044 Pluggable interface for shuffles

Resolved

is depended upon by

SPARK-2213 Sort Merge Join

Resolved

relates to

SPARK-3655 Support sorting of values in addition to keys (i.e. secondary sort)

Resolved

links to

[Github] Pull Request #1499 (mateiz)

People

Assignee: Matei Alexandru Zaharia

Reporter: Matei Alexandru Zaharia

Votes: 0

Watchers: 23

Dates

Created: 05/Jun/14 23:15

Updated: 23/Sep/14 02:52

Resolved: 31/Jul/14 05:49