[SPARK-2787] Make sort-based shuffle write files directly when there is no sorting / aggregation and # of partitions is small - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: None
Labels:
None

Target Version/s:

1.1.0

Description

Right now sort-based shuffle is slower than hash-based for operations like groupByKey where data is passed straight from the map task to the reduce task, because it keeps building up a buffer in memory and spilling it to disk instead of directly opening more files (thus having more GC pressure), and then it has to read back and merge all the spills (incurring both serialization cost and GC pressure). When the number of partitions is small enough (say less than 100 or 200), we should just open N files, write stuff to them as in hash-based shuffle, and then concatenate them at the end to still end up with a single file (avoiding the many-files problem of the hash-based implementation).

It may also be possible to avoid the concatenation but that introduces complexity in serving the files, so I'd try concatenating them for now. We can benchmark it to see the performance.

Attachments

Issue Links

links to

[Github] Pull Request #1799 (mateiz)

Activity

People

Assignee:: Matei Alexandru Zaharia

Reporter:: Matei Alexandru Zaharia

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 01/Aug/14 07:25

Updated:: 08/Aug/14 01:07

Resolved:: 08/Aug/14 01:07