Spark / SPARK-2926

Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Shuffle
    • Labels:
      None

      Description

      Currently Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the number of reducers is very large. But the reducer side still adopts the hash-based shuffle reader implementation, which neglects the ordering of map output data in some situations.

      Here we propose an MR-style sort-merge shuffle reader for sort-based shuffle, to further improve its performance.
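      For illustration, the proposed reduce side amounts to a k-way merge of per-map output streams that are already sorted by key. A minimal sketch (hypothetical Python, not the actual Spark implementation; `sort_merge_read` and the sample data are invented for illustration):

```python
import heapq

def sort_merge_read(sorted_map_outputs):
    """Merge (key, value) streams that each map task already sorted by key.

    Because every input stream is sorted, the reducer can stream a k-way
    merge instead of re-sorting or building a hash map over all records.
    """
    return heapq.merge(*sorted_map_outputs, key=lambda kv: kv[0])

# Three map outputs for one reduce partition, each sorted by key.
outputs = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (3, "c")],
    [(5, "e")],
]
merged = list(sort_merge_read(outputs))
# merged -> [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```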

      Work-in-progress code and a performance test report will be posted once some unit-test bugs are fixed.

      Any comments would be greatly appreciated.
      Thanks a lot.

      Attachments

      1. Spark Shuffle Test Report on Spark2.x.pdf
        186 kB
        Li Yuanjian
      2. SortBasedShuffleReader on Spark 2.x.pdf
        3.52 MB
        Li Yuanjian
      3. Spark Shuffle Test Report(contd).pdf
        326 kB
        Saisai Shao
      4. Spark Shuffle Test Report.pdf
        340 kB
        Saisai Shao
      5. SortBasedShuffleRead.pdf
        56 kB
        Saisai Shao

        Issue Links

          Activity

          jerryshao Saisai Shao added a comment -

          A rough design doc is uploaded. Any comments would be greatly appreciated.

          sandyr Sandy Ryza added a comment -

          Hi Saisai,

          This seems like a very useful addition. If my understanding is correct, the reduce-side merge procedure seems like a fairly straightforward port of the approach used in MapReduce.

          A couple considerations:

          • Is there code we can share with the map-side merge?
          • It may make sense to use by-key sorting in some situations even when the operation doesn't require a key ordering. For example, if we're doing a groupByKey and are going to end up spilling most of the data on the reduce side anyway, holding objects in Aggregator's in-memory hashmap is pure overhead.
          jerryshao Saisai Shao added a comment -

          Hi Sandy,

          Thanks a lot for your comments. The basic idea is the same as MapReduce, but the implementation differs slightly to stay compatible with Spark's code path.

          The code may share some common parts with the map-side merge. I think we should first verify the pros and cons of this method, then refactor to make it better.

          For operations like groupByKey, I'm not sure whether by-key sorting performs better than the Aggregator approach; currently I still use the Aggregator on the reduce side. We need performance tests to verify whether by-key sorting is necessary.

          matei Matei Zaharia added a comment -

          Hey Saisai, a couple of questions about this:

          • Doesn't the ExternalAppendOnlyMap used on the reduce side already do a merge-sort? You won't do much better than that unless you assume that an Ordering is given for the key, which isn't actually something we receive at the ShuffleWriter from our APIs so far (though we have the opportunity to pass it through in Scala).
          • What kind of data did you test key sorting on the map side with? The cost of comparisons will depend a lot on the data type, but will be much higher for things like strings or tuples.
          matei Matei Zaharia added a comment -

          Basically because of these things, it would be great to see results from a prototype to decide whether we want a different code path here. One place where it might help a lot is sortByKey, but even there, the sorting algorithm we use (TimSort) actually does take advantage of partially sorted runs. And we have the problem of not passing an ordering on the map side yet. To me it's really non-obvious whether sorting on the map side will make the overall job faster – it might just move CPU cost around from the reduce task to the map task.
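          Matei's point about TimSort exploiting partially sorted runs is easy to demonstrate outside Spark. The sketch below (plain Python, whose sorted() is also Timsort; the `Counted` wrapper is invented for illustration) counts comparisons and shows that input made of a few pre-sorted runs needs far fewer comparisons than random input:

```python
import random

class Counted:
    """Wrapper whose comparisons are counted, to observe sort cost."""
    comparisons = 0

    def __init__(self, v):
        self.v = v

    def __lt__(self, other):  # sorted() only needs __lt__
        Counted.comparisons += 1
        return self.v < other.v

def count_comparisons(data):
    Counted.comparisons = 0
    sorted(Counted(x) for x in data)
    return Counted.comparisons

n = 10_000
two_runs = list(range(0, n, 2)) + list(range(1, n, 2))  # two sorted runs
shuffled = random.sample(range(n), n)

# Timsort detects the two pre-sorted runs and merges them cheaply,
# so it needs far fewer comparisons than on randomly ordered data.
assert count_comparisons(two_runs) < count_comparisons(shuffled)
```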

          jerryshao Saisai Shao added a comment - edited

          Hi Matei, thanks a lot for your comments.

          The original point of this proposal is to directly merge the data, without the re-sorting that happens when spilling through ExternalAppendOnlyMap (EAOM), whenever the map outputs are by-key sorted (with a key Ordering or a hash-code ordering). If no Ordering is available or needed, as with groupByKey, the current design still uses EAOM to do aggregation.

          I use SparkPerf sort-by-key workload to test the current shuffle implementations:

          1. sort shuffle write with hash shuffle read (current sort-based shuffle implementation).
          2. sort shuffle write and sort merge shuffle read (my prototype).

          Test data type is String, key and value length is 10, record number is 2G, and the data is stored in HDFS. My rough test shows that my prototype may be slower in shuffle write (1.18x slower) because of the extra key comparison, but 2.6x faster than HashShuffleReader on the reduce side.

          I have to admit that sort-by-key alone cannot fully illustrate the necessity of this proposal, and the method favors the sortByKey scenario. I will continue with other workload tests to see whether this method is really necessary, and will post my test results later.

          At least I think this method can use memory more effectively and alleviate GC pressure, because it stores map output partitions in memory as raw ByteBuffers and need not maintain a large hash map to do aggregation.

          Thanks again for your comments.

          jerryshao Saisai Shao added a comment - edited

          Hi Matei,

          I just uploaded a Spark shuffle performance test report. In this report, I chose 3 different workloads (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to test the performance of the current 3 shuffle implementations: hash-based shuffle; sort-based shuffle with HashShuffleReader; and sort-based shuffle with the sort-merge shuffle reader (our prototype). Generally, for sort-by-key our prototype gains more than the other two implementations, while for the other two workloads the performance is almost the same.

          Would you mind taking a look at it? Any comments would be greatly appreciated. Thanks a lot.

          jerryshao Saisai Shao added a comment -

          I think this prototype can easily offer the functionality SPARK-2978 needs.

          matei Matei Zaharia added a comment -

          I see, thanks for posting the benchmarks. This does seem like it's worth investigating further. Can you also run some tests with other aggregation factors? Also, a few notes on the configuration:

          • There's a big change in behavior when you go above 200 reduce tasks, because ExternalSorter does the same thing as hash-based shuffle if the number of reduce tasks is below 200.
          • When doing these kinds of tests, set spark.kryo.referenceTracking = false; otherwise serialization will be a major CPU cost.
          • We need to try other types of keys as well, e.g. integers or longer strings. As I said, the cost to compare elements will depend on their data type.

          Finally, it would be great if this proposal reused parts of ExternalSorter or otherwise shared code for it. It looks like it's not a clear win in all cases, but maybe for example we can use it in sortByKey, and later in groupBy / join when we update those to deal with it.
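          As a concrete sketch of the configuration note above (the application class and jar names are placeholders, not from this issue):

```shell
# Benchmark run with Kryo reference tracking disabled, per Matei's note.
# com.example.SortByKeyBenchmark and benchmark.jar are placeholders.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.referenceTracking=false \
  --class com.example.SortByKeyBenchmark \
  benchmark.jar
```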

          jerryshao Saisai Shao added a comment -

          Hi Matei, sorry for the late response. I will test more scenarios per your notes, and also factor out the code to see whether some of it can be shared with ExternalSorter. Thanks a lot.

          jerryshao Saisai Shao added a comment -

          Hi Matei Zaharia, per your comments, I just did another round of performance tests comparing the current implementation of sort-based shuffle and my proposal.

          For the sort-by-key case, I tested with keys of different string lengths to verify the key-comparison overhead. Though, as you said, the comparison time increases, the total performance is still better than the current implementation.

          For the aggregate-by-key case, with different aggregation factors the performance of the two implementations is closer.

          I think we can use it in sortByKey() first, as you said; besides, some code like mergeSort() and mergeWithAggregation() can be shared with this proposal. Would you mind taking a look at the new test report and giving me some comments?

          Thanks a lot, and I appreciate your time.

          rxin Reynold Xin added a comment -

          Do you have a branch that I can test with? I'm running some sorting tests and can test this out also on some dataset.

          jerryshao Saisai Shao added a comment -

          Hi Reynold, thanks a lot for watching this. Here is the branch (https://github.com/jerryshao/apache-spark/tree/sort-based-shuffle-read), though the code is not rebased to the latest master branch.

          rxin Reynold Xin added a comment -

          Do you mind creating a separate branch that's based on https://github.com/rxin/spark/tree/netty-blockTransferService ?

          jerryshao Saisai Shao added a comment - edited

          OK, I will give it a try and let you know when it is ready. Thanks a lot.

          jerryshao Saisai Shao added a comment -

          Hey Reynold Xin, here is the branch rebased on your code (https://github.com/jerryshao/apache-spark/tree/sort-shuffle-read-new-netty). Mind taking a look at it? Thanks a lot.

          rxin Reynold Xin added a comment -

          Thanks - will do some testing this week for my dataset.

          jerryshao Saisai Shao added a comment -

          Looking forward to your feedback.

          sandyr Sandy Ryza added a comment -

          Reynold Xin did you ever get a chance to try this out?

          apachespark Apache Spark added a comment -

          User 'jerryshao' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3438

          manojsamel Manoj Samel added a comment -

          Which release will have this change available?

          jerryshao Saisai Shao added a comment -

          Hi Manoj Samel, we're still waiting for a maintainer who could review this. Thanks for your attention.

          DoingDone9 Zhongshuai Pei added a comment - edited

          Hi, I tested sortByKey with spark-perf (https://github.com/databricks/spark-perf), but I got results like this:

          Spark 1.3:

          {"time":452.453}, {"time":457.929}, {"time":452.295}

          With your PR:

          {"time":471.215}, {"time":460.59}, {"time":463.795}

          Could you tell me what I did incorrectly? Thank you.

          jerryshao Saisai Shao added a comment -

          Hi Zhongshuai Pei, would you please give some detailed information about your test environment: cluster size, hardware configuration, and Spark configuration? Would you also provide each stage's running time compared to the total running time? Thanks a lot.

          In my local environment (a small 1 master + 4 slaves setup), I tested my patch rebased to the latest master; compared to the master branch, sortByKey with my patch is still about 15% to 20% faster.

          There's a possibility that different hardware configurations shift the bottleneck and produce different results. I will investigate more; it would be very helpful if you could offer more detailed information.

          XuanYuan Li Yuanjian added a comment - edited

          During our work migrating some old Hadoop jobs to Spark, I noticed this JIRA and the code based on Spark 1.x.

          I re-implemented the old PR based on Spark 2.1 and the current master branch. After reproducing some scenarios and running benchmark tests, I found that this shuffle mode brings a 12x~30x improvement in task duration and reduces peak execution memory to 1/12 ~ 1/50 of the current master version (see detailed screenshots and test data in the attached PDF). The memory reduction is especially notable: in this shuffle mode Spark can handle more data in less memory. The detailed doc attached to this JIRA is named "SortShuffleReader on Spark 2.x.pdf".

          I know the Dataset API has better optimization and performance, but the RDD API is still useful for flexible control and old Spark/Hadoop jobs. For better performance in ordering cases and more cost-effective memory usage, this PR may still be worth merging into master.

          I'll sort out the current code base and open a PR soon. Any comments and trying it out would be greatly appreciated.

          apachespark Apache Spark added a comment -

          User 'xuanyuanking' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19745

          XuanYuan Li Yuanjian added a comment -

          I just opened a preview PR above. I'll collect more suggestions about this and maybe raise an SPIP vote later.

          jerryshao Saisai Shao added a comment -

          Li Yuanjian, would you please use spark-perf's micro benchmark (https://github.com/databricks/spark-perf) to verify again with the same workload mentioned in the original test report? That would be more comparable. Theoretically this solution cannot get a 12x-30x boost according to my tests, because it doesn't actually reduce the computation; it just moves part of the comparison work from reduce to map, which potentially saves some CPU cycles and improves cache hits.

          Can you please explain the key difference and the reason for such a boost? Thanks!

          XuanYuan Li Yuanjian added a comment -

          Saisai Shao, thanks a lot for your advice and reply.

          would you please use spark-perf's micro benchmark (https://github.com/databricks/spark-perf) to verify again with same workload as mentioned in original test report?

          Sure, I'll verify this again ASAP.

          Theoretically this solution cannot get 12x-30x boosting according to my test

          At first I also questioned this; I attached all the screenshots in the PDF. The 12x boost happened in both scenarios, with 1 and with 100 reducer tasks. The stage duration dropped from 2min to 9s (13x) with 1 reducer task, and from 20min to 1.4min with 100. The 30x boost appeared after I added more data pressure on the reducer tasks.

          Can you please explain the key difference and the reason of such boosting?

          I think the key difference mainly comes from these 2 points:
          1. As Saisai said, BlockStoreShuffleReader uses `ExternalSorter` for the reduce work, so every record must be compared, while SortShuffleReader is more CPU friendly: it collects all shuffle map results (both data in memory and data spilled to disk) and combines them with a merge sort, since each partition has already been sorted on the map side.
          2. The obvious cut in peak memory used by the reduce task, which saves GC time during sorting.

          jerryshao Saisai Shao added a comment -

          So your 12x-30x boost refers only to the reduce stage? Of course this solution can boost the reduce stage, but it will also increase the time of the map stage, so we'd better use total job time to evaluate.

          XuanYuan Li Yuanjian added a comment -

          Yes, only the reduce stage. You're right, I shouldn't pay attention only to the final stage. I've rearranged all the screenshot data below:

          test name | map stage shuffle write | reduce stage shuffle read | map stage duration | reduce stage duration | total job time | code base
          Test Round 1 | 654.4 MB | 802.6 MB | 3.6 min | 2.0 min | 5.5 min | master
          Test Round 1 | 654.4 MB | 714.0 MB | 3.4 min | 9 s | 3.5 min | SPARK-2926
          Test Round 2: add more pressure for SortShuffleReader by coalesce | 654.4 MB | 654.4 MB | 2.6 min | 20 min | 22 min | master
          Test Round 2: add more pressure for SortShuffleReader by coalesce | 654.4 MB | 654.4 MB | 3.7 min | 1.4 min | 5.1 min | SPARK-2926
          Test Round 3: file spill scenario of sort shuffle reader | 142.6 MB | 142.6 MB | 26 s | 16 min | 16 min | master
          Test Round 3: file spill scenario of sort shuffle reader | 142.6 MB | 142.6 MB | 21 s | 13 s | 34 s | SPARK-2926
          Test Round 3: file spill scenario of sort shuffle reader | 142.6 MB | 142.6 MB | 22 s | 25 s | 47 s | SPARK-2926 (force spill to disk)
          Test Round 3: file spill scenario of sort shuffle reader | 142.6 MB | 142.6 MB | 22 s | 29 s | 51 s | SPARK-2926 (force spill to disk)
          XuanYuan Li Yuanjian added a comment -

          Saisai Shao, hi Saisai, thanks for your advice. I added a test report per your suggestion. As described in the report, I only compare the two shuffle modes on the 'sort-by-key' workload, because the other test workloads share the same code paths in the POC implementation (SortShuffleWriter with BlockStoreShuffleReader).
          I also added a config (code link) just to force disabling SerializedShuffle in the 'sort-by-key' workload; otherwise both master and the POC use SerializedShuffle.
          For sort-by-key after disabling SerializedShuffle, the POC version is 1.44x faster than current master: the map-side stage is 1.16x slower, but the reducer stage gets a 9.4x boost.


            People

            • Assignee: jerryshao Saisai Shao
            • Reporter: jerryshao Saisai Shao
            • Votes: 4
            • Watchers: 66