[SPARK-46512] Optimize shuffle reading when both sort and combine are used. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Shuffle, Spark Core
Labels:
- pull-request-available

Description

After the shuffle reader fetches the blocks, it first performs a combine operation and then a sort operation. Both combine and sort may spill temporary files to disk, so performance can be poor when a shuffle uses both. The combine can instead be performed during the sort, which avoids the combine spill files.
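To illustrate why the combine can be folded into the sort, here is a minimal conceptual sketch that uses plain Scala collections in place of shuffle blocks. It is not the code from the pull request and does not model spilling; the object and method names (`CombineDuringSort`, `combineThenSort`, `sortAndCombine`) are hypothetical and exist only for this illustration. The point is that once records are ordered by key, equal keys are adjacent, so they can be merged in the same pass that produces the sorted output.

```scala
// Conceptual sketch only: in-memory collections stand in for shuffle blocks.
// All names here are hypothetical and are not Spark APIs.
object CombineDuringSort {
  // Two separate phases: aggregate values per key, then sort the aggregates.
  def combineThenSort(records: Seq[(String, Int)]): Seq[(String, Int)] = {
    val combined = records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    combined.toSeq.sortBy(_._1)
  }

  // Single phase: sort by key, then merge adjacent records that share a key
  // while walking the sorted sequence once.
  def sortAndCombine(records: Seq[(String, Int)]): Seq[(String, Int)] = {
    val sorted = records.sortBy(_._1)
    sorted.foldLeft(Vector.empty[(String, Int)]) { (acc, kv) =>
      acc.lastOption match {
        // Same key as the previous output record: merge into it.
        case Some((k, v)) if k == kv._1 => acc.init :+ (k -> (v + kv._2))
        // New key: start a new output record.
        case _ => acc :+ kv
      }
    }
  }
}
```

Both methods return the same result for the same input; the second needs no separate aggregation structure, which is the property that makes it possible to skip the combine spill.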

I did not find a direct API for constructing a shuffle that uses both sort and combine, but it can be done as in the following code. It is a word count whose output words are sorted.

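import org.apache.spark.rdd.ShuffledRDD

// reduceByKey supplies the combine; setKeyOrdering adds a sort to the same shuffle.
// `input` is the path of the text file to read.
sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
  reduceByKey(_ + _, 1).
  asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
  collect().foreach(println)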

Issue Links

links to

GitHub Pull Request #44512

People

Assignee: Chenyu Zheng

Reporter: Chenyu Zheng

Votes: 0

Watchers: 4

Dates

Created: 26/Dec/23 09:48

Updated: 26/Feb/24 01:40

Resolved: 04/Feb/24 22:55