Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Version: 4.0.0
Description
After the shuffle reader fetches the blocks, it first performs the combine operation and then the sort operation. Both combine and sort may spill temporary files to disk, so performance can be poor when a shuffle uses both. In fact, the combine can be performed during the sort, which avoids the combine spill files.
I did not find any direct API for constructing a shuffle that uses both sort and combine, but it can be done with code like the following. It is a word count whose output words are sorted.
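import org.apache.spark.rdd.ShuffledRDD

// Word count; setting a key ordering on the ShuffledRDD makes the shuffle read both combine and sort.
sc.textFile(input)
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 1)
  .asInstanceOf[ShuffledRDD[String, Int, Int]]
  .setKeyOrdering(Ordering.String)
  .collect()
  .foreach(println)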
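
The difference between the two strategies can be illustrated without Spark. The following is only a minimal in-memory sketch, not Spark's shuffle implementation: the object name CombineDuringSort and both method names are hypothetical, and spilling is not modeled. The first method combines into a hash map and then sorts in a second pass. The second merges values while inserting into a single ordered structure, so no separate combine buffer exists that could spill on its own.

```scala
import scala.collection.mutable

object CombineDuringSort {
  // Strategy 1 (hypothetical name): combine into a hash map first,
  // then sort the combined records in a separate pass.
  def combineThenSort[K: Ordering, V](records: Iterator[(K, V)])(merge: (V, V) => V): Seq[(K, V)] = {
    val combined = mutable.HashMap.empty[K, V]
    records.foreach { case (k, v) =>
      combined.update(k, combined.get(k).map(merge(_, v)).getOrElse(v))
    }
    combined.toSeq.sortBy(_._1)
  }

  // Strategy 2 (hypothetical name): merge values while inserting into one
  // key-ordered map, so combining happens as part of sorting.
  def sortWithCombine[K: Ordering, V](records: Iterator[(K, V)])(merge: (V, V) => V): Seq[(K, V)] = {
    val sorted = mutable.TreeMap.empty[K, V]
    records.foreach { case (k, v) =>
      sorted.update(k, sorted.get(k).map(merge(_, v)).getOrElse(v))
    }
    sorted.toSeq
  }
}
```

Both methods return the same result for the same input. For example, CombineDuringSort.sortWithCombine("b a c a b a".split(" ").iterator.map(w => (w, 1)))(_ + _) yields the pairs (a,3), (b,2), (c,1) in key order.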