Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions, and maybe also in the doc for groupBy:
This can be used to enact a "Hadoop Style" shuffle along with a call to mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit on whether we consider this to be an API we want people to use more widely, or whether we just consider it a narrow, stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better.
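For illustration, here is a minimal sketch of what such a "Hadoop Style" shuffle could look like in Scala. It assumes a local SparkContext and a HashPartitioner with two partitions. The object name HadoopStyleShuffle, the helper sumRuns, and the sample data are hypothetical and only there to make the example self-contained. The point it tries to show is that, because each partition arrives sorted by key, the function passed to mapPartitions can stream over runs of equal keys the way a Hadoop reducer does, without holding a whole group in memory.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object HadoopStyleShuffle {
  // Reducer-like step (hypothetical helper): consumes one partition's records,
  // which are assumed to be sorted by key, and sums each run of equal keys.
  def sumRuns(records: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val it = records.buffered
    new Iterator[(String, Int)] {
      def hasNext: Boolean = it.hasNext
      def next(): (String, Int) = {
        val key = it.head._1
        var sum = 0
        // Advance only while the next record still belongs to the current key.
        while (it.hasNext && it.head._1 == key) sum += it.next()._2
        (key, sum)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, purely for demonstration.
    val conf = new SparkConf().setAppName("hadoop-style-shuffle").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)))
      val part = new HashPartitioner(2)

      val result = rdd
        .repartitionAndSortWithinPartitions(part) // shuffle and sort by key in one step
        .mapPartitions(sumRuns, preservesPartitioning = true)
        .collect()

      result.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this sample data every key ends up with a sum of 5. The order of keys across partitions depends on the partitioner, so only the order within each partition is guaranteed. On Spark versions before 1.3, an additional import of org.apache.spark.SparkContext._ may be needed to bring the pair-RDD implicits into scope.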