Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: Spark Core
Labels: None
Target Version/s:
For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that
- groups by key: provides (Key, Iterator[Value])
- within each partition, provides keys in sorted order
A couple of ways it could make sense to expose this:
- Add a new operator: "groupAndSortByKey", "groupByKeyAndSortWithinPartition", or "hadoopStyleShuffle", maybe?
- Allow groupByKey to take an ordering param for keys within a partition
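To illustrate the proposed semantics, here is a minimal sketch in plain Python (not Spark code): hash-partition by key, then within each partition emit (key, values) with keys in sorted order, as the MR shuffle does. The name `mr_style_shuffle` and the explicit `partition_fn` argument are illustrative, not an actual Spark API.

```python
from collections import defaultdict

def mr_style_shuffle(pairs, num_partitions, partition_fn=hash):
    """Simulate a Hadoop-MR-style shuffle: hash-partition by key, then
    within each partition yield (key, list-of-values) in sorted key order."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_fn(key) % num_partitions][key].append(value)
    # Keys are sorted only within each partition, not globally.
    return [[(k, part[k]) for k in sorted(part)] for part in partitions]

data = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]
# A deterministic partitioner keeps the example reproducible.
result = mr_style_shuffle(data, num_partitions=2,
                          partition_fn=lambda k: ord(k) % 2)
# Partition 0 holds "b"; partition 1 holds "a" and "c", in sorted order.
```

Note the contrast with a plain groupByKey, which groups values but gives no ordering guarantee on the keys within a partition.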
Issue links:

is depended upon by
- HIVE-7384 Research into reduce-side join [Spark Branch] (Resolved)
- SPARK-3145 Hive on Spark umbrella (Resolved)

is related to
- SPARK-2926 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle (Resolved)