Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that
- groups by key: provides (Key, Iterator[Value])
- within each partition, provides keys in sorted order
A couple of ways this could be exposed (a sketch of the first option follows this list):
- Add a new operator: "groupAndSortByKey", "groupByKeyAndSortWithinPartition", or "hadoopStyleShuffle", perhaps.
- Allow groupByKey to take an ordering parameter for keys within a partition.
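For illustration only, here is a rough Scala sketch of what the first option might look like, assuming a sort-within-partition shuffle such as OrderedRDDFunctions.repartitionAndSortWithinPartitions (partitioner plus key ordering) is available. The MRStyleShuffle object and the groupByKeyAndSortWithinPartition helper are hypothetical, not an existing Spark API.
{code:scala}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object MRStyleShuffle {
  // Hypothetical operator: shuffle so each partition holds its keys in sorted order,
  // then group consecutive records with equal keys into (Key, Iterator[Value]) --
  // i.e. the Hadoop MR reducer contract described above.
  def groupByKeyAndSortWithinPartition[K: Ordering: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)],
      numPartitions: Int): RDD[(K, Iterator[V])] = {
    val ord = implicitly[Ordering[K]]
    rdd
      // Assumes a partitioner + key-ordering shuffle is available
      // (e.g. repartitionAndSortWithinPartitions).
      .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
      .mapPartitions { iter =>
        // Keys arrive sorted, so equal keys are adjacent; walk runs of equal keys.
        val buffered = iter.buffered
        new Iterator[(K, Iterator[V])] {
          def hasNext: Boolean = buffered.hasNext
          def next(): (K, Iterator[V]) = {
            val key = buffered.head._1
            val values = ArrayBuffer.empty[V]
            while (buffered.hasNext && ord.equiv(buffered.head._1, key)) {
              values += buffered.next()._2
            }
            (key, values.iterator)
          }
        }
      }
  }
}
{code}
The second option (an ordering parameter on groupByKey) would presumably reuse the same shuffle path; the sketch simply buffers each key's values eagerly, whereas a real implementation would want to stream them.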
Issue Links
- is depended upon by
  - HIVE-7384 Research into reduce-side join [Spark Branch] (Resolved)
  - SPARK-3145 Hive on Spark umbrella (Resolved)
- is related to
  - SPARK-2926 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle (Resolved)
- links to