For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e., one that:
- groups by key: provides (Key, Iterator[Value])
- within each partition, provides keys in sorted order
A couple of ways this could be exposed:
- Add a new operator: "groupAndSortByKey", "groupByKeyAndSortWithinPartition", or "hadoopStyleShuffle", maybe?
- Allow groupByKey to take an Ordering parameter for keys within each partition
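To make the proposed semantics concrete, here is a minimal sketch in plain Scala (no Spark; partitions are modeled as Seqs, and the operator name, signature, and hash-partitioning scheme are all illustrative, not an actual API): records are hash-partitioned by key as in the MR shuffle, then within each partition values are grouped per key and keys are emitted in sorted order.

```scala
object GroupAndSortSketch {
  // Sketch of "group by key, with keys sorted within each partition".
  // K: Ordering supplies the sort order for keys within a partition.
  def groupAndSortByKey[K: Ordering, V](
      records: Seq[(K, V)],
      numPartitions: Int): Seq[Seq[(K, Iterator[V])]] = {
    // Hash-partition by key, as the Hadoop MR shuffle does.
    val byPartition = records.groupBy { case (k, _) =>
      math.abs(k.hashCode) % numPartitions
    }
    (0 until numPartitions).map { p =>
      byPartition.getOrElse(p, Seq.empty)
        .groupBy(_._1)  // group values under their key
        .toSeq
        .sortBy(_._1)   // keys in sorted order within this partition
        .map { case (k, kvs) => (k, kvs.map(_._2).iterator) }
    }
  }
}
```

A real RDD version would presumably do the sort lazily per partition (e.g., sort-based shuffle output) rather than materializing each partition, but the observable contract for downstream code is the same: (Key, Iterator[Value]) pairs, keys ascending within a partition.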