For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e., one that:
- groups by key: provides (Key, Iterator[Value])
- within each partition, provides keys in sorted order
A couple of ways this could be exposed:
- Add a new operator: "groupAndSortByKey", "groupByKeyAndSortWithinPartition", or "hadoopStyleShuffle", maybe?
- Allow groupByKey to take an Ordering parameter for keys within each partition
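To make the proposed semantics concrete, here is a minimal sketch in plain Scala (no Spark; partitions are modeled as Seqs, and the operator name, signature, and hash-partitioning scheme are all illustrative, not an actual API): records are hash-partitioned by key as in the MR shuffle, then within each partition values are grouped per key and keys are emitted in sorted order.

```scala
object GroupAndSortSketch {
  // Sketch of "group by key, with keys sorted within each partition".
  // K: Ordering supplies the sort order for keys within a partition.
  def groupAndSortByKey[K: Ordering, V](
      records: Seq[(K, V)],
      numPartitions: Int): Seq[Seq[(K, Iterator[V])]] = {
    // Hash-partition by key, as the Hadoop MR shuffle does.
    val byPartition = records.groupBy { case (k, _) =>
      math.abs(k.hashCode) % numPartitions
    }
    (0 until numPartitions).map { p =>
      byPartition.getOrElse(p, Seq.empty)
        .groupBy(_._1)  // group values under their key
        .toSeq
        .sortBy(_._1)   // keys in sorted order within this partition
        .map { case (k, kvs) => (k, kvs.map(_._2).iterator) }
    }
  }
}
```

A real RDD version would presumably do the sort lazily per partition (e.g., sort-based shuffle output) rather than materializing each partition, but the observable contract for downstream code is the same: (Key, Iterator[Value]) pairs, keys ascending within a partition.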