[SPARK-3461] Support external groupByKey using repartitionAndSortWithinPartitions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.6.0
Component/s: Spark Core
Labels: None

Description

Given that we have SPARK-2978, it seems like we could support an external group by operator pretty easily. We'd just have to wrap the existing iterator exposed by SPARK-2978 with a lookahead iterator that detects the group boundaries. Also, we'd have to override the cache() operator to cache the parent RDD, so that caching this object doesn't wind through the iterator.
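As a rough illustration of the lookahead idea, here is a minimal sketch in plain Python rather than Spark code. It assumes the input is an iterator of (key, value) pairs already sorted by key, which is what a sorted partition would provide. The names `group_sorted` and `values_for` are hypothetical and do not come from Spark or from the linked pull request.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")

_END = object()  # sentinel marking exhaustion of the underlying iterator


def group_sorted(records: Iterable[Tuple[K, V]]) -> Iterator[Tuple[K, Iterator[V]]]:
    """Yield (key, values) groups from records that are already sorted by key.

    Hypothetical helper: only one record (the lookahead) is buffered at a
    time, so the values of a group are never materialized in memory.
    """
    it = iter(records)
    lookahead = next(it, _END)

    def values_for(key):
        nonlocal lookahead
        # Stream values until the lookahead record belongs to another key.
        while lookahead is not _END and lookahead[0] == key:
            value = lookahead[1]
            lookahead = next(it, _END)
            yield value

    while lookahead is not _END:
        key = lookahead[0]
        group = values_for(key)
        yield key, group
        # Drain whatever the caller left unread, so the next group starts
        # exactly at the next key boundary.
        for _ in group:
            pass


if __name__ == "__main__":
    data = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
    for key, values in group_sorted(data):
        print(key, list(values))
```

Because each group's values are read lazily from the shared underlying iterator, a group can only be traversed once and becomes empty after the caller moves on to the next key. Python's `itertools.groupby` behaves the same way. This single-pass property is presumably why the description also raises caching the parent RDD instead of the grouped iterators.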

I haven't totally followed all the sort-shuffle internals, but just given the stated semantics of SPARK-2978 it seems like this would be possible.

It would be really nice to externalize this because many beginner users write jobs in terms of groupByKey.

Issue Links

links to

[Github] Pull Request #3198 (sryza)

People

Assignee: Reynold Xin

Reporter: Patrick Wendell

Votes: 1

Watchers: 12

Dates

Created: 09/Sep/14 17:24

Updated: 10/Dec/15 13:31

Resolved: 09/Dec/15 15:15