[SPARK-3622] Provide a custom transformation that can output multiple RDDs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

All existing transformations return at most one RDD, even those that take user-supplied functions, such as mapPartitions(). However, a user-provided function may sometimes need to output multiple RDDs, for instance a filter function that divides the input RDD into several RDDs. While it's possible to get multiple RDDs by transforming the same RDD multiple times, it may be more efficient to do this concurrently in one shot, especially when the user's existing function already generates different data sets.

This is the case in Hive on Spark, where Hive's map function and reduce function can output different data sets to be consumed by subsequent stages.
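
The trade-off described above can be illustrated with a small sketch in plain Python. It is not Spark code and does not use any Spark API: lists stand in for RDDs, and the helper names (split_by_repeated_filter, split_in_one_pass) are hypothetical, chosen only for this example. The first helper mirrors the existing workaround of transforming the same input once per output; the second mirrors the requested behaviour, where a single user function routes each record to one of several outputs in one pass.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def split_by_repeated_filter(
    records: Iterable[T], predicates: Dict[str, Callable[[T], bool]]
) -> Dict[str, List[T]]:
    """Workaround style: scan the whole input once per requested output."""
    materialized = list(records)  # keep the input so it can be re-read
    return {
        name: [r for r in materialized if keep(r)]
        for name, keep in predicates.items()
    }


def split_in_one_pass(
    records: Iterable[T], classify: Callable[[T], Hashable]
) -> Dict[Hashable, List[T]]:
    """Requested style: one scan, each record routed to exactly one output."""
    outputs: Dict[Hashable, List[T]] = defaultdict(list)
    for r in records:
        outputs[classify(r)].append(r)
    return dict(outputs)


numbers = range(10)

by_filter = split_by_repeated_filter(
    numbers,
    {"even": lambda n: n % 2 == 0, "odd": lambda n: n % 2 == 1},
)
by_single_pass = split_in_one_pass(
    numbers, lambda n: "even" if n % 2 == 0 else "odd"
)
```

Both helpers produce the same groups here. The difference is only in how many times the input is read, which is the efficiency concern this issue raises.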

Issue Links

is depended upon by

SPARK-3145 Hive on Spark umbrella

Resolved

relates to

SPARK-2688 Need a way to run multiple data pipeline concurrently

Resolved

People

Assignee: Unassigned

Reporter: Xuefu Zhang

Votes: 0

Watchers: 11

Dates

Created: 21/Sep/14 05:38

Updated: 02/Feb/15 19:09

Resolved: 25/Jan/15 17:33