
SPARK-2688: Need a way to run multiple data pipelines concurrently


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Suppose we want to do the following data processing:

      rdd1 -> rdd2 -> rdd3
                 | -> rdd4
                 | -> rdd5
                 \ -> rdd6
      

      where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(fn), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too.
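
      For concreteness, the pattern above looks roughly like this (a minimal sketch, assuming rdd1 is an existing RDD[Int]; the transformations and the dummy function are placeholders, not part of the original report):

        val rdd2 = rdd1.map(_ * 2)          // shared intermediate
        val rdd3 = rdd2.filter(_ % 3 == 0)
        val rdd4 = rdd2.filter(_ % 5 == 0)
        // Each action launches an independent job:
        rdd3.foreach(_ => ())               // computes rdd1 -> rdd2 -> rdd3
        rdd4.foreach(_ => ())               // recomputes rdd2 from rdd1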

      This is required for Hive to support multi-insert queries (HIVE-7292).
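
      For reference, the workaround available with the current API is to persist the shared RDD and submit the downstream actions as concurrent jobs from separate threads, since SparkContext is thread-safe. A minimal sketch, assuming placeholder data and transformations:

        import org.apache.spark.{SparkConf, SparkContext}
        import scala.concurrent.{Await, Future}
        import scala.concurrent.ExecutionContext.Implicits.global
        import scala.concurrent.duration.Duration

        val sc = new SparkContext(new SparkConf().setAppName("multi-pipeline"))
        val rdd1 = sc.parallelize(1 to 1000000)

        // Persist the shared intermediate so the branches reuse it.
        val rdd2 = rdd1.map(_ * 2).cache()
        rdd2.count()  // materialize the cache once, up front

        val branches = Seq(
          rdd2.filter(_ % 3 == 0),   // rdd3
          rdd2.filter(_ % 5 == 0),   // rdd4
          rdd2.map(_ + 1),           // rdd5
          rdd2.map(_ - 1))           // rdd6

        // SparkContext is thread-safe, so the four actions can be
        // submitted as concurrent jobs from separate threads.
        val jobs = branches.map(rdd => Future { rdd.foreach(_ => ()) })
        jobs.foreach(Await.ready(_, Duration.Inf))

      This still pays the cost of caching rdd2 and of one scheduled job per branch; the request in this issue is for the scheduler to run the whole DAG as a single job.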


            People

              Assignee: Unassigned
              Reporter: Xuefu Zhang (xuefuz)
              Votes: 1
              Watchers: 13
