Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 1.0.1
- Fix Version/s: None
- Component/s: None
Description
Suppose we want to do the following data processing:

rdd1 -> rdd2 -> rdd3
             |-> rdd4
             |-> rdd5
             \-> rdd6

where -> represents a transformation. rdd3 to rdd6 are all derived from the intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we then call rdd4.foreach(fn), rdd2 is recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez has already recognized the importance of this (TEZ-391), so I think Spark should provide this too.
This is required for Hive to support multi-insert queries. HIVE-7292.
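The recomputation behaviour described above can be sketched with a minimal, pure-Python model of lazy evaluation. This is deliberately not the Spark API: `LazyDataset`, its `map`/`foreach`/`cache` methods, and the compute counters are illustrative stand-ins that mimic how each Spark action re-runs the shared lineage unless an upstream RDD is cached.

```python
# Toy model of lazy-dataset semantics (not the Spark API): every action
# re-runs the whole lineage unless an upstream dataset has been cached.

class LazyDataset:
    def __init__(self, fn, name, counts):
        self.fn = fn          # thunk producing this dataset's elements
        self.name = name
        self.counts = counts  # shared dict: how often each dataset was computed
        self._cached = False
        self._data = None

    def compute(self):
        if self._cached and self._data is not None:
            return self._data             # reuse materialized result
        self.counts[self.name] = self.counts.get(self.name, 0) + 1
        data = self.fn()
        if self._cached:
            self._data = data             # materialize on first computation
        return data

    def cache(self):
        self._cached = True
        return self

    def map(self, f, name):
        return LazyDataset(lambda: [f(x) for x in self.compute()],
                           name, self.counts)

    def foreach(self, f):                 # an "action": forces computation
        for x in self.compute():
            f(x)

def run(cache_rdd2):
    counts = {}
    rdd1 = LazyDataset(lambda: [1, 2, 3], "rdd1", counts)
    rdd2 = rdd1.map(lambda x: x * 2, "rdd2")
    if cache_rdd2:
        rdd2.cache()
    rdd3 = rdd2.map(lambda x: x + 1, "rdd3")
    rdd4 = rdd2.map(lambda x: x - 1, "rdd4")
    rdd3.foreach(lambda x: None)  # triggers rdd1 -> rdd2 -> rdd3 only
    rdd4.foreach(lambda x: None)  # recomputes rdd2 unless it was cached
    return counts["rdd2"]

print(run(cache_rdd2=False))  # 2: rdd2 computed once per action
print(run(cache_rdd2=True))   # 1: materialized by the first action, reused
```

Caching works around the duplicated work, but it still costs an extra materialization step and offers no way to fire all four downstream pipelines as one job, which is the gap this issue asks Spark to fill.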
Attachments
Issue Links
- blocks
  - HIVE-7503 Support Hive's multi-table insert query with Spark [Spark Branch] (Resolved)
- is depended upon by
  - SPARK-3145 Hive on Spark umbrella (Resolved)
- is related to
  - HIVE-9492 Enable caching in MapInput for Spark (Open)
  - SPARK-3622 Provide a custom transformation that can output multiple RDDs (Resolved)
- is required by
  - HIVE-7292 Hive on Spark (Resolved)