SPARK-2688: Need a way to run multiple data pipelines concurrently


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      Suppose we want to do the following data processing:

      rdd1 -> rdd2 -> rdd3
                 | -> rdd4
                 | -> rdd5
                 \ -> rdd6
      

      where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we then call rdd4.foreach(fn), rdd2 is recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez has already recognized the importance of this (TEZ-391), so I think Spark should provide it too.

      This is required for Hive to support multi-insert queries (HIVE-7292).
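
      For reference, one workaround available today is to cache() the shared
      RDD and submit the downstream actions as concurrent jobs from separate
      threads, since SparkContext is thread-safe for job submission. Below is
      a minimal sketch of that pattern, assuming the fan-out shape from the
      description; the RDD names and map functions are illustrative, not part
      of this issue:

          import org.apache.spark.{SparkConf, SparkContext}
          import scala.concurrent.{Await, Future}
          import scala.concurrent.ExecutionContext.Implicits.global
          import scala.concurrent.duration.Duration

          object FanOutPipelines {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("fan-out"))

              val rdd1 = sc.parallelize(1 to 1000000)
              val rdd2 = rdd1.map(_ * 2)

              // Without this, every action below recomputes rdd1 -> rdd2.
              rdd2.cache()

              val rdd3 = rdd2.map(_ + 1)
              val rdd4 = rdd2.map(_ + 2)
              val rdd5 = rdd2.map(_ + 3)
              val rdd6 = rdd2.map(_ + 4)

              // Submit the four jobs concurrently from separate threads;
              // they all reuse the cached rdd2.
              val jobs = Seq(rdd3, rdd4, rdd5, rdd6).map { rdd =>
                Future { rdd.foreach(_ => ()) } // dummy action to trigger execution
              }
              Await.result(Future.sequence(jobs), Duration.Inf)

              sc.stop()
            }
          }

      This still requires the caller to manage caching and threading by hand;
      the request in this issue is for a first-class way to trigger execution
      of the whole graph at once.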


              People

              • Assignee: Unassigned
              • Reporter: Xuefu Zhang (xuefuz)
              • Votes: 1
              • Watchers: 13
