Description
For Hive's multi insert query (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there may be an MR job for each insert. When we implement this on Spark, it would be nice if all the inserts could happen concurrently.
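For reference, a multi insert statement reads the source once and writes to several destinations; the table names and predicates below are only illustrative:

    FROM src
    INSERT OVERWRITE TABLE dest1 SELECT key, value WHERE key < 100
    INSERT OVERWRITE TABLE dest2 SELECT key, value WHERE key >= 100;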
It seems that Spark doesn't offer this kind of concurrency out of the box. To make things worse, the source of the inserts may be recomputed unless it's staged, and even with staging the inserts happen sequentially, so performance suffers.
This task is to find out what it takes in Spark to enable concurrent inserts without requiring the source to be staged or the inserts to run sequentially. If this has to be solved in Hive instead, find an optimal way to do it. A sketch of one possible direction follows.
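As a starting point, here is a minimal sketch of how concurrent inserts might be submitted from a shared SparkContext, which accepts job submissions from multiple threads (the question HIVE-7525 investigates). Everything here is an assumption for illustration: the paths, the filter predicates, and the use of persist() as a stand-in for staging. Note that persist() does not fully prevent recomputation, since evicted partitions are rebuilt from the lineage, which is why avoiding staging altogether is part of this task.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object ConcurrentInsertsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multi-insert-sketch"))

        // Shared source of the multi insert; persisted so each job reads the
        // already-computed partitions instead of re-running the scan.
        // (Hypothetical path; persist() stands in for proper staging.)
        val source = sc.textFile("/warehouse/src").persist(StorageLevel.MEMORY_AND_DISK)

        // Each insert becomes its own Spark job. SparkContext is thread-safe
        // for job submission, so wrapping each action in a Future lets the
        // jobs run concurrently, subject to scheduler configuration.
        val insert1 = Future { source.filter(_.contains("a")).saveAsTextFile("/warehouse/dest1") }
        val insert2 = Future { source.filter(_.contains("b")).saveAsTextFile("/warehouse/dest2") }

        // Block until both inserts finish, then shut down.
        Await.result(Future.sequence(Seq(insert1, insert2)), Duration.Inf)
        sc.stop()
      }
    }

With the default FIFO scheduler, concurrent submission alone may not yield much overlap; setting spark.scheduler.mode=FAIR lets the jobs share executors instead of queueing.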
Attachments
Issue Links
- is blocked by
  - SPARK-2688 Need a way to run multiple data pipeline concurrently (Resolved)
- is depended upon by
  - HIVE-7842 Enable qtest load_dyn_part1.q [Spark Branch] (Resolved)
  - HIVE-8233 multi-table insertion doesn't work with ForwardOperator [Spark Branch] (Resolved)
  - HIVE-8208 Multi-table insertion optimization #1: don't always break operator tree. [Spark Branch] (Resolved)
  - HIVE-8215 Multi-table insertion optimization #3: use 1+1 tasks instead of 1+N tasks [Spark Branch] (Resolved)
  - HIVE-8209 Multi-table insertion optimization #2: use separate context [Spark Branch] (Resolved)
  - HIVE-8207 Add .q tests for multi-table insertion [Spark Branch] (Resolved)
- relates to
  - HIVE-7731 Incorrect result returned when a map work has multiple downstream reduce works [Spark Branch] (Resolved)
  - HIVE-8438 Clean up code introduced by HIVE-7503 and such [Spark Plan] (Resolved)
  - HIVE-8219 Multi-Insert optimization, don't sink the source into a file [Spark Branch] (Resolved)
  - HIVE-8220 Refactor multi-insert code such that plan splitting and task generation are modular and reusable [Spark Branch] (Resolved)
- requires
  - HIVE-7525 Research to find out if it's possible to submit Spark jobs concurrently using shared SparkContext [Spark Branch] (Resolved)