[HIVE-7958] SparkWork generated by SparkCompiler may require multiple Spark jobs to run

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Spark
Labels:
- Spark-M1

Description

A SparkWork instance may currently contain disjoint work graphs. For instance, union_remove_1.q may generate a plan like this:

Reduce 2 <- Map 1
Reduce 4 <- Map 3

The SparkPlan instance generated from this work graph contains two result RDDs. When such a plan is executed, we call .foreach() on the two RDDs sequentially, which results in two Spark jobs, one after the other.

While this works functionally, performance suffers because the Spark jobs run sequentially rather than concurrently.

Another side effect is that the corresponding SparkPlan instance is more complicated than it needs to be.

There are two potential approaches:

1. Let SparkCompiler generate work that can be executed in ONE Spark job only. In the above example, two Spark tasks should be generated.

2. Let SparkPlanGenerator generate multiple Spark plans and have SparkClient execute them concurrently.

Approach #1 seems more reasonable and fits our architecture more naturally. Also, Hive's task execution framework already takes care of task concurrency.
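To illustrate approach #1, here is a minimal sketch of how a disjoint work graph could be split into its connected components, each of which would then become its own Spark task. The class `WorkGraphSplitter` and its `split` method are hypothetical names for illustration only and are not part of Hive; work nodes are represented as plain strings rather than real work objects, and works with no edges at all are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a Hive class): groups works into connected components. */
public class WorkGraphSplitter {

    /**
     * @param edges pairs of {parent, child} work names, e.g. {"Map 1", "Reduce 2"}
     * @return one set of work names per connected component, in first-seen order
     */
    public static List<Set<String>> split(String[][] edges) {
        // Build an undirected adjacency map; edge direction is irrelevant for connectivity.
        Map<String, Set<String>> adjacent = new LinkedHashMap<>();
        for (String[] edge : edges) {
            adjacent.computeIfAbsent(edge[0], k -> new LinkedHashSet<>()).add(edge[1]);
            adjacent.computeIfAbsent(edge[1], k -> new LinkedHashSet<>()).add(edge[0]);
        }

        List<Set<String>> components = new ArrayList<>();
        Set<String> visited = new LinkedHashSet<>();
        for (String start : adjacent.keySet()) {
            if (!visited.add(start)) {
                continue; // this work already belongs to an earlier component
            }
            // Depth-first traversal collecting every work reachable from 'start'.
            Set<String> component = new LinkedHashSet<>();
            Deque<String> pending = new ArrayDeque<>();
            pending.push(start);
            while (!pending.isEmpty()) {
                String work = pending.pop();
                component.add(work);
                for (String next : adjacent.get(work)) {
                    if (visited.add(next)) {
                        pending.push(next);
                    }
                }
            }
            components.add(component);
        }
        return components;
    }

    public static void main(String[] args) {
        // The plan from the description: two disjoint Map -> Reduce chains.
        String[][] edges = { {"Map 1", "Reduce 2"}, {"Map 3", "Reduce 4"} };
        for (Set<String> component : split(edges)) {
            System.out.println(component);
        }
    }
}
```

For the plan above, this yields two components, one containing Map 1 and Reduce 2 and the other containing Map 3 and Reduce 4. Each component could be wrapped in a separate Spark task and left to Hive's task execution framework to schedule.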

Attachments

HIVE-7958-spark.patch
04/Sep/14 22:44
3 kB
Xuefu Zhang

Issue Links

is required by

HIVE-7292 Hive on Spark

Resolved

People

Assignee: Xuefu Zhang

Reporter: Xuefu Zhang

Votes: 0

Watchers: 1

Dates

Created: 03/Sep/14 15:55

Updated: 16/Oct/14 23:00

Resolved: 16/Oct/14 23:00