[HIVE-8118] Support work that have multiple child works to work around SPARK-3622 [Spark Branch] - ASF JIRA


Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Spark
Labels:
- Spark-M1

Description

In the current implementation, both SparkMapRecordHandler and SparkReduceRecordHandler take only one result collector, so the corresponding map or reduce task can have only one child. It is very common in multi-insert queries for a map/reduce task to have more than one child. A query like the following has two map tasks as parents:

select name, sum(value) from dec group by name union all select name, value from dec order by name

It's possible that a future optimization will be implemented so that a map work is followed by two reduce works, which are then connected to a union work.

Thus, we should treat this as the general case. Tez currently provides a collector for each child operator in the map-side or reduce-side operator tree. We can take Tez as a reference.
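As a rough illustration of the "one collector per child" idea, here is a minimal sketch in plain Java. Every name in it (`RecordSink`, `BufferingSink`, `MultiChildRecordHandler`, the child names) is hypothetical and chosen for illustration only; it is not the Hive or Tez API, and the real record handlers work with operator trees rather than string keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a result collector (not Hive's OutputCollector).
interface RecordSink {
    void collect(String key, String value);
}

// Sink that buffers records in memory so the routing can be inspected.
class BufferingSink implements RecordSink {
    final List<String> records = new ArrayList<>();

    @Override
    public void collect(String key, String value) {
        records.add(key + "=" + value);
    }
}

// Hypothetical handler that keeps one sink per child work instead of a single one.
class MultiChildRecordHandler {
    private final Map<String, RecordSink> sinksByChild = new LinkedHashMap<>();

    void registerChild(String childName, RecordSink sink) {
        sinksByChild.put(childName, sink);
    }

    // Route a record to the sink registered for the named child.
    void emit(String childName, String key, String value) {
        RecordSink sink = sinksByChild.get(childName);
        if (sink == null) {
            throw new IllegalArgumentException("No collector registered for child: " + childName);
        }
        sink.collect(key, value);
    }
}

class MultiChildRecordHandlerDemo {
    public static void main(String[] args) {
        BufferingSink first = new BufferingSink();
        BufferingSink second = new BufferingSink();
        MultiChildRecordHandler handler = new MultiChildRecordHandler();
        handler.registerChild("reducer-1", first);
        handler.registerChild("reducer-2", second);
        handler.emit("reducer-1", "alice", "10");
        handler.emit("reducer-2", "bob", "20");
        System.out.println(first.records + " " + second.records);
    }
}
```

The point of the sketch is only that the handler looks up a collector by child rather than holding a single one; how children are identified in practice is left open here.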

Spark currently doesn't have a transformation that supports multiple output datasets from a single input dataset (~~SPARK-3622~~). This issue is a workaround for that gap.
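One common way to approximate multiple outputs without such a transformation is to materialize the input once and then derive each output with its own pass over the materialized data. The sketch below shows that pattern in plain Java with no Spark dependency; `CachedFanOut`, `computeInput`, and `fanOut` are hypothetical names, and the in-memory list only stands in for a cached dataset. It is an illustration of the general idea, not the implementation attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class CachedFanOut {
    // Counts how often the (notionally expensive) input is computed.
    static int computeCount = 0;

    // Stand-in for evaluating the parent work once.
    static List<Integer> computeInput() {
        computeCount++;
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            rows.add(i);
        }
        return rows;
    }

    // Derive one output per branch by making a separate pass over the cached rows.
    static List<List<Integer>> fanOut(List<Integer> cached, List<Predicate<Integer>> branches) {
        List<List<Integer>> outputs = new ArrayList<>();
        for (Predicate<Integer> branch : branches) {
            outputs.add(cached.stream().filter(branch).collect(Collectors.toList()));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Integer> cached = computeInput(); // computed once, reused by every branch
        List<Predicate<Integer>> branches = new ArrayList<>();
        branches.add(n -> n % 2 == 0);
        branches.add(n -> n > 4);
        System.out.println(fanOut(cached, branches));
        System.out.println("input computed " + computeCount + " time(s)");
    }
}
```

The trade-off in this pattern is that the input is scanned once per output, which is why holding the input in a cache matters.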

This is likely a big change, so it may be broken into subtasks.

With this, we can have a simpler and cleaner multi-insert implementation. This is also the problem observed in ~~HIVE-7731~~ and ~~HIVE-7503~~.

Attachments

HIVE-8118.pdf (112 kB), attached 12/Oct/14 02:33 by Xuefu Zhang

Issue Links

depends upon

HIVE-8274 Refactoring SparkPlan and SparkPlanGeneration [Spark Branch]

Resolved

is depended upon by

HIVE-8533 Enable all q-tests for multi-insertion [Spark Branch]

Resolved

is part of

HIVE-7292 Hive on Spark

Resolved

relates to

HIVE-7731 Incorrect result returned when a map work has multiple downstream reduce works [Spark Branch]

Resolved

HIVE-8457 MapOperator initialization fails when multiple Spark threads is enabled [Spark Branch]

Resolved

Sub-Tasks

1.	Modify SparkWork to split works with multiple child works [Spark Branch]	Resolved	Chao Sun
2.	Modify SparkPlan generation to set toCache flag to SparkTrans where caching is needed [Spark Branch]	Resolved	Unassigned
3.	Clean up code introduced by HIVE-7503 and such [Spark Plan]	Resolved	Chao Sun
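The first sub-task, splitting works that have multiple child works, can be pictured with a small plain-Java sketch. The model below is an assumption made for illustration: works are plain names, the plan is a map from a work to its children, and `WorkGraphSplitter.split` with its `#n` clone suffix is hypothetical. It only rewrites outgoing edges; a real implementation would also have to reconnect each clone to the original work's parents, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WorkGraphSplitter {
    // edges: work name -> ordered child work names (hypothetical model, not Hive's SparkWork).
    // Returns a new edge map in which every work has at most one child.
    static Map<String, List<String>> split(Map<String, List<String>> edges) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            List<String> children = entry.getValue();
            if (children.size() <= 1) {
                // Nothing to split: copy the work as it is.
                result.put(entry.getKey(), new ArrayList<>(children));
                continue;
            }
            // Create one clone of the work per child, each with a single outgoing edge.
            int cloneIndex = 1;
            for (String child : children) {
                List<String> single = new ArrayList<>();
                single.add(child);
                result.put(entry.getKey() + "#" + cloneIndex, single);
                cloneIndex++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("Map1", Arrays.asList("Reducer1", "Reducer2"));
        edges.put("Reducer1", new ArrayList<>());
        edges.put("Reducer2", new ArrayList<>());
        System.out.println(split(edges));
    }
}
```

After such a split, each clone needs only one result collector, at the cost of evaluating the cloned work's input more than once unless that input is cached.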

People

Assignee: Chao Sun

Reporter: Xuefu Zhang

Votes: 0

Watchers: 4

Dates

Created: 16/Sep/14 04:12

Updated: 29/May/15 02:28

Resolved: 24/Oct/14 18:33