Description
A Hive query may scan the same table multiple times, for example in a self-join, a self-union, or when several parts of the query share the same subquery; TPC-DS Q39 is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and serves subsequent reads from memory directly, which avoids recomputing that RDD (and all of its dependencies) at the cost of additional memory usage. By analyzing the query context, we should be able to determine which parts of the query can be shared, so that the generated Spark job can reuse the cached RDD. A small sketch of this reuse pattern is shown below.
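As a minimal sketch of the Spark caching behavior described above (not part of this patch, with a hypothetical input path and schema): an "expensive scan" RDD is persisted once and then consumed by two downstream branches, as would happen for a self-union over a single table scan.

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CachedScanSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cached-scan-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Simulate an expensive "table scan": read the file and parse each line.
        // The path and tab-delimited layout are hypothetical.
        JavaRDD<String[]> scanned = sc.textFile("/tmp/sales.txt")
                                      .map(line -> line.split("\t"));

        // Persist the scan result so both downstream consumers reuse the cached
        // data instead of re-reading and re-parsing the input.
        scanned.persist(StorageLevel.MEMORY_ONLY());

        // Consumer 1: count all rows (one branch of a self-union).
        long total = scanned.count();

        // Consumer 2: count rows with more than 3 columns (another branch).
        long wide = scanned.filter(row -> row.length > 3).count();

        System.out.println("total=" + total + ", wide=" + wide);
        sc.stop();
    }
}
{code}

The persistence level used here (MEMORY_ONLY) is only an illustration; choosing an appropriate level is tracked separately in HIVE-10850.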
Attachments
Issue Links
- is related to: HIVE-10850 Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch] (Open)
- relates to: HIVE-10844 Combine equivalent Works for HoS[Spark Branch] (Closed)