[HIVE-2206] add a new optimizer for query correlation discovery and optimization - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.0
Fix Version/s: 0.12.0
Component/s: Query Processor
Labels:
None

Release Note:
This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into a single job.

Description

This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

Hive translates queries in a sentence-by-sentence fashion: for every operation that may need to shuffle data (e.g. join and aggregation operations), Hive generates a separate MapReduce job. However, such operations may be correlated in the ways explained below, in which case they can be executed in a single MR job.

Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
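The three definitions above can be sketched as simple predicates over a toy job model. This is only an illustration, not the optimizer's code: the `MRJob` class, its fields, and the helper names are hypothetical, and the sketch assumes that a "child node" is a job whose output the current job consumes and that partition keys can be compared as plain tuples of column names.

```python
from dataclasses import dataclass, field


@dataclass
class MRJob:
    """Toy model of one MapReduce job (hypothetical, not a Hive class)."""
    name: str
    input_tables: frozenset   # names of the relations the job reads
    partition_key: tuple      # columns used to shuffle the map output
    # Assumption: "child nodes" are the jobs whose output this job consumes.
    children: list = field(default_factory=list)


def has_input_correlation(a, b):
    # IC: the input relation sets are not disjoint.
    return bool(a.input_tables & b.input_tables)


def has_transit_correlation(a, b):
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job, child):
    # JFC: the job uses the same partition key as one of its child nodes.
    return child in job.children and job.partition_key == child.partition_key


# Example: an aggregation and a join over a shared table, feeding a final join.
agg = MRJob("agg", frozenset({"t1"}), ("key",))
join = MRJob("join", frozenset({"t1", "t2"}), ("key",))
other = MRJob("other", frozenset({"t3"}), ("other_key",))
final = MRJob("final", frozenset(), ("key",), children=[agg, join])

print(has_input_correlation(agg, join))       # agg and join both read t1
print(has_transit_correlation(agg, join))     # and both partition on "key"
print(has_job_flow_correlation(final, agg))   # final partitions on "key" too
print(has_input_correlation(agg, other))      # no shared input relation
```

In this toy example, `agg`, `join`, and `final` all shuffle on the same key, which is the kind of situation in which the jobs could be merged.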

The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.

There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
No self join is involved in those correlated MR jobs.
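A minimal sketch of how the three conditions above could be checked on a toy job model follows. Everything here is an assumption made for illustration: the `Job` class, the `kind` labels, the `subquery_tables` field, and modelling a self join as the same table name appearing twice in `base_tables`. It is not the optimizer's actual detection logic.

```python
from dataclasses import dataclass, field

# Assumption: only these job kinds are considered by the optimizer sketch.
REDUCE_SIDE_KINDS = {"join", "groupby"}


@dataclass
class Job:
    """Toy MR job (hypothetical, not a Hive class)."""
    name: str
    kind: str               # "join", "groupby", or anything else
    partition_key: tuple    # columns used to shuffle the map output
    base_tables: list       # original input tables scanned by this job
    subquery_tables: list = field(default_factory=list)  # intermediate inputs
    parents: list = field(default_factory=list)          # parent MR jobs


def correlated_group(job):
    """Condition 1: return the job plus its parents if it has JFC with all of them."""
    if job.kind not in REDUCE_SIDE_KINDS or not job.parents:
        return None
    if any(p.partition_key != job.partition_key for p in job.parents):
        return None
    return [job] + job.parents


def is_eligible(job):
    group = correlated_group(job)
    if group is None:
        return False
    for j in group:
        # Condition 2: no intermediate tables generated by sub-queries.
        if j.subquery_tables:
            return False
        # Condition 3: no self join (same table listed twice in one job).
        if len(j.base_tables) != len(set(j.base_tables)):
            return False
    return True


agg = Job("agg", "groupby", ("k",), ["t1"])
join = Job("join", "join", ("k",), ["t1", "t2"])
top = Job("top", "join", ("k",), [], parents=[agg, join])
print(is_eligible(top))

self_join = Job("self_join", "join", ("k",), ["t1", "t1"])
top_with_self_join = Job("top2", "join", ("k",), [], parents=[self_join])
print(is_eligible(top_with_self_join))
```

The first call reports the query as eligible because all three jobs share the partition key and read only distinct base tables; the second is rejected because one parent job joins `t1` with itself.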

The correlation optimizer is implemented as a logical optimizer, mainly because it only needs to manipulate the query plan tree and can leverage the existing component for generating MR jobs.

The current implementation can serve as a framework for correlation-related optimizations. I think this is better than adding individual optimizers.

Several things can be done in the future to improve this optimizer. Here are three examples.

Support queries that only involve TC;
Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
Optimize queries involving self joins.

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Attachments

File | Date | Size | Uploader
HIVE-2206.1.patch.txt | 18/Sep/11 01:46 | 190 kB | Yin Huai
HIVE-2206.10-r1384442.patch.txt | 14/Sep/12 21:37 | 341 kB | Yin Huai
HIVE-2206.11-r1385084.patch.txt | 15/Sep/12 23:18 | 250 kB | Yin Huai
HIVE-2206.12-r1386996.patch.txt | 18/Sep/12 17:44 | 308 kB | Yin Huai
HIVE-2206.13-r1389072.patch.txt | 24/Sep/12 15:55 | 500 kB | Yin Huai
HIVE-2206.14-r1389704.patch.txt | 26/Sep/12 16:03 | 499 kB | Yin Huai
HIVE-2206.15-r1392491.patch.txt | 02/Oct/12 15:44 | 492 kB | Yin Huai
HIVE-2206.16-r1399936.patch.txt | 19/Oct/12 15:48 | 492 kB | Yin Huai
HIVE-2206.17-r1404933.patch.txt | 03/Nov/12 01:52 | 491 kB | Yin Huai
HIVE-2206.18-r1407720.patch.txt | 12/Nov/12 22:21 | 491 kB | Yin Huai
HIVE-2206.19-r1410581.patch.txt | 19/Nov/12 19:58 | 508 kB | Yin Huai
HIVE-2206.2.patch.txt | 18/Sep/11 23:25 | 190 kB | Yin Huai
HIVE-2206.20-r1434012.patch.txt | 16/Jan/13 19:29 | 512 kB | Yin Huai
HIVE-2206.3.patch.txt | 19/Sep/11 12:57 | 190 kB | Yin Huai
HIVE-2206.4.patch.txt | 19/Sep/11 23:08 | 190 kB | Yin Huai
HIVE-2206.5.patch.txt | 21/Sep/11 12:36 | 255 kB | Yin Huai
HIVE-2206.5-1.patch.txt | 23/Nov/11 20:26 | 209 kB | Yin Huai
HIVE-2206.6.patch.txt | 04/Dec/11 23:57 | 156 kB | Yin Huai
HIVE-2206.7.patch.txt | 05/Dec/11 19:15 | 221 kB | Yin Huai
HIVE-2206.8.r1224646.patch.txt | 29/Dec/11 18:51 | 219 kB | Yin Huai
HIVE-2206.8-r1237253.patch.txt | 29/Jan/12 17:55 | 225 kB | Yin Huai
HIVE-2206.D11097.1.patch | 05/Jun/13 02:27 | 756 kB | Phabricator
HIVE-2206.D11097.10.patch | 21/Jun/13 23:33 | 980 kB | Phabricator
HIVE-2206.D11097.11.patch | 27/Jun/13 05:31 | 7 kB | Phabricator
HIVE-2206.D11097.12.patch | 27/Jun/13 05:37 | 1003 kB | Phabricator
HIVE-2206.D11097.13.patch | 29/Jun/13 00:05 | 1.03 MB | Phabricator
HIVE-2206.D11097.14.patch | 01/Jul/13 22:02 | 1.09 MB | Phabricator
HIVE-2206.D11097.15.patch | 02/Jul/13 19:12 | 1.10 MB | Phabricator
HIVE-2206.D11097.16.patch | 08/Jul/13 23:21 | 1.20 MB | Phabricator
HIVE-2206.D11097.17.patch | 12/Jul/13 20:15 | 1.19 MB | Phabricator
HIVE-2206.D11097.18.patch | 14/Jul/13 22:38 | 1.20 MB | Phabricator
HIVE-2206.D11097.19.patch | 17/Jul/13 22:02 | 1.20 MB | Phabricator
HIVE-2206.D11097.2.patch | 06/Jun/13 18:58 | 586 kB | Phabricator
HIVE-2206.D11097.20.patch | 19/Jul/13 16:46 | 46 kB | Phabricator
HIVE-2206.D11097.21.patch | 27/Aug/13 18:34 | 5 kB | Phabricator
HIVE-2206.D11097.22.patch | 27/Aug/13 18:42 | 5 kB | Phabricator
HIVE-2206.D11097.3.patch | 07/Jun/13 02:16 | 581 kB | Phabricator
HIVE-2206.D11097.4.patch | 07/Jun/13 17:00 | 581 kB | Phabricator
HIVE-2206.D11097.5.patch | 07/Jun/13 17:38 | 582 kB | Phabricator
HIVE-2206.D11097.6.patch | 12/Jun/13 19:15 | 698 kB | Phabricator
HIVE-2206.D11097.7.patch | 13/Jun/13 00:19 | 725 kB | Phabricator
HIVE-2206.D11097.8.patch | 16/Jun/13 04:33 | 868 kB | Phabricator
HIVE-2206.D11097.9.patch | 18/Jun/13 21:05 | 923 kB | Phabricator
HIVE-2206.patch | 18/Jul/13 01:03 | 1.20 MB | Yin Huai
testQueries.2.q | 29/Dec/11 18:58 | 5 kB | Yin Huai
YSmartPatchForHive.patch | 08/Jun/11 06:41 | 251 kB | He Yongqiang

Issue Links

blocks
HIVE-3668 Merge MapReduce jobs which share input tables and the same partitioning keys into a single MapReduce job (Open)
HIVE-3669 Support queries in which input tables of correlated MR jobs involves intermediate tables (Open)
HIVE-3670 Optimize queries involving self join (Resolved)
HIVE-3671 If a query has been optimized by correlation optimizer, join auto convert cannot optimize it (Resolved)

is blocked by
HIVE-4572 ColumnPruner cannot preserve RS key columns corresponding to un-selected join keys in columnExprMap (Closed)

is related to
HIVE-1772 optimize join followed by a groupby (Resolved)
HIVE-3430 group by followed by join with the same key should be optimized (Resolved)
HIVE-4827 Merge a Map-only task to its child task (Closed)
HIVE-7362 Enabling Correlation Optimizer by default. (Open)

is required by
HIVE-3667 Umbrella jira for Correlation Optimizer (Open)

relates to
HIVE-2340 optimize orderby followed by a groupby (Closed)
HIVE-3773 Share input scan by unions across multiple queries (In Progress)
