[HIVE-4827] Merge a Map-only task to its child task - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.0
Fix Version/s: 0.12.0
Component/s: Query Processor
Labels: None

Release Note:

Before this change was applied to trunk, CommonJoinTaskDispatcher had two methods, mergeMapJoinTaskWithChildMapJoinTask and mergeMapJoinTaskWithMapReduceTask. The first tried to merge a map-only task (for a MapJoin) into its child map-only task. The second tried to merge a map-only task into its child MapReduce task (a task that has a reducer). A flag called "hive.optimize.mapjoin.mapreduce" determined whether mergeMapJoinTaskWithMapReduceTask was called.

This work combines mergeMapJoinTaskWithChildMapJoinTask and mergeMapJoinTaskWithMapReduceTask. As a result, a map-only task is merged into its child task regardless of whether the child is a map-only task or a MapReduce task, and hive.optimize.mapjoin.mapreduce is no longer needed.

To disable merging a map-only task into its child task, a user can set either
set hive.auto.convert.join.noconditionaltask=false;
or
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=0;

Description

When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a Map-only job (MapJoin) to the MapReduce job that follows it. However, this merge only happens when the MapReduce job has a single input. With the Correlation Optimizer (HIVE-2206), the MapReduce job can have multiple inputs (for multiple operation paths). CommonJoinResolver should be improved so that it can merge a Map-only job into the corresponding Map task of the MapReduce job.

Example:

set hive.optimize.correlation=true;
set hive.auto.convert.join=true;
set hive.optimize.mapjoin.mapreduce=true;
SELECT tmp1.key, count(*)
FROM (SELECT x1.key1 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      GROUP BY x1.key1) tmp1
JOIN (SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
      GROUP BY x2.key2) tmp2
ON (tmp1.key = tmp2.key)
GROUP BY tmp1.key;

In this query, the join operations inside tmp1 and tmp2 will be converted to two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and tmp2, the join of tmp1 and tmp2, and the final aggregation will all be executed in the same MapReduce job (on the Reduce side). Because this MapReduce job has two inputs, CommonJoinResolver currently cannot attach the two MapJoins to its Map side.
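For reference, here is a sketch of how the same query might be run once this change is in place. According to the release note, hive.optimize.mapjoin.mapreduce is no longer needed and the merge is controlled by hive.auto.convert.join.noconditionaltask. The threshold value and the use of EXPLAIN to inspect the stage plan are illustrative assumptions, and the tables are the hypothetical ones from the example above.

```sql
-- Sketch only: settings after HIVE-4827, per the release note.
set hive.optimize.correlation=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Illustrative threshold in bytes (an assumption, tune for your small tables).
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- EXPLAIN prints the stage plan without running the query, so the
-- number of Map-only and MapReduce stages can be inspected.
EXPLAIN
SELECT tmp1.key, count(*)
FROM (SELECT x1.key1 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      GROUP BY x1.key1) tmp1
JOIN (SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
      GROUP BY x2.key2) tmp2
ON (tmp1.key = tmp2.key)
GROUP BY tmp1.key;
```

Whether both MapJoins end up on the Map side of a single MapReduce job depends on the small-table sizes relative to the threshold and on the Hive version in use.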

Another example:

SELECT tmp1.key
FROM (SELECT x1.key2 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      UNION ALL
      SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1;

In this case, there will be three Map-only jobs (two for the MapJoins and one for the Union). It would be better to execute this query with a single Map-only job.
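One way to check whether the jobs are merged is to compare the EXPLAIN output with the merge disabled and then enabled. The following is a sketch under assumptions: the size value is illustrative, the tables are the hypothetical ones from the example above, and the stage counts actually reported depend on table sizes and the Hive version.

```sql
set hive.auto.convert.join=true;

-- Plan with merging disabled (first option from the release note).
set hive.auto.convert.join.noconditionaltask=false;
EXPLAIN
SELECT tmp1.key
FROM (SELECT x1.key2 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      UNION ALL
      SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1;

-- Plan with merging enabled; the threshold in bytes is an assumption.
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
EXPLAIN
SELECT tmp1.key
FROM (SELECT x1.key2 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      UNION ALL
      SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1;
```

If the merge applies, the second plan should list fewer stages than the first.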

Attachments

HIVE-4827.8.patch, 31/Jul/13 03:39, 505 kB, Yin Huai
HIVE-4827.7.patch, 31/Jul/13 01:06, 505 kB, Yin Huai
HIVE-4827.6.patch, 30/Jul/13 19:35, 333 kB, Yin Huai
HIVE-4827.5.patch, 27/Jul/13 00:23, 333 kB, Yin Huai
HIVE-4827.4.patch, 26/Jul/13 18:51, 336 kB, Yin Huai
HIVE-4827.3.patch, 25/Jul/13 22:50, 214 kB, Yin Huai
HIVE-4827.2.patch, 22/Jul/13 04:20, 167 kB, Yin Huai
HIVE-4827.1.patch, 21/Jul/13 05:38, 109 kB, Yin Huai

Issue Links

is related to

HIVE-5891 Alias conflict when merging multiple mapjoin tasks into their common child mapred task (Resolved)
HIVE-4927 When we merge two MapJoin MapRedTasks, the TableScanOperator of the second one should be removed (Closed)

relates to

HIVE-2206 add a new optimizer for query correlation discovery and optimization (Closed)
HIVE-3952 merge map-job followed by map-reduce job (Closed)

links to

Review board
