[HIVE-23438] Missing Rows When Left Outer Join In N-way HybridGraceHashJoin - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.4
Fix Version/s: None
Component/s: SQL, Tez
Labels:
- pull-request-available

Description

Run Test in Patch File

mvn test -Dtest=TestMiniTezCliDriver -Dqfile=hybridgrace_hashjoin_2.q

Manual Reproduce

STEP 1. Create test data(q_test_init_tez.sql)

//create table src1
CREATE TABLE src1 (key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv3.txt" INTO TABLE src1;

//create table src2
CREATE TABLE src2(key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv11.txt" OVERWRITE INTO TABLE src2;

//create table srcpart
CREATE TABLE srcpart (key STRING COMMENT 'default', value STRING COMMENT 'default')
PARTITIONED BY (ds STRING, hr STRING)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="11");

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="12");

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="11");

LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="12");

STEP 2. Run query

set hive.auto.convert.join=true; 
set hive.auto.convert.join.noconditionaltask=true; 
set hive.auto.convert.join.noconditionaltask.size=10000000; 
set hive.cbo.enable=false;
set hive.mapjoin.hybridgrace.hashtable=true;

select *
from
(
select key from src1 group by key
) x
left join src2 z on x.key = z.key
join
(
select key from srcpart y group by key
) y on y.key = x.key;

EXPECTED RESULT**

128	NULL	NULL	128
146	146	1val_1461	146
150	150	1val_1501	150
238	NULL	NULL	238
369	NULL	NULL	369
406	406	1val_4061	406
273	273	1val_2731	273
98	NULL	NULL	98
213	213	1val_2131	213
255	NULL	NULL	255
401	401	1val_4011	401
278	NULL	NULL	278
66	66	11val_6611	66
224	NULL	NULL	224
311	NULL	NULL	311

ACTUAL RESULT

128	NULL	NULL	128
146	146	1val_1461	146
150	150	1val_1501	150
213	213	1val_2131	213
238	NULL	NULL	238
273	273	1val_2731	273
369	NULL	NULL	369
406	406	1val_4061	406
98	NULL	NULL	98
401	401	1val_4011	401
66	66	11val_6611	66

ROOT CAUSE

src1 left join src2, src1 is big table and src2 is small table. Join result between big table row and the corresponding hashtable maybe NO_MATCH state, however, these NO_MATCH rows is needed because LEFT OUTER JOIN.

In addition, these big table rows will not spilled into matchfile related to this hashtable on disk because only SPILL state can use `spillBigTableRow`. Then, these big table rows will be spilled into matchfile in hashtables of table `srcpart`(second small table)

Finally, when reProcessBigTable, big table rows in matchfile are only read from `firstSmallTable`, some datum are missing.

WORKAROUND

configure firstSmallTable in completeInitializationOp and only spill big table row into firstSmallTable when spill matchfile.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HIVE-23438.branch-2.3.patch
14/May/20 03:00
21 kB
Yizhen Fan
HIVE-23438.001.branch-2.3.patch
14/May/20 12:21
21 kB
Yizhen Fan

Issue Links

links to

GitHub Pull Request #1014

Activity

People

Assignee:: Yizhen Fan

Reporter:: Yizhen Fan

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 11/May/20 13:34

Updated:: 21/Jul/20 00:33

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m