[HIVE-28480] Disable SMB on partition hash generator mismatch across join branches in previous RS - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.1.0, 4.0.1
Component/s: Query Planning
Labels:

Description

SMB conversion replaces the last RS operator in each joining branch, together with the JOIN operator, with a MERGEJOIN. We therefore need to ensure that the RS operators preceding these, in both branches, partition rows using the same hash generator.

The hash code generator depends on whether ReducerTraits.UNIFORM is set: it is either ReduceSinkOperator#computeMurmurHash() or ReduceSinkOperator#computeHashCode(), and the two produce different hash codes for the same value.

Skip the SMB join conversion in such cases.
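
The rule above can be sketched as a small check. This is only an illustrative model in Python, not Hive's planner code: the `ReduceSink` class, the `uniform` flag, and `can_convert_to_smb` are hypothetical names, and the mapping of UNIFORM to the murmur path is taken from the description in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReduceSink:
    """Hypothetical stand-in for the RS that precedes the join-side RS."""
    name: str
    uniform: bool  # models whether ReducerTraits.UNIFORM is set


def hash_generator(rs: ReduceSink) -> str:
    # Per the issue text: UNIFORM selects the murmur path,
    # otherwise the hashCode-based path is used.
    return "computeMurmurHash" if rs.uniform else "computeHashCode"


def can_convert_to_smb(left: ReduceSink, right: ReduceSink) -> bool:
    # SMB is only safe when both branches were partitioned
    # with the same hash generator.
    return hash_generator(left) == hash_generator(right)


distinct_branch = ReduceSink("a", uniform=False)
plain_branch = ReduceSink("b", uniform=True)
print(can_convert_to_smb(distinct_branch, plain_branch))  # False: skip SMB
```

In this model, two branches that both have (or both lack) the UNIFORM trait remain eligible for SMB; only a mismatch disables the conversion.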

Reproduction:

Consider the following query, where the join would be converted to SMB. Auto reducer parallelism is enabled, which ensures more than one reducer task.

CREATE TABLE t_asj_18 (k STRING, v INT);
INSERT INTO t_asj_18 values ('a', 10), ('a', 10);

set hive.auto.convert.join=false;
set hive.tez.auto.reducer.parallelism=true;

EXPLAIN SELECT * FROM (
    SELECT k, COUNT(DISTINCT v), SUM(v)
    FROM t_asj_18 GROUP BY k
) a LEFT JOIN (
    SELECT k, COUNT(v)
    FROM t_asj_18 GROUP BY k
) b ON a.k = b.k;

Expected result (when the SELECT is run without EXPLAIN) is:

a   1   20  a   2

but on the master branch, the result is:

a   1   20  NULL    NULL

Here, for COUNT(DISTINCT), the RS key is (k, v) while the partition key is still k. In this scenario the reducer trait UNIFORM is not set. The hash code for "a" from the second subquery is generated using murmurHash (270516725), while the one from the first subquery is generated using bucketHash (1086686554), so rows with key "a" reach different reducer tasks.
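
To make the arithmetic concrete, the following Python sketch takes the two hash codes quoted above and maps them to reducer indices. The partitioning rule used here (non-negative hash modulo the number of reducers) is an assumption for illustration, and `partition` is a hypothetical helper, not a Hive function.

```python
MURMUR_HASH = 270516725   # hash of "a" in the second subquery (from this issue)
BUCKET_HASH = 1086686554  # hash of "a" in the first subquery (from this issue)


def partition(hash_code: int, num_reducers: int) -> int:
    # Assumed rule: clear the sign bit, then take the remainder
    # by the number of reducer tasks.
    return (hash_code & 0x7FFFFFFF) % num_reducers


for n in (2, 3):
    # Prints the reducer index each branch would pick for key "a".
    print(n, partition(MURMUR_HASH, n), partition(BUCKET_HASH, n))
```

Under this assumed rule the two branches send "a" to different reducers for both 2 and 3 reducer tasks, so the merge join never sees the matching rows together and the right side comes back as NULL.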

Issue Links

links to

GitHub Pull Request #5406

People

Assignee: Himanshu Mishra

Reporter: Himanshu Mishra

Votes: 0

Watchers: 3

Dates

Created: 24/Aug/24 21:23

Updated: 20/Sep/24 09:27

Resolved: 09/Sep/24 08:23