Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Description
SMB join conversion replaces the last RS (ReduceSink) operator in each joining branch, together with the JOIN operator, with a MERGEJOIN operator. We therefore need to ensure that the RS operators preceding those replaced RS operators, in both branches, partition rows using the same hash generator.
The hash code generator differs depending on ReducerTraits.UNIFORM, i.e. either ReduceSinkOperator#computeMurmurHash() or ReduceSinkOperator#computeHashCode() is used, and the two produce different hash codes for the same value.
Skip the SMB join conversion in such cases.
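The check described above can be sketched roughly as follows. This is only an illustration, not the actual patch: the `Trait` enum and the `usesSameHashGenerator` and `canConvertToSmb` helpers are hypothetical names, and the only rule taken from the description is that both branches must agree on whether UNIFORM is set.

```java
import java.util.EnumSet;

final class SmbHashCheckSketch {
    // Hypothetical stand-in for the reducer traits carried by an RS operator.
    enum Trait { UNIFORM, AUTOPARALLEL, FIXED }

    // Both branches use the same hash generator only if they agree on UNIFORM.
    static boolean usesSameHashGenerator(EnumSet<Trait> left, EnumSet<Trait> right) {
        return left.contains(Trait.UNIFORM) == right.contains(Trait.UNIFORM);
    }

    // Assumed guard: skip the SMB conversion when the hash generators differ.
    static boolean canConvertToSmb(EnumSet<Trait> left, EnumSet<Trait> right) {
        return usesSameHashGenerator(left, right);
    }

    public static void main(String[] args) {
        EnumSet<Trait> distinctBranch = EnumSet.of(Trait.AUTOPARALLEL);
        EnumSet<Trait> plainBranch = EnumSet.of(Trait.AUTOPARALLEL, Trait.UNIFORM);
        // Traits disagree on UNIFORM, so the conversion should be skipped.
        System.out.println("convert to SMB: " + canConvertToSmb(distinctBranch, plainBranch));
    }
}
```

In this sketch the conversion is allowed when neither branch is UNIFORM as well as when both are, since in either case the same generator would be used on both sides.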
Steps to reproduce:
Consider the following query, where the join would get converted to an SMB join. Auto reducer parallelism is enabled, which ensures more than one reducer task.
CREATE TABLE t_asj_18 (k STRING, v INT);
INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
set hive.auto.convert.join=false;
set hive.tez.auto.reducer.parallelism=true;
EXPLAIN SELECT * FROM (
  SELECT k, COUNT(DISTINCT v), SUM(v) FROM t_asj_18 GROUP BY k
) a LEFT JOIN (
  SELECT k, COUNT(v) FROM t_asj_18 GROUP BY k
) b ON a.k = b.k;
The expected result is:
a 1 20 a 2
but on the master branch the query returns:
a 1 20 NULL NULL
Here, for COUNT(DISTINCT v), the RS key is (k, v) while the partition column is still k. In this scenario the reducer trait UNIFORM is not set. The hash code for "a" from the 2nd subquery is generated using murmurHash (270516725), while the one from the 1st subquery is generated using bucketHash (1086686554), so rows with key "a" reach different reducer tasks.
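To make the routing problem concrete, the sketch below plugs the two hash codes quoted above into a simple partitioning formula. The formula (mask the sign bit, then take the remainder by the reducer count) and the `reducerFor` helper are assumptions for illustration only and may not match how Hive on Tez actually assigns rows to reducers. Only the two hash values come from this report.

```java
final class ReducerRoutingSketch {
    // Assumed routing rule: non-negative hash modulo the number of reducers.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int murmurHash = 270516725;   // hash of "a" in the 2nd subquery (from the report)
        int bucketHash = 1086686554;  // hash of "a" in the 1st subquery (from the report)
        int numReducers = 2;          // assumed reducer count, more than one

        // Same key, different hash codes, hence different reducers under this rule.
        System.out.println("murmurHash -> reducer " + reducerFor(murmurHash, numReducers));
        System.out.println("bucketHash -> reducer " + reducerFor(bucketHash, numReducers));
    }
}
```

Under this assumed rule with two reducers, the murmurHash value lands on reducer 1 and the bucketHash value on reducer 0, so the merge join never sees both sides of key "a" in the same task. That is consistent with the NULL columns in the observed output.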