Details
-
Improvement
-
Status: Patch Available
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
HiveKey's hash code is generated by multipling by 31 key by key which is implemented in method `ObjectInspectorUtils.getBucketHashCode()`:
for (int i = 0; i < bucketFields.length; i++)
The follow example will lead to data skew:
I hava two table called tbl1 and tbl2 and they have the same column: a int, b string. The values of column 'a' in both two tables are not skew, but values of column 'b' in both two tables are skew.
When my sql is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and tbl1.b=tbl2.b" and the estimated reducer number is 31, it will lead to data skew.
As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`. When reducer number is 31 the reducer No. of each row is `hash(b)%31`. In the result, the job will be skew.