[HIVE-14797] reducer number estimating may lead to data skew - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Query Processor
Labels:
- backward-incompatible

Description

HiveKey's hash code is generated by multiplying the running hash by 31, key by key, as implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:

    for (int i = 0; i < bucketFields.length; i++) {
      int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], bucketFieldInspectors[i]);
      hashCode = 31 * hashCode + fieldHash;
    }

The following example will lead to data skew:

I have two tables, tbl1 and tbl2, with the same columns: a int, b string. The values of column 'a' are not skewed in either table, but the values of column 'b' are skewed in both.

When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and tbl1.b=tbl2.b" and the estimated number of reducers is 31, the join will suffer from data skew.

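As we know, the HiveKey's hash code for this join is `hash(a)*31 + hash(b)`. When the number of reducers is 31, the term `hash(a)*31` is a multiple of 31 and contributes nothing to the remainder, so (ignoring integer overflow) the reducer number of each row is `hash(b)%31`. As a result, column 'a' has no effect on how rows are distributed, and the skew in column 'b' makes the job skewed.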
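The effect can be reproduced outside Hive with a small sketch. It is not Hive code: the class and method names are hypothetical, the hash of an int column is assumed to be the value itself, and the reducer is assumed to be chosen as `(hashCode & Integer.MAX_VALUE) % numReducers`. With a fixed `hash(b)` and 1000 different values of 'a', every row lands on the same reducer when there are 31 reducers, while 32 reducers are all used.

```java
import java.util.HashSet;
import java.util.Set;

class ReducerSkewDemo {
    // Same recurrence as the loop in the description:
    // hashCode = 31 * hashCode + fieldHash, applied field by field.
    static int combinedHash(int... fieldHashes) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = 31 * hashCode + fieldHash;
        }
        return hashCode;
    }

    // Assumed partitioning rule: clear the sign bit, then take the remainder.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Number of distinct reducers hit when 'a' varies over 0..999
    // while hash(b) stays fixed (assumes hash(a) == a for an int column).
    static int distinctReducers(int hashB, int numReducers) {
        Set<Integer> used = new HashSet<>();
        for (int a = 0; a < 1000; a++) {
            used.add(reducerFor(combinedHash(a, hashB), numReducers));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int hashB = 12345; // hypothetical hash of one hot value of column 'b'
        System.out.println("31 reducers: " + distinctReducers(hashB, 31)); // prints "31 reducers: 1"
        System.out.println("32 reducers: " + distinctReducers(hashB, 32)); // prints "32 reducers: 32"
    }
}
```

In this sketch, any reducer count that is not a multiple of 31 lets column 'a' spread the rows again, which is why the estimated reducer number matters here.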

Attachments

- HIVE-14797.2.patch, 21/Sep/16 12:22, 3 kB, roncenzhao
- HIVE-14797.3.patch, 22/Sep/16 02:00, 3 kB, roncenzhao
- HIVE-14797.4.patch, 18/Oct/16 02:59, 3 kB, roncenzhao
- HIVE-14797.patch, 20/Sep/16 09:42, 0.5 kB, roncenzhao

People

Assignee: roncenzhao

Reporter: roncenzhao

Votes: 0

Watchers: 5

Dates

Created: 20/Sep/16 08:36

Updated: 17/Nov/22 11:31