Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
None
Description
The current implementation of Hash Join/Groupby suffers from a serious hash collision problem. In these two operators, we first used the hash exchange operator to assign each key to an NC partition (hash1(key)%P, where P is the number of partitions), and then build a hash table at each NC partition (hash2(key)%N, where N is the hash table size). However, our implementation currently uses the same hash function for both steps (hash1 == hash2). This is simply incorrect and can lead to a lot of hash collisions.
To see this problem, consider what happens to NC partition 0. After hash partitioning, for each key assigned to this partition, we know that hash(key)%P == 0. Unless the greatest common divisor of P and N is 0, there will be a lot hash collisions! For example, suppose P = 16 and N is a multiple of 4. Since hash(key) is a multiple of 16, we know for sure that hash(key)%N can be multiples of 4 as well! This implies that all slots that are multiples of 1,2,3 will always be empty, but all entries will be inserted into slots that are multiples of 4.
To fix this problem, we can simply use a different hash function for hash join/groupby.