Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
Impala 1.3, Impala 1.4, Impala 2.0, Impala 2.1, Impala 2.2
-
None
Description
The Fnv64to32 hash function is pathologically bad when hashing empty strings. This hash function is used to partition data across nodes and can result in all the data being sent to one node.
The code in question is
static uint32_t FnvHash64to32(const void* data, int32_t bytes, uint32_t hash) { uint64_t hash_u64 = hash | ((uint64_t)hash << 32); hash_u64 = FnvHash64(data, bytes, hash_u64); return (hash_u64 >> 32) ^ (hash_u64 & 0xFFFFFFFF); }
For an empty string (bytes == 0), FnvHash64 will return the original seed (hash_u64). This degenerates the folding logic to xoring the seed by itself and always returning 0.
The result of this behavior is that any hash partition of the from: hash (...., empty col) will result in the computed hash being 0. The last column wipes out whatever hash was generated by the prior columns. There is some special logic to handle NULLs (RawValue::HashCombine32) which can probably be used for this case as well.