Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 4.3.0
-
None
-
ghx-label-3
Description
DictEncoder currently uses this hash function for TimestampValue:
template<typename T> inline uint32_t DictEncoder<T>::Hash(const T& value) const { return HashUtil::Hash(&value, sizeof(value), 0); }
TimestampValue has some padding, and nothing ensures that the padding is cleared. This means that identical TimestampValue objects can hash to different values.
This came up when fixing a Clang-Tidy performance check. This line in dict-test.cc changed from iterating over values to iterating over const references.
DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, &track_encoder); encoder.UsedbyTest(); <<<<<< for (InternalType i: values) encoder.Put(i); ===== for (const InternalType& i: values) encoder.Put(i); >>>>> bytes_alloc = encoder.DictByteSize(); EXPECT_EQ(track_encoder.consumption(), bytes_alloc); EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------
The test became flaky, with the encoder.num_entries() being larger than the values_set.size() for TimestampValue. This happened because the hash values didn't match even for identical entries and the dictionary would have multiple copies of the same value. When iterating over a plain non-reference TimestampValue, each TimestampValue is being copied to a temporary value. Maybe in this circumstance the padding stays the same between iterations.
It's possible this would come up when writing Parquet data files.
One fix would be to use TimestampValue's Hash function, which ignores the padding:
template<> inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& value) const { return value.Hash(); }
Attachments
Issue Links
- relates to
-
IMPALA-12390 Enable performance related clang-tidy checks
- Resolved