Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Description
A scan against Avro, RCFile or SequenceFile may return wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.
For example, the following setup may return fewer rows than expected, or have incorrect counts.
// Table contents
partition_col=1 points to /user/hive/warehouse/shared_dir/000000_0
partition_col=2 points to /user/hive/warehouse/shared_dir/000000_0
// Query may return wrong results
SELECT COUNT(*) FROM t GROUP BY partition_col
In particular, COMPUTE STATS uses the query above to populate the per-partition row counts, so those stored row counts may be incorrect.
This bug only affects the Avro, RCFile and SequenceFile formats. It does not affect Text, Parquet or non-filesystem tables like Kudu.
The problematic code can be found in hdfs-scan-node-base.h:
/// Scanner specific per file metadata (e.g. header information) and associated lock.
boost::mutex metadata_lock_;
std::map<std::string, void*> per_file_metadata_;
The same file path could belong to multiple partitions, so a scanner may pick up the wrong per-file metadata, which includes the partition values.
Note that the key in this map is the full file path, not just the file name, so this bug is specific to partitions pointing to the same location.
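The collision can be illustrated with a minimal standalone sketch. This is not Impala code: FileMetadata, ScanWithPathKey, ScanWithCompositeKey and the integer partition ids are hypothetical names made up for illustration, and keying by (partition id, file path) is only one possible way to avoid the collision, not necessarily the fix that was applied.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for scanner-specific per-file metadata that, as described
// above, also carries the partition values.
struct FileMetadata {
  int64_t partition_col;
};

// Same shape as per_file_metadata_: the key is the full file path only.
// Returns the partition_col value a scanner working on `partition_id` would see.
int64_t ScanWithPathKey(int64_t partition_id) {
  const std::string path = "/user/hive/warehouse/shared_dir/000000_0";
  std::map<std::string, FileMetadata> per_file_metadata;
  // Both partitions register metadata under the same key. emplace() does not
  // overwrite an existing entry, so the second partition's metadata is dropped.
  per_file_metadata.emplace(path, FileMetadata{1});
  per_file_metadata.emplace(path, FileMetadata{2});
  (void)partition_id;  // The lookup has no way to take the partition into account.
  return per_file_metadata.at(path).partition_col;
}

// Assumed alternative: key by (partition id, file path) so that each partition
// keeps its own entry even when the path is shared.
int64_t ScanWithCompositeKey(int64_t partition_id) {
  const std::string path = "/user/hive/warehouse/shared_dir/000000_0";
  std::map<std::pair<int64_t, std::string>, FileMetadata> per_file_metadata;
  per_file_metadata.emplace(std::make_pair(int64_t{1}, path), FileMetadata{1});
  per_file_metadata.emplace(std::make_pair(int64_t{2}, path), FileMetadata{2});
  return per_file_metadata.at(std::make_pair(partition_id, path)).partition_col;
}
```

With the path-only key, a scanner for partition 2 gets the metadata registered for partition 1, so all rows of the shared file would be attributed to partition_col=1. With the composite key, each partition sees its own value.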