IMPALA-5412

Scan returns wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Backend

    Description

      A scan against Avro, RCFile or SequenceFile may return wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.

      For example, with the following setup, the query may return fewer rows than expected or incorrect counts.

      // Table contents
      partition_col=1 points to /user/hive/warehouse/shared_dir/000000_0
      partition_col=2 points to /user/hive/warehouse/shared_dir/000000_0
      // Query may return wrong results
      SELECT COUNT(*) FROM t GROUP BY partition_col
      

      In particular, COMPUTE STATS uses the query above to populate the per-partition row counts, so those stored row counts may be incorrect.

      This bug affects only the Avro, RCFile and SequenceFile formats. It does not affect Text, Parquet or non-filesystem tables such as Kudu.

      The problematic code can be found in hdfs-scan-node-base.h:

        /// Scanner specific per file metadata (e.g. header information) and associated lock.
        boost::mutex metadata_lock_;
        std::map<std::string, void*> per_file_metadata_;
      

      The same file path can belong to multiple partitions, so a scanner may pick up the wrong per-file metadata, which includes the partition values.
      Note that the key in this map is the full file path, not just the file name, so this bug is specific to partitions pointing to the same location.
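
The collision described above can be reproduced in isolation. The following is a minimal sketch, not Impala code: `FileMetadata`, `PathKeyedCache` and `PartitionKeyedCache` are hypothetical names, and keying the map by (partition id, file path) is only one possible remedy, assumed here for illustration rather than taken from the actual fix.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the scanner-specific per-file metadata. Per the
// description, the real metadata includes the partition values.
struct FileMetadata {
  int64_t partition_col;
};

// Mirrors the problematic layout: entries are keyed by the full file path only.
class PathKeyedCache {
 public:
  // Returns the entry for 'path', creating it with 'partition_col' only if no
  // entry exists yet. A later caller from another partition gets the old entry.
  const FileMetadata& GetOrCreate(const std::string& path, int64_t partition_col) {
    assert(!path.empty());
    auto it = cache_.find(path);
    if (it == cache_.end()) it = cache_.emplace(path, FileMetadata{partition_col}).first;
    return it->second;
  }

 private:
  std::map<std::string, FileMetadata> cache_;
};

// Assumed remedy: include the partition id in the key so that two partitions
// sharing one file each get their own metadata entry.
class PartitionKeyedCache {
 public:
  const FileMetadata& GetOrCreate(int64_t partition_id, const std::string& path,
                                  int64_t partition_col) {
    assert(!path.empty());
    const auto key = std::make_pair(partition_id, path);
    auto it = cache_.find(key);
    if (it == cache_.end()) it = cache_.emplace(key, FileMetadata{partition_col}).first;
    return it->second;
  }

 private:
  std::map<std::pair<int64_t, std::string>, FileMetadata> cache_;
};
```

With the path-only key, a lookup for partition_col=2 on /user/hive/warehouse/shared_dir/000000_0 returns the entry created for partition_col=1, which matches the wrong per-partition counts in the example. With the composite key, each partition sees its own value.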

          People

            Assignee: gaborkaszab Gabor Kaszab
            Reporter: alex.behm Alexander Behm