IMPALA-6632

Document compatibility of table and column stats between Impala and Hive


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Component/s: Docs

    Description

      Questions about the compatibility of table and column stats between Hive and Impala come up quite often, so the topic is worth documenting explicitly.

      Quoting myself from a recent discussion thread to get the docs effort started:

      Commonalities:

      • Hive and Impala both store row counts at the table level and at the partition level. Hive also computes and stores additional stats, such as file counts, which Impala does not need or use.

      Differences:

      • Impala computes and stores column-level stats like the number of distinct values (NDV) only at the table level, and not at the partition level.
      • Hive computes and stores column-level stats at the partition level. Impala does not follow this approach because per-partition NDVs cannot be sensibly combined for queries that access multiple partitions: the same value can appear in several partitions, so the per-partition numbers cannot simply be added up. In short, the column stats for partitioned tables are not compatible between Impala and Hive (because, in my opinion, Hive's approach does not make sense).
      • Impala uses a more modern and tuned algorithm (HyperLogLog++) for estimating the number of distinct values, so its estimates tend to be more accurate than Hive's. Your mileage may vary.
      • For unpartitioned tables, the Hive and Impala column stats are compatible.
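
To illustrate why per-partition NDVs are hard to combine, here is a minimal Python sketch. The partition names and column values are made up for illustration, and the NDVs are computed exactly rather than estimated, so this is not how either Impala or Hive computes stats. It only shows that row counts add up across partitions, while per-partition NDVs give no more than a lower and an upper bound on the NDV of the combined data.

```python
# Hypothetical example data: the values of one column, split by partition.
partitions = {
    "day=2018-03-01": ["us", "de", "fr", "us"],
    "day=2018-03-02": ["us", "de", "jp"],
    "day=2018-03-03": ["us", "us", "de"],
}

# Row counts are additive: the table-level count is the sum over partitions.
row_counts = {name: len(values) for name, values in partitions.items()}
table_row_count = sum(row_counts.values())

# Exact per-partition NDVs (a real system would store estimates instead).
partition_ndvs = {name: len(set(values)) for name, values in partitions.items()}

# The true NDV over all partitions needs the values themselves,
# because the same value can occur in more than one partition.
true_ndv = len(set().union(*partitions.values()))

# From the per-partition NDVs alone, only bounds can be derived.
lower_bound = max(partition_ndvs.values())  # all partitions overlap fully
upper_bound = sum(partition_ndvs.values())  # no value is shared at all

print("table row count:", table_row_count)
print("per-partition NDVs:", partition_ndvs)
print("true NDV:", true_ndv)
print("bounds from per-partition NDVs:", lower_bound, "to", upper_bound)
```

In this toy data set the per-partition NDVs only narrow the answer down to a range, and the range widens as a query touches more partitions. A single table-level NDV avoids that ambiguity.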

      For partitioned tables, the table-level column stats that Impala writes to the Metastore are stored just as they are for unpartitioned tables. These statistics are "available" to Hive in the sense that the standard retrieval APIs will work as expected. My understanding is that for partitioned tables, Hive does not use the table-level column stats, but instead expects partition-level column stats. As I've said before, these partition-level column stats do not make sense because it is not possible to sensibly combine them across multiple partitions.

          People

            arodoni Alexandra Rodoni
            alex.behm Alexander Behm