Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3976

Handle partition-key values with multiple synonymous string representations created in Hive.

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: None
    • Component/s: Catalog

      Description

      For several SQL statements that can create new partitions, Hive seems to generate partition-key values and the corresponding HDFS directory based on the user's string input rather than the corresponding literal value of the appropriate column type. This leads to a situation where a single logical partition-key value can map to multiple HDFS directories and Hive partitions.

      Example in Hive:

      CREATE TABLE t (i INT) PARTITIONED BY (p INT);
      ALTER TABLE t ADD PARTITION (p=0);
      ALTER TABLE t ADD PARTITION (p=00);
      ALTER TABLE t ADD PARTITION (p=000);
      SHOW PARTITIONS t;
      p=0
      p=00
      p=000
      

      The above statements will result in three different HDFS directories, one for each of the "distinct" partitions.

      The same result can be achieved with static partition inserts from Hive, instead of ALTER TABLE ADD PARTITION.

      Note that Impala will a canonical representation for any partition-key value based on the underlying LiteralExpr, so a similarly strange metadata state cannot be created from Impala, even if given the same input as in the example above.

      A special case of this issue was reported in HIVE-6590 and IMPALA-3963, but the underlying problem is more general.

      Issues in Impala
      Impala has difficulties dealing with such ambiguous partitions due to the internal assumption that a single assignment of values to partition keys maps to a single Hive partition with a one corresponding HDFS directory.

      As long as the cached partition metadata in Impala is correct, queries will return correct results even with partition filters. Impala effectively coalesces the different partition variants, for example, SELECT * FROM t WHERE p=0 will scan all three directories from the example above.

      The following statements are known have problems in Impala if such ambiguous partitions exist:

      • REFRESH <table> and REFRESH <partition>. After such a statement Impala may duplicate and/or missing partitions, leading to incorrect query results.
      • ALTER TABLE RECOVER PARTITIONS, same as REFRESH above.
      • ALTER TABLE <table> DROP PARTITIONS. Impala will only be able to drop the one partition with the the canonical value representation. Other variants of the same partition cannot be dropped.
      • Any other ALTER TABLE ... PARTITION(). Impala will only modify the one partition with the canonical value representation (if any).
      • It is safest to assume that all other metadata statements that operate on a single partition are likewise not functioning as intended.

      Workarounds

      • Ensure that partitions created via Hive do not exhibit ambiguity. Stick to a single partition-key value representation, e.g., use p=0 consistently and avoid variants like p=000.
      • Avoid those statements in Hive that can create the bad metadata. Always use fully dynamic partition inserts and avoid adding partitions via static partition inserts or ALTER TABLE.
      • Running INVALIDATE METADATA <table> will bring Impala's metadata back into a consistent state (including all partition variants). Queries will return correct results, but some DDL operations may still not fully work (like DROP PARTITION).

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              alex.behm Alexander Behm
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated: