[IMPALA-3976] Handle partition-key values with multiple synonymous string representations created in Hive. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
Fix Version/s: None
Component/s: Catalog
Labels:
- correctness
- incompatibility

Target Version:

Product Backlog

Description

For several SQL statements that can create new partitions, Hive seems to generate partition-key values and the corresponding HDFS directory based on the user's string input rather than the corresponding literal value of the appropriate column type. This leads to a situation where a single logical partition-key value can map to multiple HDFS directories and Hive partitions.

Example in Hive:

CREATE TABLE t (i INT) PARTITIONED BY (p INT);
ALTER TABLE t ADD PARTITION (p=0);
ALTER TABLE t ADD PARTITION (p=00);
ALTER TABLE t ADD PARTITION (p=000);
SHOW PARTITIONS t;
p=0
p=00
p=000

The above statements will result in three different HDFS directories, one for each of the "distinct" partitions.

The same result can be achieved with static partition inserts from Hive, instead of ALTER TABLE ADD PARTITION.

Note that Impala will a canonical representation for any partition-key value based on the underlying LiteralExpr, so a similarly strange metadata state cannot be created from Impala, even if given the same input as in the example above.

A special case of this issue was reported in ~~HIVE-6590~~ and ~~IMPALA-3963~~, but the underlying problem is more general.

Issues in Impala
Impala has difficulties dealing with such ambiguous partitions due to the internal assumption that a single assignment of values to partition keys maps to a single Hive partition with a one corresponding HDFS directory.

As long as the cached partition metadata in Impala is correct, queries will return correct results even with partition filters. Impala effectively coalesces the different partition variants, for example, SELECT * FROM t WHERE p=0 will scan all three directories from the example above.

The following statements are known have problems in Impala if such ambiguous partitions exist:

REFRESH <table> and REFRESH <partition>. After such a statement Impala may duplicate and/or missing partitions, leading to incorrect query results.
ALTER TABLE RECOVER PARTITIONS, same as REFRESH above.
ALTER TABLE <table> DROP PARTITIONS. Impala will only be able to drop the one partition with the the canonical value representation. Other variants of the same partition cannot be dropped.
Any other ALTER TABLE ... PARTITION(). Impala will only modify the one partition with the canonical value representation (if any).
It is safest to assume that all other metadata statements that operate on a single partition are likewise not functioning as intended.

Workarounds

Ensure that partitions created via Hive do not exhibit ambiguity. Stick to a single partition-key value representation, e.g., use p=0 consistently and avoid variants like p=000.
Avoid those statements in Hive that can create the bad metadata. Always use fully dynamic partition inserts and avoid adding partitions via static partition inserts or ALTER TABLE.
Running INVALIDATE METADATA <table> will bring Impala's metadata back into a consistent state (including all partition variants). Queries will return correct results, but some DDL operations may still not fully work (like DROP PARTITION).

Attachments

Issue Links

relates to

HIVE-10307 Support to use number literals in partition column

Closed

HIVE-10362 Support Type check/conversion in dynamic partition column

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Alexander Behm

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 15/Aug/16 03:54

Updated:: 13/Aug/20 21:40

Resolved:: 13/Aug/20 21:40