Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
Description
It turns out that the logic in ComputeStatsStmt#analyze() doesn't work well with general partition expressions. A simple repro is as follows:
1) Prepare test data:
create table pp(c int) partitioned by (p1 int, p2 int);
insert into pp partition (p1=10, p2) select 1, 1;
insert into pp partition (p1=10, p2) select 2,2;

2) Generate correct stats:
compute stats pp;
show table stats pp;
Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.02s

3) Reproduce the issue:
compute incremental stats pp partition (p1=10);
show table stats pp;
Query: show table stats pp
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
| 10    | 1  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
| 10    | 2  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
| Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
+-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
Fetched 3 row(s) in 0.01s
The bug is in the child queries generated by the incremental stats query.
SELECT NDV_NO_FINALIZE(c) AS c, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE), COUNT(c), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2

SELECT COUNT(*), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2
Specifically, the problem is in the generated filter predicate, ((p1=10 AND p2=1) AND (p1=10 AND p2=2)). The two partition specs are joined with AND, and since p2 cannot be both 1 and 2, no row can satisfy the predicate. It turns out that ComputeStatsStmt#analyze() is broken due to IMPALA-1654, and we need to rewrite the logic to support general partition expressions based on PartitionSet.
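To make the effect of the predicate concrete, here is a minimal sketch that replays the COUNT(*) child query against an in-memory SQLite table. It is only an illustration, not Impala's code: SQLite has no partitions, so p1 and p2 are ordinary columns, and the helper names `partition_filter` and `row_counts` are hypothetical. Joining the per-partition clauses with OR is an assumption about what a correct filter would look like, not a statement of how the fix was implemented.

```python
import sqlite3

# In-memory stand-in for table pp from the repro. SQLite has no partitions,
# so p1 and p2 are modeled as plain columns (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pp (c INT, p1 INT, p2 INT)")
conn.executemany("INSERT INTO pp VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 2)])

# The two partitions matched by "partition (p1=10)".
partitions = [{"p1": 10, "p2": 1}, {"p1": 10, "p2": 2}]


def partition_filter(parts, joiner):
    """Build one parenthesized clause per partition and join them with `joiner`."""
    clauses = [
        "(" + " AND ".join(f"{col}={val}" for col, val in spec.items()) + ")"
        for spec in parts
    ]
    return "(" + f" {joiner} ".join(clauses) + ")"


def row_counts(joiner):
    """Run the COUNT(*) child query from the ticket with the given joiner."""
    sql = (
        "SELECT COUNT(*), p1, p2 FROM pp "
        f"WHERE {partition_filter(partitions, joiner)} "
        "GROUP BY p1, p2 ORDER BY p1, p2"
    )
    return conn.execute(sql).fetchall()


and_counts = row_counts("AND")  # predicate as generated in the ticket
or_counts = row_counts("OR")    # assumed shape of a correct predicate

print(partition_filter(partitions, "AND"))
print("AND:", and_counts)
print("OR: ", or_counts)
```

With AND the query returns no groups at all, which is consistent with the row counts dropping to 0 in step 3; with OR each partition reports one row.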
Workaround: Don't use general partition expressions; instead use a full partition spec, i.e., run compute incremental stats for one partition at a time.
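For tables with many partitions, the per-partition statements could be generated rather than typed by hand. The following is a rough sketch under stated assumptions: `per_partition_statements` is a hypothetical helper, the partition list is supplied by the caller (for example, copied from the show table stats output), and values are assumed to be integers that need no quoting.

```python
def per_partition_statements(table, partitions):
    """Return one COMPUTE INCREMENTAL STATS statement per full partition spec.

    `partitions` is a list of dicts mapping every partition column to its
    value; values are assumed to be integers (no quoting is applied).
    """
    statements = []
    for spec in partitions:
        full_spec = ", ".join(f"{col}={val}" for col, val in spec.items())
        statements.append(
            f"COMPUTE INCREMENTAL STATS {table} PARTITION ({full_spec});"
        )
    return statements


# The two partitions of table pp from the repro.
statements = per_partition_statements(
    "pp", [{"p1": 10, "p2": 1}, {"p1": 10, "p2": 2}]
)
for stmt in statements:
    print(stmt)
```

Each printed statement names both p1 and p2, so it does not rely on a partial partition spec such as partition (p1=10).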
Issue Links
- is broken by
  - IMPALA-1654 Impala needs to support all operators in drop partitions (<, >, <>, !=, <=, >=) like hive does (Resolved)
- is duplicated by
  - IMPALA-6620 Compute incremental stats for groups of partitions does not update stats correctly (Resolved)