IMPALA-5615

Compute Incremental stats is broken for general partition expressions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Frontend
    • Labels:

      Description

      It turns out that the logic in ComputeStatsStmt#analyze() does not handle general partition expressions correctly. A simple repro is as follows:

      1) Prepare test data:
      
      create table pp(c int) partitioned by (p1 int, p2 int);
      insert into pp partition (p1=10, p2) select 1, 1;
      insert into pp partition (p1=10, p2) select 2, 2;
      
      2) Generate correct stats:
      compute stats pp;
      show table stats pp;
      
      Query: show table stats pp
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      | p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      | 10    | 1  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
      | 10    | 2  | 1     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
      | Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      Fetched 3 row(s) in 0.02s
      
      3) Reproduce the issue:
      compute incremental stats pp partition (p1=10);
      show table stats pp;
      
      Query: show table stats pp
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      | p1    | p2 | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                            |
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      | 10    | 1  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=1 |
      | 10    | 2  | 0     | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | true              | hdfs://localhost:20500/test-warehouse/pp/p1=10/p2=2 |
      | Total |    | 0     | 2      | 4B   | 0B           |                   |        |                   |                                                     |
      +-------+----+-------+--------+------+--------------+-------------------+--------+-------------------+-----------------------------------------------------+
      Fetched 3 row(s) in 0.01s
      

      The bug is in the child queries generated by the incremental stats query.

      SELECT NDV_NO_FINALIZE(c) AS c, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE), COUNT(c), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2	
      
      SELECT COUNT(*), p1, p2 FROM pp WHERE ((p1=10 AND p2=1) AND (p1=10 AND p2=2)) GROUP BY p1, p2
      

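      Specifically, the problem is the generated filter predicate, ((p1=10 AND p2=1) AND (p1=10 AND p2=2)). The per-partition conditions are combined with AND, so the predicate requires p2=1 and p2=2 at the same time and can never be true. The child queries therefore match no rows, which is consistent with #Rows dropping to 0 in the output above. ComputeStatsStmt#analyze() is broken due to IMPALA-1654, and the logic needs to be rewritten on top of PartitionSet to support general partition expressions.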
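
The effect of that predicate can be checked outside Impala. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Impala, with the partition columns modeled as ordinary columns. The OR form of the predicate is an assumption about what a correct filter would look like; the report does not spell out the fixed query. The helper name count_per_partition is made up for illustration.

```python
import sqlite3

# In-memory stand-in for the partitioned table pp from the repro.
# Partition columns p1 and p2 are modeled as plain columns (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pp (c INT, p1 INT, p2 INT)")
conn.executemany("INSERT INTO pp VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 2)])


def count_per_partition(predicate):
    """Run the row-count child query with the given WHERE predicate."""
    sql = (
        "SELECT COUNT(*), p1, p2 FROM pp "
        f"WHERE {predicate} GROUP BY p1, p2 ORDER BY p1, p2"
    )
    return conn.execute(sql).fetchall()


# Predicate from the bug report: per-partition conditions joined with AND.
and_rows = count_per_partition("((p1=10 AND p2=1) AND (p1=10 AND p2=2))")

# Assumed intended form: per-partition conditions joined with OR.
or_rows = count_per_partition("((p1=10 AND p2=1) OR (p1=10 AND p2=2))")

print(and_rows)  # [] because no row can have p2=1 and p2=2 at once
print(or_rows)   # [(1, 10, 1), (1, 10, 2)]
```

With the AND form no groups are produced at all, so there is no row count to record for either partition.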

      Workaround: Avoid general partition expressions and use a full partition spec instead, i.e., run compute incremental stats for one partition at a time.
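
For tables with many partitions, the one-partition-at-a-time workaround can be scripted. The sketch below only builds the statement text; per_partition_statements is a hypothetical helper, not part of Impala, and it assumes integer partition values that need no quoting. The list of partitions is supplied by hand here and would in practice come from something like the show table stats output.

```python
def per_partition_statements(table, partitions):
    """Build one COMPUTE INCREMENTAL STATS statement per full partition spec.

    partitions is a list of dicts mapping each partition column to its value.
    Values are assumed to be integers, so no quoting is applied.
    """
    statements = []
    for spec in partitions:
        cols = ", ".join(f"{col}={val}" for col, val in spec.items())
        statements.append(
            f"compute incremental stats {table} partition ({cols});"
        )
    return statements


# The two partitions from the repro, each given as a full partition spec.
statements = per_partition_statements(
    "pp", [{"p1": 10, "p2": 1}, {"p1": 10, "p2": 2}]
)
for stmt in statements:
    print(stmt)
```

Each generated statement names every partition column, so it does not go through the general partition expression path described above.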

              People

              • Assignee: bharath v (bharathv)
              • Reporter: bharath v (bharathv)