Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2153

RENAME table in Impala does not change the "TABLE_NAME" in the "TAB_COL_STATS" table

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Not A Bug
    • Impala 2.2
    • None
    • Clients

    Description

      After executing the rename command in impala-shell, you will lose the table column stats, as the underlining table name stored in the table column stats table in HMS are not updated.

      Please see the following test case:

      -- Create a table, load with some data, initially with no stats
      [10.17.74.162:21000] > CREATE TABLE test (a int);
      Query: create TABLE test (a int)
      
      Fetched 0 row(s) in 0.14s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > INSERT OVERWRITE TABLE test VALUES (1), (2), (3);
      Query: insert OVERWRITE TABLE test VALUES (1), (2), (3)
      Inserted 3 row(s) in 4.84s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW TABLE STATS test;
      Query: show TABLE STATS test
      +-------+--------+------+--------------+--------+-------------------+
      | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
      +-------+--------+------+--------------+--------+-------------------+
      | -1    | 1      | 6B   | NOT CACHED   | TEXT   | false             |
      +-------+--------+------+--------------+--------+-------------------+
      Fetched 1 row(s) in 0.07s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW COLUMN STATS test;
      Query: show COLUMN STATS test
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | a      | INT  | -1               | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      Fetched 1 row(s) in 0.01s
      
      
      
      -- now compute the stats and show stats
      [10.17.74.162:21000] > COMPUTE STATS test;
      Query: compute STATS test
      +-----------------------------------------+
      | summary                                 |
      +-----------------------------------------+
      | Updated 1 partition(s) and 1 column(s). |
      +-----------------------------------------+
      Fetched 1 row(s) in 2.24s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW TABLE STATS test;
      Query: show TABLE STATS test
      +-------+--------+------+--------------+--------+-------------------+
      | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
      +-------+--------+------+--------------+--------+-------------------+
      | 3     | 1      | 6B   | NOT CACHED   | TEXT   | false             |
      +-------+--------+------+--------------+--------+-------------------+
      Fetched 1 row(s) in 0.01s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW COLUMN STATS test;
      Query: show COLUMN STATS test
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | a      | INT  | 3                | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      Fetched 1 row(s) in 0.01s
      
      
      -- Finally rename the table you will see that column stats are lost
      [10.17.74.162:21000] > ALTER TABLE test RENAME TO test_bak;
      Query: alter TABLE test RENAME TO test_bak
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW TABLE STATS test_bak;
      Query: show TABLE STATS test_bak
      +-------+--------+------+--------------+--------+-------------------+
      | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
      +-------+--------+------+--------------+--------+-------------------+
      | 3     | 1      | 6B   | NOT CACHED   | TEXT   | false             |
      +-------+--------+------+--------------+--------+-------------------+
      Fetched 1 row(s) in 5.39s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > SHOW COLUMN STATS test_bak;
      Query: show COLUMN STATS test_bak
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | a      | INT  | -1               | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      Fetched 1 row(s) in 0.02s
      [10.17.74.162:21000] > 
      [10.17.74.162:21000] > COMPUTE STATS test_bak;
      Query: compute STATS test_bak
      +-----------------------------------------+
      | summary                                 |
      +-----------------------------------------+
      | Updated 1 partition(s) and 1 column(s). |
      +-----------------------------------------+
      Fetched 1 row(s) in 0.42s
      
      [10.17.74.162:21000] > SHOW TABLE STATS test_bak;
      Query: show TABLE STATS test_bak
      +-------+--------+------+--------------+--------+-------------------+
      | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
      +-------+--------+------+--------------+--------+-------------------+
      | 3     | 1      | 6B   | NOT CACHED   | TEXT   | false             |
      +-------+--------+------+--------------+--------+-------------------+
      Fetched 1 row(s) in 0.01s
      [10.17.74.162:21000] > show COLUMN STATS test_bak;
      Query: show COLUMN STATS test_bak
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | a      | INT  | 3                | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      Fetched 1 row(s) in 0.01s
      
      

      I also checked the "TAB_COL_STATS" table in HMS DB, see the result:

      -- After initial COMPUTE STATS
      hive=> SELECT "TABLE_NAME", "COLUMN_NAME", "COLUMN_TYPE", "TBL_ID", "NUM_DISTINCTS" FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = 'test' AND "DB_NAME" = 'case_66505';
       TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | NUM_DISTINCTS 
      ------------+-------------+-------------+--------+---------------
       test       | a           | INT         |  34859 |             3
      (1 row)
      
      -- After RENAME, "test" is still there
      hive=> SELECT "TABLE_NAME", "COLUMN_NAME", "COLUMN_TYPE", "TBL_ID", "NUM_DISTINCTS" FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = 'test' AND "DB_NAME" = 'case_66505';
       TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | NUM_DISTINCTS 
      ------------+-------------+-------------+--------+---------------
       test       | a           | INT         |  34859 |             3
      (1 row)
      
      -- After RENAME, can't find "test_bak"
      hive=> SELECT "TABLE_NAME", "COLUMN_NAME", "COLUMN_TYPE", "TBL_ID", "NUM_DISTINCTS" FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = 'test_bak' AND "DB_NAME" = 'case_66505';            
       TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | NUM_DISTINCTS 
      ------------+-------------+-------------+--------+---------------
      (0 rows)
      
      -- After RENAME and COMPUTE STATS on "test_bak"
      hive=> SELECT "TABLE_NAME", "COLUMN_NAME", "COLUMN_TYPE", "TBL_ID", "NUM_DISTINCTS" FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = 'test_bak' AND "DB_NAME" = 'case_66505';
       TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | NUM_DISTINCTS 
      ------------+-------------+-------------+--------+---------------
       test_bak   | a           | INT         |  34859 |             3
      (1 row)
      
      -- In the end you end up with two tables in the stats, one does not exists
      hive=> SELECT "TABLE_NAME", "COLUMN_NAME", "COLUMN_TYPE", "TBL_ID", "NUM_DISTINCTS" FROM "TAB_COL_STATS" WHERE "DB_NAME" = 'case_66505' AND "TABLE_NAME" in ('test', 'test_bak'); 
       TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | NUM_DISTINCTS 
      ------------+-------------+-------------+--------+---------------
       test       | a           | INT         |  34859 |             3
       test_bak   | a           | INT         |  34859 |             3
      (2 rows)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            ericlin Eric Lin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: