IMPALA-4767

Table stats are removed after any ALTER TABLE in Impala

Details

    Description

      According to the documentation at https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html, under the section "Setting Column Stats Manually through ALTER TABLE", column stats can be set manually. This was introduced as part of IMPALA-3369.

      However, setting column stats manually appears to remove the table-level stats.

      To reproduce:
      Create a table in Hive with a single column:

      hive> create table t(c int);
      

      Insert one row of data using Hive:

      hive> insert into t values (1);
      

      Compute table-level stats using Hive:

      hive> analyze table t compute statistics;
      

      Running describe formatted in Hive should show 1 for numRows:

      hive> describe formatted t;
      # col_name            	data_type           	comment             	 	 
      c                   	int                 	                    
      	 	 
      # Detailed Table Information	 	 
      Table Type:         	MANAGED_TABLE       	 
      Table Parameters:	 	                
      	numFiles            	1                   
      	numRows             	1                   
      	rawDataSize         	1                   
      	totalSize           	12                  
      	transient_lastDdlTime	1484319025   
      

      Running show table stats in Impala should show the same value of 1 for #Rows:

      [impala:21000] > show table stats t;
      Query: show table stats t
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | 1     | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://....t |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      

      Now manually set column stats on column 'c':

      [impala:21000] > alter table t set column stats c ('numdvs'='1');
      

      View the column stats and see that '#Distinct Values' is now set to 1:

      [impala:21000] > show column stats t;
      Query: show column stats t
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | c      | INT  | 1                | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      

      However, the table-level stats now appear to be lost. Show table stats in Impala now reports -1 for #Rows:

      [impala:21000] > show table stats t;
      Query: show table stats t
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | -1    | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://...t |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      

      Describe formatted in Hive also reports -1 for numRows (and for rawDataSize):

      hive> describe formatted t;
      OK
      # col_name            	data_type           	comment             	 	 
      c                   	int                 	                    
      	 	 
      # Detailed Table Information	 	 	 
      Table Type:         	MANAGED_TABLE       	 
      Table Parameters:	 	 
      	COLUMN_STATS_ACCURATE	false               
      	numFiles            	1                   
      	numRows             	-1                  
      	rawDataSize         	-1                  
      	totalSize           	12                  
      	transient_lastDdlTime	1484319616          
      

      This causes problems for any application (such as Hive and Impala) that relies on these table-level stats.
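
      Because downstream tools depend on #Rows, it may be useful to detect the lost row count automatically after running an ALTER TABLE. The following is a minimal sketch in Python, assuming the output of show table stats has been captured as text (for example from impala-shell). The helper names parse_table_stats and row_count_missing are hypothetical and not part of Impala or Hive, and the sample output is abbreviated to the first three columns of the output shown above.

```python
from __future__ import annotations


def parse_table_stats(output: str) -> list[dict[str, str]]:
    """Parse an ASCII table as printed by the shell into a list of row dicts.

    Hypothetical helper: assumes the first "| ... |" line is the header
    and every later "| ... |" line is a data row.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in output.splitlines()
        # Skip borders ("+---+") and the "Query: ..." echo line.
        if line.strip().startswith("|")
    ]
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def row_count_missing(output: str) -> bool:
    """Return True if any row reports -1 in the #Rows column."""
    return any(row.get("#Rows") == "-1" for row in parse_table_stats(output))


# Abbreviated copy of the output seen after the ALTER TABLE in this report.
AFTER_ALTER = """\
Query: show table stats t
+-------+--------+------+
| #Rows | #Files | Size |
+-------+--------+------+
| -1    | 1      | 12B  |
+-------+--------+------+
"""

if __name__ == "__main__":
    print(row_count_missing(AFTER_ALTER))  # True
```

      A check like this could run after each manual stats change, so that the workaround below is applied only when #Rows has actually been reset.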

      Workaround:
      Recompute the table-level stats in Hive using:

      analyze table t compute statistics;
      

            People

              alex.behm Alexander Behm
              nbrenwald Nicholas Brenwald