Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4767

Table stats are removed after any ALTER TABLE in Impala

    Details

      Description

      According to docs "https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html" under section "Setting Column Stats Manually through ALTER TABLE" we can manually set column stats. This was introduced as part of IMPALA-3369

      However, when setting column stats manually, table level stats seem to be removed.

      To reproduce:
      Create a table in hive with a single column.

      hive> create table t(c int);
      

      Insert 1 row of data using hive:

      hive> insert into t values (1);
      

      Compute table level stats using hive:

      hive> analyze table t compute statistics;
      

      Running describe formatted in hive should show 1 row for numRows:

      hive> describe formatted d_level;
      # col_name            	data_type           	comment             	 	 
      c                   	int                 	                    
      	 	 
      # Detailed Table Information	 	 
      Table Type:         	MANAGED_TABLE       	 
      Table Parameters:	 	                
      	numFiles            	1                   
      	numRows             	1                   
      	rawDataSize         	1                   
      	totalSize           	12                  
      	transient_lastDdlTime	1484319025   
      

      Running show table stats in impala should show the same value of 1 for #Rows:

      [impala:21000] > show table stats t;
      Query: show table stats t
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | 1     | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://....t |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      

      Now manually set column stats on column 'c':

      [impala:21000] > alter table t set column stats c ('numdvs'='1');
      

      View the column stats and see that '#Distinct Values' is now set to 1

      [impala:21000] > show column stats t;
      Query: show column stats t
      +--------+------+------------------+--------+----------+----------+
      | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
      +--------+------+------------------+--------+----------+----------+
      | c      | INT  | 1                | -1     | 4        | 4        |
      +--------+------+------------------+--------+----------+----------+
      

      But we now seem to have lost the table level stats. Show table stats in impala now says -1 for #Rows:

      [impala:21000] > show table stats t;
      Query: show table stats t
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      | -1    | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://...t |
      +-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
      

      Describe formatted in hive reports -1 for numRows:

      hive> describe formatted t;
      OK
      # col_name            	data_type           	comment             	 	 
      c                   	int                 	                    
      	 	 
      # Detailed Table Information	 	 	 
      Table Type:         	MANAGED_TABLE       	 
      Table Parameters:	 	 
      	COLUMN_STATS_ACCURATE	false               
      	numFiles            	1                   
      	numRows             	-1                  
      	rawDataSize         	-1                  
      	totalSize           	12                  
      	transient_lastDdlTime	1484319616          
      

      This causes problems for any application (such as hive and impala) which rely on these table level stats.

      Workaround:
      Recompute the table level stats again in hive using:

      analyze table t1 compute statistics;
      

        Issue Links

          Activity

          Hide
          alex.behm Alexander Behm added a comment -

          This fixed a few issues with the original patch: IMPALA-5954

          Show
          alex.behm Alexander Behm added a comment - This fixed a few issues with the original patch: IMPALA-5954
          Hide
          jbapple Jim Apple added a comment -

          This is a bulk comment on all issues with Fix Version 2.8.0 that were resolved on or after 2016-12-09.

          2.8.0 was branched on December 9, with only two changes to master cherry-picked to the 2.8.0 release branch after that:

          https://github.com/apache/incubator-impala/commits/2.8.0

          Issues fixed after December 9 might not be fixed in 2.8.0. If you are the one who marked this issue Resolved, can you check to see if the patch is in 2.8.0 by using the link above? If the patch is not in 2.8.0, can you change the Fix Version to 2.9.0?

          Thank you!

          Show
          jbapple Jim Apple added a comment - This is a bulk comment on all issues with Fix Version 2.8.0 that were resolved on or after 2016-12-09. 2.8.0 was branched on December 9, with only two changes to master cherry-picked to the 2.8.0 release branch after that: https://github.com/apache/incubator-impala/commits/2.8.0 Issues fixed after December 9 might not be fixed in 2.8.0. If you are the one who marked this issue Resolved, can you check to see if the patch is in 2.8.0 by using the link above? If the patch is not in 2.8.0, can you change the Fix Version to 2.9.0? Thank you!
          Hide
          alex.behm Alexander Behm added a comment -

          commit 743873005225c55240de7181eea7bb438a260ff1
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Tue Jan 17 18:34:08 2017 -0800

          IMPALA-4767: Workaround for HIVE-15653 to preserve table stats.

          HIVE-15653 is a Hive Metastore bug that results in ALTER TABLE
          commands wiping the table stats of unpartitioned tables.

          Until the Hive bug is fixed, this patch adds a workaround
          to Impala that forces the Metastore to preserve the table stats.

          Testing: Private core/hdfs run passed.

          Change-Id: Ic191c765f73624bc716badadd7215c8dca9d6b1f
          Reviewed-on: http://gerrit.cloudera.org:8080/5731
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          alex.behm Alexander Behm added a comment - commit 743873005225c55240de7181eea7bb438a260ff1 Author: Alex Behm <alex.behm@cloudera.com> Date: Tue Jan 17 18:34:08 2017 -0800 IMPALA-4767 : Workaround for HIVE-15653 to preserve table stats. HIVE-15653 is a Hive Metastore bug that results in ALTER TABLE commands wiping the table stats of unpartitioned tables. Until the Hive bug is fixed, this patch adds a workaround to Impala that forces the Metastore to preserve the table stats. Testing: Private core/hdfs run passed. Change-Id: Ic191c765f73624bc716badadd7215c8dca9d6b1f Reviewed-on: http://gerrit.cloudera.org:8080/5731 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          alex.behm Alexander Behm added a comment -

          Patch for the workaround in Impala: http://gerrit.cloudera.org:8080/5731

          Show
          alex.behm Alexander Behm added a comment - Patch for the workaround in Impala: http://gerrit.cloudera.org:8080/5731
          Hide
          alex.behm Alexander Behm added a comment -

          Also note that this issue is specific to unpartitioned tables. Partitioned tables work ok.

          Show
          alex.behm Alexander Behm added a comment - Also note that this issue is specific to unpartitioned tables. Partitioned tables work ok.
          Hide
          alex.behm Alexander Behm added a comment -

          Filed a Hive JIRA for this bug: https://issues.apache.org/jira/browse/HIVE-15653

          I have a found a way to workaround the problem in Impala. I'll post a patch shortly.

          Show
          alex.behm Alexander Behm added a comment - Filed a Hive JIRA for this bug: https://issues.apache.org/jira/browse/HIVE-15653 I have a found a way to workaround the problem in Impala. I'll post a patch shortly.
          Hide
          alex.behm Alexander Behm added a comment -

          Mostafa Mokhtar, no not a dupe. I can repro on master and on CDH5.8 CDH5.9 and CDH5.10. In this case the table stats are wiped. IMPALA-4260 was about column stats being wiped.

          Show
          alex.behm Alexander Behm added a comment - Mostafa Mokhtar , no not a dupe. I can repro on master and on CDH5.8 CDH5.9 and CDH5.10. In this case the table stats are wiped. IMPALA-4260 was about column stats being wiped.
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          Alexander Behm
          Is this not a duplicate of IMPALA-4260?

          Show
          mmokhtar Mostafa Mokhtar added a comment - Alexander Behm Is this not a duplicate of IMPALA-4260 ?
          Hide
          nbrenwald Nicholas Brenwald added a comment -

          Sorry, should have hit refresh before posting previous comment. I confirm as you said, teems to impact any alter table statement, not just SET COLUMN STATS.

          Show
          nbrenwald Nicholas Brenwald added a comment - Sorry, should have hit refresh before posting previous comment. I confirm as you said, teems to impact any alter table statement, not just SET COLUMN STATS.
          Hide
          nbrenwald Nicholas Brenwald added a comment -

          We are seeing a similar issue when setting tblproperties. Have raised IMPALA-4776, but its potentially duplicating this one and the same root cause.

          Show
          nbrenwald Nicholas Brenwald added a comment - We are seeing a similar issue when setting tblproperties. Have raised IMPALA-4776 , but its potentially duplicating this one and the same root cause.
          Hide
          alex.behm Alexander Behm added a comment -

          Unfortunately, it turns out that the table stats are wiped by any ALTER TABLE, not just ALTER TABLE SET COLUMN STATS.
          This looks like a bug in the Hive Metastore, It can be reproduced without involving Impala, e.g. only with Hive:

          analyze table t compute statistics;
          
          # Execute a random ALTER TABLE that is unrelated to stats.
          alter table t set tblproperties('custom'='y');
          
          # Table stats are reset to -1
          describe extended t;
          i                   	int                 	                    
          	 	 
          Detailed Table Information	Table(tableName:t, dbName:default, owner:abehm, createTime:1484688606, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:i, type:int, comment:null)], location:hdfs://localhost:20500/test-warehouse/t, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:0, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, last_modified_by=abehm, last_modified_time=1484689489, transient_lastDdlTime=1484689489, COLUMN_STATS_ACCURATE=false, totalSize=2, numRows=-1, rawDataSize=-1, custom=y}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)	
          Time taken: 0.103 seconds, Fetched: 3 row(s)
          

          I'm still investigating what exactly the problem is and whether Impala can implement a workaround. It looks suspicious that "GENERATED_BY_STATS_TASK" entry is missing in the table properties map.

          Show
          alex.behm Alexander Behm added a comment - Unfortunately, it turns out that the table stats are wiped by any ALTER TABLE, not just ALTER TABLE SET COLUMN STATS. This looks like a bug in the Hive Metastore, It can be reproduced without involving Impala, e.g. only with Hive: analyze table t compute statistics; # Execute a random ALTER TABLE that is unrelated to stats. alter table t set tblproperties('custom'='y'); # Table stats are reset to -1 describe extended t; i int Detailed Table Information Table(tableName:t, dbName: default , owner:abehm, createTime:1484688606, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:i, type: int , comment: null )], location:hdfs: //localhost:20500/test-warehouse/t, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed: false , numBuckets:0, serdeInfo:SerDeInfo(name: null , serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories: false ), partitionKeys:[], parameters:{numFiles=1, last_modified_by=abehm, last_modified_time=1484689489, transient_lastDdlTime=1484689489, COLUMN_STATS_ACCURATE= false , totalSize=2, numRows=-1, rawDataSize=-1, custom=y}, viewOriginalText: null , viewExpandedText: null , tableType:MANAGED_TABLE) Time taken: 0.103 seconds, Fetched: 3 row(s) I'm still investigating what exactly the problem is and whether Impala can implement a workaround. It looks suspicious that "GENERATED_BY_STATS_TASK" entry is missing in the table properties map.
          Hide
          alex.behm Alexander Behm added a comment -

          I confirmed that this also reproduces on master.

          Show
          alex.behm Alexander Behm added a comment - I confirmed that this also reproduces on master.
          Hide
          alex.behm Alexander Behm added a comment -

          Thanks for filing this report!

          Show
          alex.behm Alexander Behm added a comment - Thanks for filing this report!

            People

            • Assignee:
              alex.behm Alexander Behm
              Reporter:
              nbrenwald Nicholas Brenwald
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development