  1. Spark
  2. SPARK-16026 Cost-based Optimizer Framework
  3. SPARK-21031

Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are persisted through `CatalogStatistics` as well. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.

      For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put them into `CatalogStatistics`. Then an "ALTER TABLE" command stores the stats in `CatalogStatistics` into the metastore as Spark's stats, because we cannot tell whether they came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so writing them is an unexpected side effect of "ALTER TABLE".

      Secondly, once Spark's stats are in the metastore, inserting new data does not correct them: although Hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect Spark's stats (which should not exist here) over Hive's stats. For example:

      spark-sql> create table xx(i string, j string);
      spark-sql> insert into table xx select 'a', 'b';
      
      spark-sql> desc formatted xx;
      # col_name	data_type	comment
      i	string	NULL
      j	string	NULL
      # Detailed Table Information		
      Database	default	
      Table	xx	
      Owner	wzh	
      Created	Thu Jun 08 18:30:46 PDT 2017	
      Last Access	Wed Dec 31 16:00:00 PST 1969	
      Type	MANAGED	
      Provider	hive	
      Properties	[serialization.format=1]	
      Statistics	4 bytes	
      Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
      Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
      InputFormat	org.apache.hadoop.mapred.TextInputFormat	
      OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
      Partition Provider	Catalog	
      Time taken: 0.089 seconds, Fetched 19 row(s)
      
      spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
      Time taken: 0.187 seconds
      
      spark-sql> insert into table xx select 'c', 'd';
      Time taken: 0.583 seconds
      
      spark-sql> desc formatted xx;
      # col_name	data_type	comment
      i	string	NULL
      j	string	NULL
      # Detailed Table Information		
      Database	default	
      Table	xx	
      Owner	wzh	
      Created	Thu Jun 08 18:30:46 PDT 2017	
      Last Access	Wed Dec 31 16:00:00 PST 1969	
      Type	MANAGED	
      Provider	hive	
      Properties	[serialization.format=1]	
      Statistics	4 bytes	(-- This should be 8 bytes)
      Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
      Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
      InputFormat	org.apache.hadoop.mapred.TextInputFormat	
      OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
      Partition Provider	Catalog	
      Time taken: 0.077 seconds, Fetched 19 row(s)
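
The sequence in this transcript can be replayed with a small toy model. This is only a sketch of the behaviour described in this issue, not Spark's code: `ToyMetastore`, `read_size`, `alter_table_buggy`, `alter_table_keep_stats`, `alter_table_stats` and the key `spark.stats.totalSize` are hypothetical names made up for illustration. Only "totalSize" and the 4-byte inserts come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# "totalSize" is the Hive key quoted in the description.
# SPARK_SIZE_KEY is a made-up stand-in for wherever Spark keeps its own stats.
HIVE_SIZE_KEY = "totalSize"
SPARK_SIZE_KEY = "spark.stats.totalSize"


@dataclass
class ToyMetastore:
    """A single table's property map, standing in for the metastore."""
    props: Dict[str, str] = field(default_factory=dict)

    def hive_insert(self, nbytes: int) -> None:
        # Hive updates "totalSize" whenever data is written.
        current = int(self.props.get(HIVE_SIZE_KEY, "0"))
        self.props[HIVE_SIZE_KEY] = str(current + nbytes)


def read_size(ms: ToyMetastore) -> Optional[int]:
    """Read the table size; Spark's stats take precedence over Hive's."""
    for key in (SPARK_SIZE_KEY, HIVE_SIZE_KEY):
        if key in ms.props:
            return int(ms.props[key])
    return None


def alter_table_buggy(ms: ToyMetastore, new_props: Dict[str, str]) -> None:
    # Reported behaviour: the size that was read (possibly from Hive)
    # is written back as if it were Spark's own stats.
    size = read_size(ms)
    ms.props.update(new_props)
    if size is not None:
        ms.props[SPARK_SIZE_KEY] = str(size)


def alter_table_keep_stats(ms: ToyMetastore, new_props: Dict[str, str]) -> None:
    # Behaviour asked for in the title: leave existing stats untouched.
    ms.props.update(new_props)


def alter_table_stats(ms: ToyMetastore, size: Optional[int]) -> None:
    # Separate entry point for storing Spark's stats, meant to be
    # reached only from ANALYZE in this model.
    if size is None:
        ms.props.pop(SPARK_SIZE_KEY, None)
    else:
        ms.props[SPARK_SIZE_KEY] = str(size)


def replay(alter_table: Callable[[ToyMetastore, Dict[str, str]], None]) -> Optional[int]:
    """Replay the transcript: insert, ALTER TABLE, insert, then read the size."""
    ms = ToyMetastore()
    ms.hive_insert(4)                 # insert 'a', 'b'
    alter_table(ms, {"prop1": "yy"})  # alter table ... set tblproperties
    ms.hive_insert(4)                 # insert 'c', 'd'
    return read_size(ms)


print(replay(alter_table_buggy))       # stale size, as in the second DESC output
print(replay(alter_table_keep_stats))  # size tracks Hive's "totalSize"
```

In this model the stale value appears only because the buggy `alter_table` copies a Hive-derived size under the Spark key; once writing Spark's stats is confined to a dedicated call, the precedence rule in `read_size` no longer hides Hive's updates.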
      

          People

            ZenWzh Zhenhua Wang
            ZenWzh Zhenhua Wang