[SPARK-21031] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also persisted through `CatalogStatistics`. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.

For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put them into `CatalogStatistics`. Then, when an "ALTER TABLE" command runs, we store the stats held in `CatalogStatistics` into the metastore as Spark's stats, because we cannot tell whether they came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so this is an unexpected side effect of "ALTER TABLE".
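The propagation described above can be sketched with a toy model. This is not Spark's code: the `read_table_stats` and `alter_table` helpers, the dict standing in for the metastore, and the property key used for Spark's stats are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HIVE_SIZE_KEY = "totalSize"
# Assumed key for Spark-owned stats; illustrative only.
SPARK_SIZE_KEY = "spark.sql.statistics.totalSize"


@dataclass
class CatalogStatistics:
    size_in_bytes: int


def read_table_stats(props: Dict[str, str]) -> Optional[CatalogStatistics]:
    """Hypothetical read path: Hive's totalSize lands in CatalogStatistics."""
    if HIVE_SIZE_KEY in props:
        return CatalogStatistics(int(props[HIVE_SIZE_KEY]))
    return None


def alter_table(props: Dict[str, str], new_props: Dict[str, str],
                stats: Optional[CatalogStatistics]) -> Dict[str, str]:
    """Hypothetical ALTER TABLE: persists whatever stats the table object
    carries as Spark's stats, with no way to tell where they came from."""
    updated = dict(props)
    updated.update(new_props)
    if stats is not None:
        updated[SPARK_SIZE_KEY] = str(stats.size_in_bytes)
    return updated


# Hive wrote totalSize after the first insert; nobody ran ANALYZE.
metastore = {HIVE_SIZE_KEY: "4"}
stats = read_table_stats(metastore)
metastore = alter_table(metastore, {"prop1": "yy"}, stats)

# Hive's number has now been stored under the Spark-owned key.
leaked = metastore[SPARK_SIZE_KEY]
```

In this model, a separate write path for stats (which is what the issue title's `alterTableStats` suggests) would let the plain `alter_table` path leave the stored stats untouched.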

Secondly, once Spark's stats are in the metastore, inserting new data exposes a further problem: although Hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we prefer Spark's stats (which should not exist here) over Hive's stats.
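The precedence problem can be sketched the same way. Again this is a toy model under stated assumptions, not Spark's implementation: the `size_in_bytes` helper and the Spark property key are hypothetical, and the dict stands in for the table properties kept in the metastore.

```python
from typing import Dict, Optional

HIVE_SIZE_KEY = "totalSize"
# Assumed key for Spark-owned stats; illustrative only.
SPARK_SIZE_KEY = "spark.sql.statistics.totalSize"


def size_in_bytes(props: Dict[str, str]) -> Optional[int]:
    """Hypothetical read path: Spark's stats win over Hive's when both exist."""
    if SPARK_SIZE_KEY in props:
        return int(props[SPARK_SIZE_KEY])
    if HIVE_SIZE_KEY in props:
        return int(props[HIVE_SIZE_KEY])
    return None


# State after the unintended propagation: both keys hold 4 bytes.
props = {HIVE_SIZE_KEY: "4", SPARK_SIZE_KEY: "4"}

# A second insert: Hive updates its own totalSize to 8 bytes.
props[HIVE_SIZE_KEY] = "8"

# The stale Spark entry shadows Hive's fresh value.
stale = size_in_bytes(props)

# Without the leaked Spark entry, the read falls through to Hive's value.
clean = size_in_bytes({HIVE_SIZE_KEY: "8"})
```

This mirrors the session below, where "Statistics" still reports 4 bytes after the second insert.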

spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name	data_type	comment
i	string	NULL
j	string	NULL
# Detailed Table Information		
Database	default	
Table	xx	
Owner	wzh	
Created	Thu Jun 08 18:30:46 PDT 2017	
Last Access	Wed Dec 31 16:00:00 PST 1969	
Type	MANAGED	
Provider	hive	
Properties	[serialization.format=1]	
Statistics	4 bytes	
Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
InputFormat	org.apache.hadoop.mapred.TextInputFormat	
OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
Partition Provider	Catalog	
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds

spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds

spark-sql> desc formatted xx;
# col_name	data_type	comment
i	string	NULL
j	string	NULL
# Detailed Table Information		
Database	default	
Table	xx	
Owner	wzh	
Created	Thu Jun 08 18:30:46 PDT 2017	
Last Access	Wed Dec 31 16:00:00 PST 1969	
Type	MANAGED	
Provider	hive	
Properties	[serialization.format=1]	
Statistics	4 bytes	(-- This should be 8 bytes)
Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
InputFormat	org.apache.hadoop.mapred.TextInputFormat	
OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
Partition Provider	Catalog	
Time taken: 0.077 seconds, Fetched 19 row(s)

Issue Links

links to

[Github] Pull Request #18248 (wzhfy)

People

Assignee: Zhenhua Wang

Reporter: Zhenhua Wang

Votes: 0

Watchers: 3

Dates

Created: 09/Jun/17 06:13

Updated: 12/Jun/17 00:24

Resolved: 12/Jun/17 00:24