[SPARK-22306] INFER_AND_SAVE overwrites important metadata in Parquet Metastore table - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.2.1, 2.3.0
Component/s: SQL
Labels: None
Environment:

Hive 2.3.0 (PostgreSQL metastore, stored as Parquet)
Spark 2.2.0

Description

I noticed some critical changes to my Hive tables and realized that they were caused by a simple SELECT in Spark SQL. Looking at the logs, I found that this SELECT was actually performing an update on the database, logged as "Saving case-sensitive schema for table".
I then found that Spark 2.2.0 introduces a new default value for spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE.

The issue is that this update changes critical metadata of the table, in particular:

changes the owner to the current user
removes bucketing metadata (BUCKETING_COLS, SDS)
removes sorting metadata (SORT_COLS)

Setting the property to NEVER_INFER prevents the issue.
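
As a rough sketch of that workaround, the property can be changed from a spark-shell session before the first query touches the table. This assumes the predefined `spark` session of spark-shell on Spark 2.2.0 and reuses table `t` from the reproduction below; whether a session-level setting is sufficient in a given deployment is an assumption, not something confirmed in this report.

```scala
// Assumption: running inside spark-shell, where `spark` (SparkSession) is predefined.
// Set the inference mode before any query reads the Hive table.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")

// Print the value now in effect for this session.
println(spark.conf.get("spark.sql.hive.caseSensitiveInferenceMode"))

// Same SELECT as in the reproduction; with NEVER_INFER, Spark should not
// infer the schema or write it back to the metastore.
spark.sql("SELECT * FROM t").show(false)
```

The same property can also be supplied at launch, for example with `--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER`, so that it applies from the start of the session.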

Also, note that the damage can be fixed manually in Hive with, e.g.:

alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets

REPRODUCE (branch-2.2)
Spark 2.1.x (branch-2.1) uses NEVER_INFER. Spark 2.3 (master branch) is unaffected thanks to SPARK-17729. This is a regression in Spark 2.2 only. By default, Parquet Hive tables are affected, and only Hive may suffer from this.

hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:        	10
Bucket Columns:     	[a, b]
Sort Columns:       	[Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("SELECT * FROM t").show(false)

hive> DESC FORMATTED t;
Num Buckets:        	-1
Bucket Columns:     	[]
Sort Columns:       	[]

Issue Links

is related to

SPARK-22329 Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

Resolved

relates to

SPARK-19611 Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

Resolved

SPARK-20888 Document HiveCaseSensitiveInferenceMode.INFER_AND_SAVE in Spark SQL 2.1 to 2.2 migration notes

Resolved

supersedes

SPARK-22329 Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

Resolved

links to

[Github] Pull Request #19622 (cloud-fan)

[Github] Pull Request #19644 (cloud-fan)

People

Assignee: Wenchen Fan

Reporter: David Malinge

Votes: 0

Watchers: 5

Dates

Created: 18/Oct/17 14:18

Updated: 01/Jun/19 02:14

Resolved: 02/Nov/17 11:38