Description
I noticed some critical changes to my Hive tables and traced them to a simple SELECT run through Spark SQL. Looking at the logs, I found that this SELECT was actually performing an update on the database, logged as "Saving case-sensitive schema for table".
I then found out that Spark 2.2.0 introduces a new default value for spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE.
The issue is that this update changes critical metadata of the table, in particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)
Switching the property to NEVER_INFER prevents the issue.
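As a minimal sketch of that workaround, assuming an application that builds its own SparkSession with Hive support, the property can be set when the session is created. The application name below is a placeholder, not something taken from this report.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the appName is a hypothetical placeholder.
// Setting the inference mode to NEVER_INFER keeps Spark 2.2 from
// writing an inferred case-sensitive schema back to the metastore.
val spark = SparkSession.builder()
  .appName("never-infer-example")
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .enableHiveSupport()
  .getOrCreate()

// Reads now use the schema stored in the metastore as-is.
spark.sql("SELECT * FROM t").show(false)
```

The same key and value can also be passed on the command line with --conf, or placed in spark-defaults.conf, which avoids changing application code.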
Also, note that the damage can be fixed manually in Hive with, e.g.:
alter table [table_name] clustered by ([col1], [col2]) sorted by ([colA], [colB]) into [n] buckets
REPRODUCE (branch-2.2)
In Spark 2.1.x (branch-2.1), NEVER_INFER is used. The Spark 2.3 (master) branch is not affected, thanks to SPARK-17729. This is a regression in Spark 2.2 only. By default, Parquet Hive tables are affected, and only Hive may suffer from this.
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets: 10
Bucket Columns: [a, b]
Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("SELECT * FROM t").show(false)

hive> DESC FORMATTED t;
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Issue Links
- is related to
  - SPARK-22329 Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default (Resolved)
- relates to
  - SPARK-19611 Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files (Resolved)
  - SPARK-20888 Document HiveCaseSensitiveInferenceMode.INFER_AND_SAVE in Spark SQL 2.1 to 2.2 migration notes (Resolved)
- supercedes
  - SPARK-22329 Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default (Resolved)