  1. Spark
  2. SPARK-22306

INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Hive 2.3.0 (PostgreSQL metastore, stored as Parquet)
      Spark 2.2.0

      Description

      I noticed critical changes to my Hive tables and traced them to a simple SELECT in Spark SQL. The logs show that this SELECT was actually performing an update on the metastore database, logged as "Saving case-sensitive schema for table".
      The cause is that Spark 2.2.0 introduces a new default value for spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

      The issue is that this update changes critical metadata of the table, in particular:

      • changes the owner to the current user
      • removes bucketing metadata (BUCKETING_COLS, SDS)
      • removes sorting metadata (SORT_COLS)

      Switching the property to NEVER_INFER prevents the issue.
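
      As a minimal sketch of that workaround, assuming a Hive-enabled Spark 2.2.0 build, the property can be set when the session is created. The application name below is hypothetical; only the property name and the NEVER_INFER value come from this report. In an already running spark-shell, spark.conf.set with the same key and value should have the same effect for later queries.

      ```scala
      import org.apache.spark.sql.SparkSession

      // Assumption: Spark was built with Hive support and can reach the metastore.
      // "never-infer-example" is a placeholder application name.
      val spark = SparkSession.builder()
        .appName("never-infer-example")
        .enableHiveSupport()
        // Do not infer a case-sensitive schema from the Parquet files, and
        // therefore do not write an inferred schema back to the metastore.
        .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
        .getOrCreate()

      // Check the effective value before querying any metastore table.
      println(spark.conf.get("spark.sql.hive.caseSensitiveInferenceMode"))
      ```

      With NEVER_INFER, Spark relies on the lower-case schema stored in the metastore, so tables whose Parquet files use mixed-case column names may need separate attention.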

      Also, note that the damage can be fixed manually in Hive with, e.g.:

      alter table [table_name] 
      clustered by ([col1], [col2]) 
      sorted by ([colA], [colB])
      into [n] buckets
      

      REPRODUCE (branch-2.2)
      Spark 2.1.x (branch-2.1) uses NEVER_INFER, and the Spark 2.3 (master) branch is not affected thanks to SPARK-17729, so this is a regression in Spark 2.2 only. Parquet Hive tables are affected by default, and only Hive may suffer from the lost metadata.

      hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) INTO 10 BUCKETS STORED AS PARQUET;
      hive> INSERT INTO t VALUES('a','b');
      hive> DESC FORMATTED t;
      ...
      Num Buckets:        	10
      Bucket Columns:     	[a, b]
      Sort Columns:       	[Order(col:a, order:1), Order(col:b, order:1)]
      
      scala> sql("SELECT * FROM t").show(false)
      
      hive> DESC FORMATTED t;
      Num Buckets:        	-1
      Bucket Columns:     	[]
      Sort Columns:       	[]
      

              People

              • Assignee: Wenchen Fan (cloud_fan)
              • Reporter: David Malinge (WhoisDavid)