Description
Spark persists its Parquet data source bucketed tables to the Hive metastore with an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat), which hints Hive to read the table as a SequenceFile table:
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC EXTENDED t;
c1                            int        NULL
c2                            int        NULL

# Detailed Table Information
Database                      default
Table                         t
Owner                         yumwang
Created Time                  Mon Apr 29 17:52:05 CST 2019
Last Access                   Thu Jan 01 08:00:00 CST 1970
Created By                    Spark 2.4.0
Type                          MANAGED
Provider                      parquet
Num Buckets                   2
Bucket Columns                [`c1`]
Sort Columns                  [`c1`]
Table Properties              [transient_lastDdlTime=1556531525]
Location                      file:/user/hive/warehouse/t
Serde Library                 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                   org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat                  org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties            [serialization.format=1]
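Note that the real provider and bucketing spec survive only as Spark-internal table properties in the metastore, which other engines do not interpret; an engine that trusts the storage descriptor will apply SequenceFileInputFormat to Parquet files. For illustration (the exact spark.sql.sources.* keys vary by Spark version), this is visible from Hive's side:

hive> DESCRIBE FORMATTED t;
-- Table Parameters include Spark-internal keys such as
--   spark.sql.sources.provider = parquet
-- but Hive acts only on the (incorrect) InputFormat in the storage descriptor.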
Spark does warn about the incompatibility when the table is created:
WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
But downstream engines cannot tell from the table metadata whether the table is compatible. I'd like to write the writer information of this table to the metadata so that each engine can decide compatibility itself.
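A minimal sketch of the idea (the property keys below are hypothetical, not a committed design): persist the writer information as ordinary table properties at write time, so any engine can read it back through the metastore and decide for itself:

-- Hypothetical keys; the real names would be settled in the fix.
ALTER TABLE t SET TBLPROPERTIES (
  'writer.engine'    = 'spark',
  'writer.version'   = '2.4.0',
  'writer.provider'  = 'parquet',
  'writer.bucketing' = 'spark'  -- Spark hash bucketing, not Hive-compatible
);

-- A downstream engine checks the properties before choosing a reader:
SHOW TBLPROPERTIES t;

Plain table properties have the advantage that every metastore client already knows how to read them, unlike the Spark-specific schema encoding used today.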
Issue Links
- is duplicated by SPARK-29234: bucketed table created by Spark SQL DataFrame is in SequenceFile format (Resolved)
- is related to SPARK-25102: Write Spark version to ORC/Parquet file metadata (Resolved)