[SPARK-27592] Set the bucketed data source table SerDe correctly - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels: None

Description

The metadata Spark writes to the Hive metastore tells Hive to use an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's bucketed Parquet data source table:

spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
 spark-sql> DESC EXTENDED t;
 c1 int NULL
 c2 int NULL
 # Detailed Table Information
 Database default
 Table t
 Owner yumwang
 Created Time Mon Apr 29 17:52:05 CST 2019
 Last Access Thu Jan 01 08:00:00 CST 1970
 Created By Spark 2.4.0
 Type MANAGED
 Provider parquet
 Num Buckets 2
 Bucket Columns [`c1`]
 Sort Columns [`c1`]
 Table Properties [transient_lastDdlTime=1556531525]
 Location file:/user/hive/warehouse/t
 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
 OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 Storage Properties [serialization.format=1]

Spark logs a compatibility warning when the table is created:

WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

But downstream engines never see this warning, so they cannot tell that the table is incompatible. I'd like to record how the table was written in its metadata so that each engine can decide compatibility for itself.
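
The mismatch in the DESC EXTENDED output above can be checked mechanically. The following is a minimal Python sketch, not part of Spark or Hive: the helper names (`parse_desc_extended`, `input_format_matches_provider`) and the `EXPECTED_INPUT_FORMAT` table are hypothetical and introduced only for illustration. It assumes each row of the output is a key followed by whitespace and a value, and compares the reported InputFormat against the Hive input format class normally used for the table's provider.

```python
# Hypothetical helper for illustration; not an API of Spark or Hive.
# Hive input format classes normally associated with each provider.
EXPECTED_INPUT_FORMAT = {
    "parquet": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "orc": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
}


def parse_desc_extended(text):
    """Extract the Provider and InputFormat rows from DESC EXTENDED output."""
    info = {}
    for line in text.splitlines():
        # Assumption: each row is "<key><whitespace><value>".
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in ("Provider", "InputFormat"):
            info[parts[0]] = parts[1].strip()
    return info


def input_format_matches_provider(text):
    """Return True/False for known providers, None if the provider is unknown."""
    info = parse_desc_extended(text)
    expected = EXPECTED_INPUT_FORMAT.get(info.get("Provider", "").lower())
    if expected is None:
        return None
    return info.get("InputFormat") == expected


# Rows copied from the output in this issue.
sample = """\
Provider parquet
Num Buckets 2
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
"""
print(input_format_matches_provider(sample))  # prints False
```

A result of False here only flags that the recorded InputFormat does not match the provider; it does not say anything about whether Hive can honor Spark's bucketing.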

Issue Links

is duplicated by

SPARK-29234 bucketed table created by Spark SQL DataFrame is in SequenceFile format

Resolved

is related to

SPARK-25102 Write Spark version to ORC/Parquet file metadata

Resolved

links to

GitHub Pull Request #24486

GitHub Pull Request #25591

People

Assignee: Yuming Wang

Reporter: Yuming Wang

Votes: 0

Watchers: 4

Dates

Created: 29/Apr/19 09:45

Updated: 28/Oct/19 19:44

Resolved: 15/Aug/19 12:59