Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.9.0
Description
Currently hudi can sync the meta data to hive meta store using HiveSyncTool. The table description synced to hive just like this:
CREATE EXTERNAL TABLE `tbl_price_insert0`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `id` int, `name` string, `price` double, `version` int, `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/tmp/hudi/tbl_price_insert0' TBLPROPERTIES ( 'last_commit_time_sync'='20201124105009', 'transient_lastDdlTime'='1606186222')
When we query this table using spark sql, it trait it as a Hive Table, not a spark data source table and convert it to parquet LogicalRelation in HiveStrategies#RelationConversions. As a result, spark sql read the hudi table just like a parquet data source. This lead to an incorrect query result if user missing set the spark.sql.hive.convertMetastoreParquet=false.
Inorder to query hudi table as data source table in spark, more table properties and serde properties must be added to the hive meta,just like the follow:
CREATE EXTERNAL TABLE `tbl_price_cow0`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `id` int, `name` string, `price` double, `version` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ( 'path'='/tmp/hudi/tbl_price_cow0') STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/tmp/hudi/tbl_price_cow0' TBLPROPERTIES ( 'last_commit_time_sync'='20201124120532', 'spark.sql.sources.provider'='hudi', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 'transient_lastDdlTime'='1606190729')
These are the missing table properties:
spark.sql.sources.provider= 'hudi' spark.sql.sources.schema.numParts = 'xx' spark.sql.sources.schema.part.{num} ='xx' spark.sql.sources.schema.numPartCols = 'xx' spark.sql.sources.schema.partCol.{num} = 'xx'
and serde property:
'path'='/path/to/hudi'
Attachments
Issue Links
- links to