Apache Hudi / HUDI-1415

Read Hoodie Table As Spark DataSource Table



    Description

       Currently Hudi can sync its metadata to the Hive metastore using HiveSyncTool. The table definition synced to Hive looks like this:

      CREATE EXTERNAL TABLE `tbl_price_insert0`(
        `_hoodie_commit_time` string, 
        `_hoodie_commit_seqno` string, 
        `_hoodie_record_key` string, 
        `_hoodie_partition_path` string, 
        `_hoodie_file_name` string, 
        `id` int, 
        `name` string, 
        `price` double,
        `version` int, 
        `dt` string)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
        'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        'file:/tmp/hudi/tbl_price_insert0'
      TBLPROPERTIES (
        'last_commit_time_sync'='20201124105009', 
        'transient_lastDdlTime'='1606186222')
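
      For reference, here is a minimal sketch (in Spark Scala) of a write that triggers this sync via the Hudi datasource's hive-sync options; the table name, schema, and path are illustrative, and only the options shown are set:

      // Minimal sketch: write a Hudi table and let the datasource run
      // HiveSyncTool. Table name, schema, and path are illustrative.
      import org.apache.spark.sql.{SaveMode, SparkSession}

      val spark = SparkSession.builder()
        .appName("hudi-hive-sync-sketch")
        .enableHiveSupport()
        .getOrCreate()
      import spark.implicits._

      // Matches the columns in the DDL above: id, name, price, version, dt
      val df = Seq((1, "a1", 10.0, 1, "2020-11-24"))
        .toDF("id", "name", "price", "version", "dt")

      df.write.format("hudi")
        .option("hoodie.table.name", "tbl_price_insert0")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "version")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.table", "tbl_price_insert0")
        .option("hoodie.datasource.hive_sync.partition_fields", "dt")
        .mode(SaveMode.Overwrite)
        .save("file:/tmp/hudi/tbl_price_insert0")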
      

      When we query this table with Spark SQL, Spark treats it as a Hive table rather than a Spark data source table and converts it to a parquet LogicalRelation in HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table as if it were a plain parquet data source, which leads to incorrect query results if the user forgets to set spark.sql.hive.convertMetastoreParquet=false.
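
      For now, the workaround is to disable that conversion before querying, e.g. in spark-shell (reusing the table above):

      // Workaround sketch: disable the Hive-parquet conversion so the query
      // goes through HoodieParquetInputFormat instead of a plain parquet scan.
      spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
      spark.sql("SELECT id, name, price FROM tbl_price_insert0").show()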

      In order to query a Hudi table as a data source table in Spark, additional table properties and serde properties must be added to the Hive metastore, like the following:

      CREATE EXTERNAL TABLE `tbl_price_cow0`(
        `_hoodie_commit_time` string, 
        `_hoodie_commit_seqno` string, 
        `_hoodie_record_key` string, 
        `_hoodie_partition_path` string, 
        `_hoodie_file_name` string, 
        `id` int, 
        `name` string, 
        `price` double,
        `version` int)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      WITH SERDEPROPERTIES ( 
        'path'='/tmp/hudi/tbl_price_cow0') 
      STORED AS INPUTFORMAT 
        'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        'file:/tmp/hudi/tbl_price_cow0'
      TBLPROPERTIES (
        'last_commit_time_sync'='20201124120532', 
        'spark.sql.sources.provider'='hudi', 
        'spark.sql.sources.schema.numParts'='1', 
        'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 
        'transient_lastDdlTime'='1606190729')
      

      These are the missing table properties:

      spark.sql.sources.provider = 'hudi'
      spark.sql.sources.schema.numParts = 'xx'
      spark.sql.sources.schema.part.{num} = 'xx'
      spark.sql.sources.schema.numPartCols = 'xx'
      spark.sql.sources.schema.partCol.{num} = 'xx'

      and the missing serde property:

      'path'='/path/to/hudi'
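
      As a rough illustration, the schema properties above can be generated from a Spark schema the same way Spark's HiveExternalCatalog persists data source tables; the 4000-character chunking matches Spark's default spark.sql.sources.schemaStringLengthThreshold, and the helper name here is ours:

      import org.apache.spark.sql.types._

      // Build the spark.sql.sources.* table properties for a schema and its
      // partition columns, chunking the schema JSON into <= 4000-char parts.
      def schemaProperties(schema: StructType, partCols: Seq[String]): Map[String, String] = {
        val parts = schema.json.grouped(4000).toSeq
        val base = Map(
          "spark.sql.sources.provider" -> "hudi",
          "spark.sql.sources.schema.numParts" -> parts.size.toString) ++
          parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
        if (partCols.isEmpty) base
        else base ++
          Map("spark.sql.sources.schema.numPartCols" -> partCols.size.toString) ++
          partCols.zipWithIndex.map { case (col, i) => s"spark.sql.sources.schema.partCol.$i" -> col }
      }

      // Example: the schema of tbl_price_cow0 above (no partition columns),
      // which yields the schema.part.0 JSON shown in the second DDL.
      val cowSchema = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType),
        StructField("price", DoubleType, nullable = false),
        StructField("version", IntegerType, nullable = false)))
      schemaProperties(cowSchema, Seq.empty).foreach(println)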
      


People

  Assignee: pzw2018 pengzhiwei
  Reporter: pzw2018 pengzhiwei
