Apache Hudi / HUDI-1415

Read Hoodie Table As Spark DataSource Table



    Description

       Currently Hudi can sync its metadata to the Hive metastore using HiveSyncTool. The table definition synced to Hive looks like this:

      CREATE EXTERNAL TABLE `tbl_price_insert0`(
        `_hoodie_commit_time` string, 
        `_hoodie_commit_seqno` string, 
        `_hoodie_record_key` string, 
        `_hoodie_partition_path` string, 
        `_hoodie_file_name` string, 
        `id` int, 
        `name` string, 
        `price` double,
        `version` int, 
        `dt` string)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
        'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        'file:/tmp/hudi/tbl_price_insert0'
      TBLPROPERTIES (
        'last_commit_time_sync'='20201124105009', 
        'transient_lastDdlTime'='1606186222')
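
      For reference, here is a minimal sketch (in Spark Scala) of a write that triggers this sync via the Hudi datasource's hive-sync options; the table name, schema, and path are illustrative, and only the options shown are set:

      // Minimal sketch: write a Hudi table and let the datasource run
      // HiveSyncTool. Table name, schema, and path are illustrative.
      import org.apache.spark.sql.{SaveMode, SparkSession}

      val spark = SparkSession.builder()
        .appName("hudi-hive-sync-sketch")
        .enableHiveSupport()
        .getOrCreate()
      import spark.implicits._

      // Matches the columns in the DDL above: id, name, price, version, dt
      val df = Seq((1, "a1", 10.0, 1, "2020-11-24"))
        .toDF("id", "name", "price", "version", "dt")

      df.write.format("hudi")
        .option("hoodie.table.name", "tbl_price_insert0")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "version")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.table", "tbl_price_insert0")
        .option("hoodie.datasource.hive_sync.partition_fields", "dt")
        .mode(SaveMode.Overwrite)
        .save("file:/tmp/hudi/tbl_price_insert0")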
      

      When we query this table with Spark SQL, Spark treats it as a Hive table rather than a Spark data source table and converts it to a parquet LogicalRelation in HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table as if it were a plain parquet data source, which leads to incorrect query results if the user forgets to set spark.sql.hive.convertMetastoreParquet=false.
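
      For now, the workaround is to disable that conversion before querying, e.g. in spark-shell (reusing the table above):

      // Workaround sketch: disable the Hive-parquet conversion so the query
      // goes through HoodieParquetInputFormat instead of a plain parquet scan.
      spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
      spark.sql("SELECT id, name, price FROM tbl_price_insert0").show()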

      In order to query a Hudi table as a data source table in Spark, additional table properties and serde properties must be added to the Hive metastore, like the following:

      CREATE EXTERNAL TABLE `tbl_price_cow0`(
        `_hoodie_commit_time` string, 
        `_hoodie_commit_seqno` string, 
        `_hoodie_record_key` string, 
        `_hoodie_partition_path` string, 
        `_hoodie_file_name` string, 
        `id` int, 
        `name` string, 
        `price` double,
        `version` int)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      WITH SERDEPROPERTIES ( 
        'path'='/tmp/hudi/tbl_price_cow0') 
      STORED AS INPUTFORMAT 
        'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        'file:/tmp/hudi/tbl_price_cow0'
      TBLPROPERTIES (
        'last_commit_time_sync'='20201124120532', 
        'spark.sql.sources.provider'='hudi', 
        'spark.sql.sources.schema.numParts'='1', 
        'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 
        'transient_lastDdlTime'='1606190729')
      

      These are the missing table properties:

      spark.sql.sources.provider = 'hudi'
      spark.sql.sources.schema.numParts = 'xx'
      spark.sql.sources.schema.part.{num} = 'xx'
      spark.sql.sources.schema.numPartCols = 'xx'
      spark.sql.sources.schema.partCol.{num} = 'xx'

      and the missing serde property:

      'path'='/path/to/hudi'
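
      As a rough illustration, the schema properties above can be generated from a Spark schema the same way Spark's HiveExternalCatalog persists data source tables; the 4000-character chunking matches Spark's default spark.sql.sources.schemaStringLengthThreshold, and the helper name here is ours:

      import org.apache.spark.sql.types._

      // Build the spark.sql.sources.* table properties for a schema and its
      // partition columns, chunking the schema JSON into <= 4000-char parts.
      def schemaProperties(schema: StructType, partCols: Seq[String]): Map[String, String] = {
        val parts = schema.json.grouped(4000).toSeq
        val base = Map(
          "spark.sql.sources.provider" -> "hudi",
          "spark.sql.sources.schema.numParts" -> parts.size.toString) ++
          parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
        if (partCols.isEmpty) base
        else base ++
          Map("spark.sql.sources.schema.numPartCols" -> partCols.size.toString) ++
          partCols.zipWithIndex.map { case (col, i) => s"spark.sql.sources.schema.partCol.$i" -> col }
      }

      // Example: the schema of tbl_price_cow0 above (no partition columns),
      // which yields the schema.part.0 JSON shown in the second DDL.
      val cowSchema = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType),
        StructField("price", DoubleType, nullable = false),
        StructField("version", IntegerType, nullable = false)))
      schemaProperties(cowSchema, Seq.empty).foreach(println)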
      


People

  Assignee: pzw2018 pengzhiwei
  Reporter: pzw2018 pengzhiwei
