Apache Hudi / HUDI-1779

Fail to bootstrap/upsert a table which contains a timestamp column

Details

    Description

      Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file that contains a timestamp column, the operation fails because of these issues:

      1) During bootstrap, if the original parquet file was written by a Spark application, Spark saves timestamps as INT96 by default (see spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot currently read the INT96 type. (This issue can be solved by upgrading parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; please check https://github.com/apache/parquet-mr/pull/831/files)

      2) After bootstrap, an upsert fails because we use the hoodie schema to read the original parquet file. The schemas do not match: the hoodie schema treats timestamp as long, while the original file stores it as INT96.

      3) After bootstrap, a partial update of a parquet file fails because we copy the old record and save it with the hoodie schema (we are missing a convertFixedToLong operation like the one Spark performs).
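
      To make the missing conversion in item 3 concrete, here is a minimal sketch of turning a 12-byte INT96 timestamp (what a reader sees as a fixed value when parquet.avro.readInt96AsFixed=true) into a long of microseconds since the Unix epoch. It assumes the usual INT96 layout: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. The function names are hypothetical and are not Hudi or Spark APIs; this only illustrates the arithmetic, not the actual fix.

```python
import struct

# Julian day number of 1970-01-01, used to shift Julian days to epoch days.
JULIAN_DAY_OF_EPOCH = 2440588
MICROS_PER_DAY = 86_400 * 1_000_000


def int96_to_epoch_micros(raw: bytes) -> int:
    """Hypothetical helper: decode a 12-byte INT96 timestamp to epoch microseconds.

    Assumed layout: little-endian int64 nanoseconds-of-day, then
    little-endian int32 Julian day.
    """
    if len(raw) != 12:
        raise ValueError(f"INT96 value must be 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH
    # Sub-microsecond precision is dropped, since the target type is long micros.
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def epoch_micros_to_int96(micros: int) -> bytes:
    """Hypothetical inverse, included only to build sample values."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return struct.pack("<qi", micros_of_day * 1000, days + JULIAN_DAY_OF_EPOCH)


if __name__ == "__main__":
    sample = epoch_micros_to_int96(1_617_235_200_000_000)
    print(int96_to_epoch_micros(sample))
```

      A record copied from the original file would need each INT96 field passed through a conversion of this kind before being written with a schema that declares the column as long.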

      Attachments

        1. upsertFail2.png
          73 kB
          lrz
        2. upsertFail.png
          92 kB
          lrz
        3. unsupportInt96.png
          62 kB
          lrz

            People

              guoyihua Ethan Guo
              lrz lrz