Apache Hudi / HUDI-7485

Can't read a Hudi table written with TimestampBasedKeyGenerator


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: reader-core
    • Labels: None

    Description

      Reading a Hudi table that was written with TimestampBasedKeyGenerator and the output date format 'yyyy-MM-dd' fails with the exception below.
      ```
      Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
              at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
              at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
              at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
              at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
              at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
              at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264)
              at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314)
              at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
      ```
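
      The cast fails while the vectorized Parquet reader populates the partition column: the reader asks for a UTF8String value for `dt`, but the partition-values row holds an Integer, Spark's internal representation of a DATE. In other words, the string-formatted, hive-style partition path produced by TimestampBasedKeyGenerator ('yyyy-MM-dd') and the DATE-typed `dt` column written by the job below disagree about the column's type. The sketch that follows is a diagnostic aid added for illustration (it is not part of the original report); it assumes `basePath` points at a local-filesystem copy of the table written by the reproduction code and simply shows the string-formatted partition directory next to the DATE-typed table schema.

      ```
      # Diagnostic sketch (assumption: basePath refers to a local filesystem table
      # written by the reproduction code below).
      import os

      table_dir = basePath.replace("file://", "")
      print([d for d in os.listdir(table_dir) if d.startswith("dt=")])
      # expected something like ['dt=2012-01-01'] -- hive-style, string-formatted

      # Schema resolution alone does not trigger the failing scan; it shows dt as DATE
      # because the column was cast to DATE before writing.
      spark.read.format("hudi").load(basePath).printSchema()
      ```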
       
      Code to reproduce:
      ```
      from pyspark.sql.functions import expr

      # tableName and basePath must be defined beforehand (the report does not include them),
      # e.g. tableName = "trips_table"; basePath = "file:///tmp/trips_table" (example values)
      columns = ["ts", "uuid", "rider", "driver", "fare", "dt"]
      data = [(1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "2012-01-01"),
              (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "2012-01-01"),
              (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "2012-01-01"),
              (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "2012-01-01")]

      inserts = spark.createDataFrame(data).toDF(*columns)

      hudi_options = {
          'hoodie.table.name': tableName,
          'hoodie.datasource.write.recordkey.field': 'uuid',
          'hoodie.datasource.write.precombine.field': 'ts',
          'hoodie.datasource.write.partitionpath.field': 'dt',
          'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
          'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.TimestampBasedKeyGenerator',
          'hoodie.keygen.timebased.timestamp.type': 'SCALAR',
          'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'DAYS',
          'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd',
          'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd',
          'hoodie.keygen.timebased.timezone': 'GMT+8:00',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
      }

      # Insert data (dt is cast to DATE before writing)
      inserts.withColumn("dt", expr("CAST(dt as date)")).write.format("hudi"). \
          options(**hudi_options). \
          mode("overwrite"). \
          save(basePath)

      # Reading the table back fails with the ClassCastException above
      deleteDF = spark.read.format("hudi").load(basePath)
      deleteDF.show()
      ```
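
      For comparison, the sketch below (added here, not part of the original report) writes the same rows without the CAST to DATE, so `dt` stays a 'yyyy-MM-dd' string and the column type matches the string-formatted partition path. Writing to a separate, hypothetical location `basePath_str` keeps the failing table intact. This is only an experiment to help narrow the issue down, not a fix for the reader bug.

      ```
      # Hedged experiment: same SparkSession, DataFrame and hudi_options as above.
      # basePath_str is a hypothetical second location so the failing table is untouched.
      basePath_str = basePath + "_str"

      inserts.write.format("hudi"). \
          options(**hudi_options). \
          mode("overwrite"). \
          save(basePath_str)

      # With dt written as a string, the partition value and the column type should agree,
      # so the vectorized reader has nothing to cast when the table is read back.
      spark.read.format("hudi").load(basePath_str).show()
      ```

      If the string-typed variant reads back cleanly, that points the fix at how the read path reconciles the DATE-typed column with the string-formatted partition value.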


          People

            Assignee: Unassigned
            Reporter: Aditya Goenka (adityagoenka)
            Votes: 0
            Watchers: 1
