Apache Hudi / HUDI-7485

Can't read a Hudi table written with TimestampBasedKeyGenerator


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: reader-core
    • Labels: None

    Description

      Reading a Hudi table that was written with TimestampBasedKeyGenerator and the output date format 'yyyy-MM-dd' fails with the exception below.
      ```
      Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
              at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
              at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
              at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
              at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
              at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
              at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264)
              at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314)
              at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
      ```
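
      The cast fails while the vectorized Parquet reader populates the partition column: the reader asks for a UTF8String value for `dt`, but the partition-values row holds an Integer, Spark's internal representation of a DATE. In other words, the string-formatted, hive-style partition path produced by TimestampBasedKeyGenerator ('yyyy-MM-dd') and the DATE-typed `dt` column written by the job below disagree about the column's type. The sketch that follows is a diagnostic aid added for illustration (it is not part of the original report); it assumes `basePath` points at a local-filesystem copy of the table written by the reproduction code and simply shows the string-formatted partition directory next to the DATE-typed table schema.

      ```
      # Diagnostic sketch (assumption: basePath refers to a local filesystem table
      # written by the reproduction code below).
      import os

      table_dir = basePath.replace("file://", "")
      print([d for d in os.listdir(table_dir) if d.startswith("dt=")])
      # expected something like ['dt=2012-01-01'] -- hive-style, string-formatted

      # Schema resolution alone does not trigger the failing scan; it shows dt as DATE
      # because the column was cast to DATE before writing.
      spark.read.format("hudi").load(basePath).printSchema()
      ```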
       
      Code to reproduce:
      ```
      from pyspark.sql.functions import expr

      # tableName and basePath must be defined beforehand (the report does not include them),
      # e.g. tableName = "trips_table"; basePath = "file:///tmp/trips_table" (example values)
      columns = ["ts", "uuid", "rider", "driver", "fare", "dt"]
      data = [(1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "2012-01-01"),
              (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "2012-01-01"),
              (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "2012-01-01"),
              (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "2012-01-01")]

      inserts = spark.createDataFrame(data).toDF(*columns)

      hudi_options = {
          'hoodie.table.name': tableName,
          'hoodie.datasource.write.recordkey.field': 'uuid',
          'hoodie.datasource.write.precombine.field': 'ts',
          'hoodie.datasource.write.partitionpath.field': 'dt',
          'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
          'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.TimestampBasedKeyGenerator',
          'hoodie.keygen.timebased.timestamp.type': 'SCALAR',
          'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'DAYS',
          'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd',
          'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd',
          'hoodie.keygen.timebased.timezone': 'GMT+8:00',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
      }

      # Insert data (dt is cast to DATE before writing)
      inserts.withColumn("dt", expr("CAST(dt as date)")).write.format("hudi"). \
          options(**hudi_options). \
          mode("overwrite"). \
          save(basePath)

      # Reading the table back fails with the ClassCastException above
      deleteDF = spark.read.format("hudi").load(basePath)
      deleteDF.show()
      ```
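
      For comparison, the sketch below (added here, not part of the original report) writes the same rows without the CAST to DATE, so `dt` stays a 'yyyy-MM-dd' string and the column type matches the string-formatted partition path. Writing to a separate, hypothetical location `basePath_str` keeps the failing table intact. This is only an experiment to help narrow the issue down, not a fix for the reader bug.

      ```
      # Hedged experiment: same SparkSession, DataFrame and hudi_options as above.
      # basePath_str is a hypothetical second location so the failing table is untouched.
      basePath_str = basePath + "_str"

      inserts.write.format("hudi"). \
          options(**hudi_options). \
          mode("overwrite"). \
          save(basePath_str)

      # With dt written as a string, the partition value and the column type should agree,
      # so the vectorized reader has nothing to cast when the table is read back.
      spark.read.format("hudi").load(basePath_str).show()
      ```

      If the string-typed variant reads back cleanly, that points the fix at how the read path reconciles the DATE-typed column with the string-formatted partition value.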


          People

            Assignee: Unassigned
            Reporter: Aditya Goenka (adityagoenka)
            Votes: 0
            Watchers: 1
