Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Description
The schemacommit file grows with every commit, even when the schema has not changed, because it retains all historical schema versions. Eventually the job starts failing with an OOM exception as a result.
The following code reproduces the issue:
```
basePath = "file:///tmp/hudi_cow_read"
streamingTableName = "hudi_trips_cow_streaming"
baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"
hudi_streaming_options = {
    'hoodie.table.name': streamingTableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.datasource.write.table.name': streamingTableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.write.drop.partition.columns': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}
# create streaming df (assumes an existing SparkSession named `spark`)
df = spark.readStream \
    .format("hudi") \
    .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true") \
    .load(basePath) \
    .select("ts", "uuid", "rider", "driver", "fare", "city")
# write stream to new hudi table
df.writeStream.format("hudi") \
.options(**hudi_streaming_options) \
.outputMode("append") \
.option("path", baseStreamingPath) \
.option("checkpointLocation", checkpointLocation) \
.trigger(processingTime='10 seconds') \
.start() \
.awaitTermination()
```
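To confirm the growth while the stream is running, the size of the schemacommit files can be checked between micro-batches. The sketch below is only an illustration: it assumes the files live under `.hoodie/.schema` inside the table base path and end in `.schemacommit` (an assumption about the on-disk layout that this report does not state), and the helper name `schemacommit_sizes` is hypothetical.

```python
from pathlib import Path


def schemacommit_sizes(table_path):
    """Return (file name, size in bytes) pairs for schemacommit files, sorted by name.

    Assumption: schema history is stored in <table_path>/.hoodie/.schema
    as files ending in ".schemacommit".
    """
    schema_dir = Path(table_path) / ".hoodie" / ".schema"
    if not schema_dir.is_dir():
        # No schema folder yet (or a different layout): nothing to report.
        return []
    return sorted(
        (p.name, p.stat().st_size)
        for p in schema_dir.iterdir()
        if p.is_file() and p.name.endswith(".schemacommit")
    )


if __name__ == "__main__":
    # Local path matching baseStreamingPath in the repro, without the file:// scheme.
    for name, size in schemacommit_sizes("/tmp/hudi_trips_cow_streaming"):
        print(name, size)
```

If the sizes keep increasing across commits that carry no schema change, that matches the behaviour described in this report.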
GitHub issue: https://github.com/apache/hudi/issues/10816