Apache Hudi / HUDI-7481

schemacommit file grows with every commit and the job ultimately fails with OOM

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: writer-core
    • Labels: None

    Description

      The schemacommit file grows with every commit, even when the schema does not change, because it retains all historical schema versions. Eventually the job fails with an OOM exception as a result.

       

      The following code reproduces the issue:

      ```
      # Assumes an active SparkSession `spark` with the Hudi Spark bundle available,
      # and an existing Hudi table at basePath to stream from.
      basePath = "file:///tmp/hudi_cow_read"
      streamingTableName = "hudi_trips_cow_streaming"
      baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
      checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"

      hudi_streaming_options = {
          'hoodie.table.name': streamingTableName,
          'hoodie.datasource.write.recordkey.field': 'uuid',
          'hoodie.datasource.write.precombine.field': 'ts',
          'hoodie.datasource.write.partitionpath.field': 'city',
          'hoodie.datasource.write.table.name': streamingTableName,
          'hoodie.datasource.write.operation': 'upsert',
          'hoodie.upsert.shuffle.parallelism': 2,
          'hoodie.insert.shuffle.parallelism': 2,
          'hoodie.schema.on.read.enable': 'true',
          'hoodie.datasource.write.reconcile.schema': 'true',
          'hoodie.datasource.write.drop.partition.columns': 'true',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
      }

      # create streaming df
      df = spark.readStream \
          .format("hudi") \
          .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true") \
          .load(basePath) \
          .select("ts", "uuid", "rider", "driver", "fare", "city")

      # write stream to new hudi table (one commit per 10-second trigger)
      df.writeStream.format("hudi") \
          .options(**hudi_streaming_options) \
          .outputMode("append") \
          .option("path", baseStreamingPath) \
          .option("checkpointLocation", checkpointLocation) \
          .trigger(processingTime='10 seconds') \
          .start() \
          .awaitTermination()
      ```
      GitHub issue: https://github.com/apache/hudi/issues/10816
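
One way to observe the growth described above is to list the schemacommit files and their sizes while the streaming job runs. The sketch below is a minimal helper, not part of Hudi: the function name `schemacommit_sizes` is hypothetical, and the location `<table path>/.hoodie/.schema/*.schemacommit` is an assumption about where the table keeps its schema history, so adjust it if your layout differs. It only works for tables on the local filesystem, as in the reproduction.

```python
from pathlib import Path


def schemacommit_sizes(table_path):
    """Return (file name, size in bytes) for each *.schemacommit file, sorted by name."""
    # Assumption: schema history files live under <table>/.hoodie/.schema
    schema_dir = Path(table_path) / ".hoodie" / ".schema"
    if not schema_dir.is_dir():
        return []
    files = sorted(schema_dir.glob("*.schemacommit"), key=lambda p: p.name)
    return [(p.name, p.stat().st_size) for p in files]


if __name__ == "__main__":
    # Local path of baseStreamingPath from the reproduction, without the file:// scheme
    for name, size in schemacommit_sizes("/tmp/hudi_trips_cow_streaming"):
        print(f"{name}\t{size} bytes")
```

If the file names start with the commit instant time, sorting by name lists them in commit order, so sizes that keep increasing across triggers with an unchanged schema would be consistent with the behavior reported here.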

          People

            Assignee: Unassigned
            Reporter: Aditya Goenka (adityagoenka)