During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI. I have traced the issue to the changes in the following commit
I wrote a unit test to reduce the scope of testing as follows:
- Take an existing parquet file from production dataset (size=690MB, #records=960K)
- Read all the records from this parquet into a JavaRDD
- Time the call to HoodieWriteClient.bulkInsertPrepped() (with bulkInsertParallelism=1)
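The steps above can be sketched as a minimal timing harness. This is an illustration only: the class name `BulkInsertTiming` and the placeholder workload are hypothetical, and the real test would invoke the Hudi write client against the 960K-record JavaRDD on a Spark cluster (shown here as a comment, since that requires the production parquet file):

```java
import java.util.concurrent.TimeUnit;

public class BulkInsertTiming {
    // Stand-in for the Hudi write path. In the real unit test this is where
    // the prepped records are written, e.g.:
    //   writeClient.bulkInsertPrepped(preppedRecordsRdd, instantTime, ...);
    // with bulkInsertParallelism=1, against records read from the 690MB
    // production parquet file.
    static void bulkInsertPrepped() throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(50); // placeholder workload
    }

    // Returns wall-clock time of the write call in milliseconds.
    static long timeBulkInsert() throws InterruptedException {
        long start = System.nanoTime();
        bulkInsertPrepped();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bulkInsertPrepped took " + timeBulkInsert() + " ms");
    }
}
```

Comparing this measured duration with and without the suspect commit isolates the regression to the bulk insert write path.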
The above scenario is taken directly from our production pipelines, where each executor ingests about a million records, creating a single parquet file in a COW dataset. This is a bulk-insert-only dataset.
The time to complete the bulk insert prepped call dropped from 680 seconds to 380 seconds when I reverted the above commit.
Schema details: This HUDI dataset uses a large schema with 51 fields per record.