Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version/s: 0.6.0, 0.7.0, 0.8.0
- Component/s: None
Description
During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI. I have traced the issue to the changes in the following commit.
I wrote a unit test to narrow down the scope of testing (a rough sketch of the harness is shown after the description below), as follows:
- Take an existing parquet file from a production dataset (size = 690 MB, #records = 960K)
- Read all the records from this parquet file into a JavaRDD
- Time the call to HoodieWriteClient.bulkInsertPrepped() with bulkInsertParallelism=1
The above scenario is taken directly from our production pipelines, where each executor ingests about a million records, creating a single parquet file in a COW dataset. This is a bulk-insert-only dataset.
The time to complete the bulk insert prepped call increased from 380 seconds to 680 seconds; it dropped back to 380 seconds when I reverted the above commit.
Schema details: this HUDI dataset uses a large schema with 51 fields per record.
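
For reference, here is a minimal sketch of the timing harness described above, assuming the 0.6.x package layout (org.apache.hudi.client.*). The table path, table name, schema constant, and the toHoodieRecords() conversion helper are hypothetical placeholders, and the exact write-client and config-builder method names may differ slightly between Hudi versions.

```java
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BulkInsertPreppedBenchmark {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bulk-insert-prepped-benchmark")
        .master("local[4]")
        .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Step 1: read all records from the production parquet file (~690 MB, ~960K records).
    Dataset<Row> rows = spark.read().parquet("/path/to/production-file.parquet"); // placeholder path

    // Step 2: convert rows into prepped HoodieRecords (keys/partition paths already assigned).
    // The conversion is dataset-specific and elided here; toHoodieRecords() is a hypothetical helper.
    JavaRDD<HoodieRecord<HoodieAvroPayload>> preppedRecords = toHoodieRecords(rows);

    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("/tmp/hudi_bulk_insert_benchmark")   // placeholder base path for the COW table
        .withSchema(RECORD_SCHEMA_JSON)                // the 51-field Avro schema (placeholder)
        .withBulkInsertParallelism(1)                  // bulkInsertParallelism=1, as in the test
        .forTable("bulk_insert_benchmark")
        .build();

    HoodieWriteClient<HoodieAvroPayload> client = new HoodieWriteClient<>(jsc, config);
    String instantTime = client.startCommit();

    // Step 3: time the bulk insert prepped call; count() forces the lazy RDD to execute
    // so the timer covers the actual write.
    long start = System.nanoTime();
    JavaRDD<WriteStatus> statuses =
        client.bulkInsertPreppedRecords(preppedRecords, instantTime, Option.empty());
    statuses.count();
    long elapsedSeconds = (System.nanoTime() - start) / 1_000_000_000L;
    System.out.println("bulkInsertPrepped took " + elapsedSeconds + " s");

    client.close();
    spark.stop();
  }

  // Placeholders so the sketch is self-contained; supply real values for an actual run.
  private static final String RECORD_SCHEMA_JSON = "...";

  private static JavaRDD<HoodieRecord<HoodieAvroPayload>> toHoodieRecords(Dataset<Row> rows) {
    throw new UnsupportedOperationException("dataset-specific conversion to prepped HoodieRecords");
  }
}
```

Running this harness against builds with and without the suspect commit is how the 680-second vs. 380-second timings above were compared.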