[HUDI-1509] Major performance degradation due to rewriting records with default values - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.6.0, 0.7.0, 0.8.0
Fix Version/s: 0.7.0
Component/s: None
Labels:
- pull-request-available

Description

While testing the upgrade from release 0.5.x to 0.6.x in-house, I detected a performance degradation for writes into HUDI. I traced the issue to the changes in the following commit:

[HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema
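To make the behavior named in that commit title concrete, here is a minimal sketch of rewriting a record against a new schema and filling in defaults for absent fields. It is a simplified model, not Hudi's or Avro's implementation: `RewriteSketch`, `Field`, and `rewrite` are hypothetical names, and plain maps stand in for records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of rewriting a record against a new schema.
// Plain maps stand in for records; this is not Hudi's or Avro's code.
final class RewriteSketch {

    // A schema field: its name and the default used when the record has no value.
    static final class Field {
        final String name;
        final Object defaultValue;

        Field(String name, Object defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }
    }

    // Builds a new record with every field of newSchema, copying the input
    // value when present and the field default otherwise.
    static Map<String, Object> rewrite(Map<String, Object> input, List<Field> newSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field field : newSchema) {
            Object value = input.get(field.name);
            out.put(field.name, value != null ? value : field.defaultValue);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> schema = Arrays.asList(new Field("id", 0L), new Field("city", "unknown"));
        Map<String, Object> input = new HashMap<>();
        input.put("id", 42L); // "city" is absent, so its default is copied
        System.out.println(rewrite(input, schema));
    }
}
```

The loop does work for every field of every record, so any extra cost in the per-field default handling is multiplied by the field count and the record count.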

I wrote a unit test to reduce the scope of testing as follows:

1. Take an existing parquet file from a production dataset (size = 690 MB, 960K records).
2. Read all the records from this parquet file into a JavaRDD.
3. Time the call to HoodieWriteClient.bulkInsertPrepped() with bulkInsertParallelism=1.
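The timing step above can be sketched with a small wall-clock helper. This is only an illustration under stated assumptions: `BulkInsertTimer` and `timeMillis` are hypothetical names, and the workload here is a placeholder where the real test would invoke HoodieWriteClient.bulkInsertPrepped() on the prepared JavaRDD.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for timing a single call; not part of Hudi.
final class BulkInsertTimer {

    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload. In the test described above, this lambda would
        // wrap the HoodieWriteClient.bulkInsertPrepped() call instead.
        long elapsed = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable, keeps the loop observable");
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

System.nanoTime() is used rather than System.currentTimeMillis() because it is monotonic and therefore unaffected by clock adjustments during a run that lasts several minutes.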

The above scenario is taken directly from our production pipelines, where each executor ingests about a million records and creates a single parquet file in a COW dataset. This is a bulk-insert-only dataset.

When I reverted the above commit, the time to complete bulkInsertPrepped() decreased from 680 seconds to 380 seconds.

Schema details: This HUDI dataset uses a large schema with 51 fields in the record.
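As a rough back-of-the-envelope check, the figures above work out to about 312 microseconds of extra time per record, or about 6 microseconds per field. The sketch below does that arithmetic. It assumes the whole 300-second difference is attributable to the reverted commit and treats "960K" as exactly 960,000 records; `OverheadEstimate` is a hypothetical name.

```java
// Hypothetical helper for the arithmetic only; the inputs are the numbers
// reported in this issue (680 s vs 380 s, ~960K records, 51 fields).
final class OverheadEstimate {

    // Extra wall-clock time per record, in microseconds.
    static double extraMicrosPerRecord(double slowSeconds, double fastSeconds, long records) {
        return (slowSeconds - fastSeconds) * 1_000_000.0 / records;
    }

    public static void main(String[] args) {
        double perRecord = extraMicrosPerRecord(680.0, 380.0, 960_000L);
        double perField = perRecord / 51; // 51 fields per record
        double slowdown = 680.0 / 380.0;  // ratio of slow run to fast run

        System.out.printf("extra per record: %.1f us%n", perRecord);
        System.out.printf("extra per field:  %.2f us%n", perField);
        System.out.printf("slowdown factor:  %.2fx%n", slowdown);
    }
}
```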

Issue Links

links to

GitHub Pull Request #2424

People

Assignee: Nishith Agarwal
Reporter: Prashant Wason
Votes: 1
Watchers: 6

Dates

Created: 06/Jan/21 00:47
Updated: 02/Feb/21 14:47
Resolved: 02/Feb/21 14:47