Details
- Type: Sub-task
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
I have identified three performance bottlenecks in the finalizeWrite function that are becoming more prominent with the new bootstrap mechanism on S3:
- https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425 is a serial operation performed at the driver, and it can take a long time when there are many partitions and a large number of files.
- The invalid data paths are stored in a List instead of a Set, so the following operation becomes O(N^2) and takes significant time to compute at the driver (see the HashSet sketch after this list): https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429
- https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473 does a recursive delete of the marker directory at the driver. This is again extremely expensive with a large number of partitions and files (see the parallel-delete sketch below).
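To illustrate the Set-based fix for the second item (the class and method names below are hypothetical, not the actual Hudi code): collecting the invalid data paths into a HashSet makes each membership check O(1), so filtering N written paths against M invalid paths drops from O(N * M) to roughly O(N + M).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class InvalidPathFilterSketch {
  /**
   * Hypothetical example: drop written paths that were flagged as invalid.
   * Building a HashSet up front makes each contains() call O(1); with a
   * List, each check is a linear scan, making the whole filter O(N * M).
   */
  public static List<String> filterOutInvalid(List<String> writtenPaths,
                                              List<String> invalidPaths) {
    Set<String> invalid = new HashSet<>(invalidPaths);
    return writtenPaths.stream()
        .filter(path -> !invalid.contains(path))
        .collect(Collectors.toList());
  }
}
```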
In a test with a 1 TB data set containing 8,000 partitions and approximately 190,000 files, this whole process took 35 minutes. There is scope to address these performance issues with Spark parallelization and appropriate data structures.
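As a minimal sketch of that parallelization applied to the marker-directory delete (the helper below is illustrative, not the actual Hudi implementation): list the immediate children of the marker directory at the driver, delete each subtree in a Spark task, and remove the now-empty root at the driver.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelMarkerCleanerSketch {
  /**
   * Hypothetical example: instead of one recursive delete at the driver,
   * fan out the deletes so each Spark task removes one subtree of the
   * marker directory, then delete the empty root at the driver.
   */
  public static void deleteMarkerDir(JavaSparkContext jsc, String markerDirPath,
                                     int parallelism) throws IOException {
    Path markerDir = new Path(markerDirPath);
    FileSystem fs = markerDir.getFileSystem(jsc.hadoopConfiguration());
    if (!fs.exists(markerDir)) {
      return;
    }
    List<String> subPaths = Arrays.stream(fs.listStatus(markerDir))
        .map(status -> status.getPath().toString())
        .collect(Collectors.toList());
    if (!subPaths.isEmpty()) {
      jsc.parallelize(subPaths, Math.min(parallelism, subPaths.size()))
          .foreach(subPath -> {
            // FileSystem handles are not serializable, so each task opens its
            // own. A real implementation would also ship the driver's Hadoop
            // configuration (e.g. S3 credentials) to the tasks instead of
            // relying on executor-side defaults as this sketch does.
            Path path = new Path(subPath);
            FileSystem taskFs = path.getFileSystem(new Configuration());
            taskFs.delete(path, true);
          });
    }
    fs.delete(markerDir, true); // remove the now-empty root
  }
}
```

The same pattern would apply to the first item: the serial per-partition work at the driver can be distributed by parallelizing over the partition paths.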