Details
- Type: Sub-task
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
I have identified three performance bottlenecks in the finalizeWrite function that are becoming more prominent with the new bootstrap mechanism on S3:
- https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425 is a serial operation performed at the driver, and it can take a long time when there are many partitions and a large number of files.
- The invalid data paths are stored in a List instead of a Set, so the following operation becomes O(N^2) and takes significant time to compute at the driver (see the HashSet sketch after this list): https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429
- https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473 does a recursive delete of the marker directory at the driver. This is again extremely expensive with a large number of partitions and files (see the parallel-delete sketch below).
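To illustrate the Set-based fix for the second item (the class and method names below are hypothetical, not the actual Hudi code): collecting the invalid data paths into a HashSet makes each membership check O(1), so filtering N written paths against M invalid paths drops from O(N * M) to roughly O(N + M).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class InvalidPathFilterSketch {
  /**
   * Hypothetical example: drop written paths that were flagged as invalid.
   * Building a HashSet up front makes each contains() call O(1); with a
   * List, each check is a linear scan, making the whole filter O(N * M).
   */
  public static List<String> filterOutInvalid(List<String> writtenPaths,
                                              List<String> invalidPaths) {
    Set<String> invalid = new HashSet<>(invalidPaths);
    return writtenPaths.stream()
        .filter(path -> !invalid.contains(path))
        .collect(Collectors.toList());
  }
}
```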
In a test with a 1 TB data set containing 8,000 partitions and approximately 190,000 files, this whole process took 35 minutes. There is scope to address these performance issues with Spark parallelization and appropriate data structures.
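As a minimal sketch of that parallelization applied to the marker-directory delete (the helper below is illustrative, not the actual Hudi implementation): list the immediate children of the marker directory at the driver, delete each subtree in a Spark task, and remove the now-empty root at the driver.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelMarkerCleanerSketch {
  /**
   * Hypothetical example: instead of one recursive delete at the driver,
   * fan out the deletes so each Spark task removes one subtree of the
   * marker directory, then delete the empty root at the driver.
   */
  public static void deleteMarkerDir(JavaSparkContext jsc, String markerDirPath,
                                     int parallelism) throws IOException {
    Path markerDir = new Path(markerDirPath);
    FileSystem fs = markerDir.getFileSystem(jsc.hadoopConfiguration());
    if (!fs.exists(markerDir)) {
      return;
    }
    List<String> subPaths = Arrays.stream(fs.listStatus(markerDir))
        .map(status -> status.getPath().toString())
        .collect(Collectors.toList());
    if (!subPaths.isEmpty()) {
      jsc.parallelize(subPaths, Math.min(parallelism, subPaths.size()))
          .foreach(subPath -> {
            // FileSystem handles are not serializable, so each task opens its
            // own. A real implementation would also ship the driver's Hadoop
            // configuration (e.g. S3 credentials) to the tasks instead of
            // relying on executor-side defaults as this sketch does.
            Path path = new Path(subPath);
            FileSystem taskFs = path.getFileSystem(new Configuration());
            taskFs.delete(path, true);
          });
    }
    fs.delete(markerDir, true); // remove the now-empty root
  }
}
```

The same pattern would apply to the first item: the serial per-partition work at the driver can be distributed by parallelizing over the partition paths.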