[HUDI-724] Parallelize GetSmallFiles For Partitions - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.6.0, 0.5.3
Component/s: performance, writer-core
Labels:
- pull-request-available

Description

When writing data, a gap was observed between Spark stages. Tracing where the time was spent on the Spark driver showed that it went to the get-small-files operation for partitions.

When the UpsertPartitioner is created and tries to assign insert records, it uses a plain for-loop to get the list of small files for every partition that the load is going to write data to. This is very slow when there are many partitions to go through. While the operation runs in the Spark driver process, all the worker nodes sit idle waiting for tasks.

These partitions don't affect each other, so the get-small-files operations can be parallelized. The change I made passes the JavaSparkContext to the UpsertPartitioner, creates an RDD for the partitions, and sends the get-small-files operations to multiple tasks.
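The difference between the two approaches can be sketched without a Spark cluster. The example below is not the code from the pull request: it swaps the RDD for a plain Java parallel stream so that it runs with the standard library alone, and every name in it (`SmallFilesSketch`, `smallFilesIn`, `sizeLimitBytes`, the file names and sizes) is made up for illustration. It only shows that the per-partition lookups are independent, so the sequential loop and the parallel version return the same result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative sketch only: none of these names come from Hudi.
final class SmallFilesSketch {

    // Hypothetical per-partition lookup: names of files below the size limit.
    // In the real writer this step lists files for one partition.
    static List<String> smallFilesIn(Map<String, Long> fileSizes, long sizeLimitBytes) {
        return fileSizes.entrySet().stream()
                .filter(e -> e.getValue() < sizeLimitBytes)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    // Before: a plain for-loop, one partition at a time on a single thread.
    static Map<String, List<String>> sequential(
            Map<String, Map<String, Long>> table, long sizeLimitBytes) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> partition : table.entrySet()) {
            result.put(partition.getKey(),
                    smallFilesIn(partition.getValue(), sizeLimitBytes));
        }
        return result;
    }

    // After: each partition is looked up independently, so the work can be
    // spread across threads (a stand-in here for Spark tasks over an RDD).
    static Map<String, List<String>> parallel(
            Map<String, Map<String, Long>> table, long sizeLimitBytes) {
        return table.entrySet().parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Map.Entry::getKey,
                        e -> smallFilesIn(e.getValue(), sizeLimitBytes)));
    }

    public static void main(String[] args) {
        // Made-up partitions and file sizes in bytes.
        Map<String, Map<String, Long>> table = Map.of(
                "2020/03/18", Map.of("a.parquet", 10L, "b.parquet", 500L),
                "2020/03/19", Map.of("c.parquet", 20L, "d.parquet", 30L));
        System.out.println(new TreeMap<>(parallel(table, 100L)));
        System.out.println(sequential(table, 100L).equals(parallel(table, 100L)));
    }
}
```

With Spark, the same shape would distribute the list of partition paths to executors instead of threads, which is what keeps the worker nodes busy while the lookups run.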

Screenshots attached for:

- the gap without the improvement
- the Spark stage with the improvement (no gap)

Attachments

- gap.png, 19/Mar/20 23:26, 207 kB, Feichi Feng
- nogapAfterImprovement.png, 19/Mar/20 23:31, 94 kB, Feichi Feng

Issue Links

links to

GitHub Pull Request #1421

People

Assignee: Unassigned
Reporter: Feichi Feng
Votes: 1
Watchers: 4

Dates

Created: 19/Mar/20 23:31
Updated: 17/May/20 21:49
Resolved: 30/Mar/20 16:59

Time Tracking

Estimated: 48h
Remaining: 47h 20m
Logged: 40m