[HUDI-6553] Speedup column stats and bloom index creation on large datasets - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available
- release-0.14.0-blocker

Description

During initialization of the column_stats and bloom_filter metadata table (MDT) partitions, the code that creates the records for these partitions works as follows:

- Create a Map of partitionName -> List of files in that partition
- Parallelize the Map
- Each executor handles all the files of a single partition

For large datasets, this design has the following limitations:

- Each executor handles a single partition, so initialization cannot be sped up by adding executors beyond the number of partitions.
- If one partition has far more files than the others, a single executor becomes the bottleneck for completing initialization while the other executors sit idle.
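
To make the bottleneck concrete, here is a rough back-of-the-envelope sketch. It is not Hudi code: the partition names, file counts, and the assumption that every file costs one unit of work are all made up for illustration. It compares the longest-running executor when whole partitions are the unit of work against an idealized split where single files are the unit of work.

```python
import math

def partition_level_makespan(files_per_partition, executors):
    """Work done by the busiest executor when each task is a whole partition.

    Assumption: processing one file costs one unit of work.
    """
    loads = [0] * executors
    # Greedily hand the largest remaining partition to the least-loaded executor.
    for cost in sorted(files_per_partition.values(), reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += cost
    return max(loads)

def file_level_makespan(files_per_partition, executors):
    """Work done by the busiest executor when files are spread evenly."""
    total_files = sum(files_per_partition.values())
    return math.ceil(total_files / executors)

# Hypothetical skewed table: one partition holds almost all of the files.
counts = {"2023/07/01": 10_000, "2023/07/02": 100, "2023/07/03": 100, "2023/07/04": 100}

print(partition_level_makespan(counts, 4))  # 10000
print(partition_level_makespan(counts, 8))  # 10000, extra executors do not help
print(file_level_makespan(counts, 4))       # 2575
print(file_level_makespan(counts, 8))       # 1288
```

Under these assumptions, doubling the executors changes nothing for the partition-level schedule, because the largest partition still lands on one executor.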

In this enhancement, I am changing the parallelism to be at the file level.
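
The idea can be sketched roughly as follows. This is plain Python rather than the actual change in the linked pull requests; the helper names `flatten_to_file_tasks` and `split_evenly` and the file names are hypothetical. The partition -> files Map is flattened into one (partition, file) pair per file, and the pairs are then split into as many slices as the desired parallelism.

```python
def flatten_to_file_tasks(partition_to_files):
    """Turn {partition: [files]} into a flat list of (partition, file) pairs."""
    return [
        (partition, file_name)
        for partition, files in partition_to_files.items()
        for file_name in files
    ]

def split_evenly(tasks, parallelism):
    """Split tasks round-robin into at most `parallelism` slices."""
    # Never create more slices than tasks, and always create at least one.
    parallelism = max(1, min(parallelism, len(tasks)))
    return [tasks[i::parallelism] for i in range(parallelism)]

# Hypothetical skewed input: p1 has five files, p2 has one.
partition_to_files = {
    "p1": ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"],
    "p2": ["f.parquet"],
}

tasks = flatten_to_file_tasks(partition_to_files)
slices = split_evenly(tasks, 3)
print([len(s) for s in slices])  # [2, 2, 2]
```

With partition-level tasks the same input would yield one task of five files and one task of a single file; with file-level tasks each slice carries the same number of files regardless of how the files are distributed across partitions.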

Issue Links

links to:
- GitHub Pull Request #9223
- GitHub Pull Request #9449

People

Assignee: Prashant Wason

Reporter: Prashant Wason

Votes: 0

Watchers: 1

Dates

Created: 18/Jul/23 11:38

Updated: 15/Aug/23 19:38