Apache Hudi / HUDI-6990

Configurable clustering task parallelism


Details

    Description

      When Spark executes a clustering job, it reads the clustering plan, which contains multiple groups, and each group processes many base files or log files. When the parameter `hoodie.clustering.plan.strategy.sort.columns` is configured, those files are read through Spark's parallelize method, and every file read generates one subtask. This is unreasonable: the number of subtasks is tied to the number of files instead of being a configurable parallelism.
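      The effect described above can be sketched without Spark: `parallelize(items, numSlices)` splits the input into `numSlices` partitions, each becoming one subtask. The slicing helper below mimics that behavior; the class name, the helper, and the cap value `maxParallelism` are illustrative assumptions, not actual Hudi or Spark code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a clustering group's files into read slices,
// mimicking how Spark's parallelize(files, numSlices) turns each slice
// into one subtask. A configurable cap bounds the subtask count instead
// of producing one subtask per file.
public class ClusteringParallelismSketch {
    // Evenly partition `items` into at most `numSlices` non-empty slices,
    // the same index arithmetic Spark uses for range slicing.
    static <T> List<List<T>> slice(List<T> items, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int n = items.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            if (start < end) {
                slices.add(new ArrayList<>(items.subList(start, end)));
            }
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            files.add("file_" + i);
        }

        // Before: one subtask per file, so 100 files -> 100 subtasks.
        int before = slice(files, files.size()).size();

        // After: a configurable cap (value is an assumption for the sketch).
        int maxParallelism = 8;
        int after = slice(files, Math.min(maxParallelism, files.size())).size();

        System.out.println(before + " " + after);
    }
}
```

      With 100 files, the first call yields 100 subtasks and the capped call yields 8, which is the behavioral change the before/after screenshots attached below illustrate.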

      Attachments

        1. after-subtasks.png
          83 kB
          Askwang
        2. before-subtasks.png
          57 kB
          Askwang


            People

              Assignee: Unassigned
              Reporter: Askwang (ksmou)
              Votes: 0
              Watchers: 2
