[HUDI-7906] improve the parallelism deduce in rdd write - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16.0, 1.0.0
Component/s: None
Labels:
- pull-request-available

Description

As described in https://github.com/apache/hudi/issues/11274 and https://github.com/apache/hudi/pull/11463, there are two problems.

If the RDD is an input RDD that has not been shuffled, the partition number can be far too large or too small.
Users cannot easily control it:
1. In some cases, users can change it by setting `spark.default.parallelism`.
2. In some cases, users cannot change it because the value is hard-coded.
3. In Spark, the preferred way to control it is `spark.default.parallelism` or `spark.sql.shuffle.partitions`; the other options are advanced settings in Hudi.
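
The precedence described in the list above can be sketched as a small pure function. This is only an illustration under stated assumptions, not Hudi's actual implementation: the function name `deduce_parallelism`, the `hudi_parallelism` parameter, and the relative order of the two Spark keys are hypothetical. Only the two Spark configuration keys come from this issue.

```python
from typing import Mapping, Optional


def deduce_parallelism(
    conf: Mapping[str, str],
    input_partitions: int,
    hudi_parallelism: Optional[int] = None,
) -> int:
    """Pick a write parallelism for an input RDD that has no shuffle.

    Hypothetical helper, for illustration only.
    """
    # Assumption: an explicit Hudi-level value is the "advanced" override.
    if hudi_parallelism is not None and hudi_parallelism > 0:
        return hudi_parallelism

    # Otherwise prefer the standard Spark settings named in this issue.
    # Assumption: the SQL shuffle setting is checked first.
    for key in ("spark.sql.shuffle.partitions", "spark.default.parallelism"):
        value = conf.get(key)
        if value is not None and int(value) > 0:
            return int(value)

    # Nothing configured: fall back to the input RDD's own partition count.
    return input_partitions
```

With a sketch like this, an input RDD with 5000 partitions would be written with whatever the user configured in Spark, rather than inheriting 5000 or a hard-coded value.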

Issue Links

is caused by

HUDI-4924 Dedup parallelism is not auto tuned based on input

Closed

links to

GitHub Pull Request #11470

People

Assignee: KnightChess

Reporter: KnightChess

Votes: 0

Watchers: 2

Dates

Created: 18/Jun/24 15:47

Updated: 24/Jun/24 14:11

Resolved: 22/Jun/24 04:30
