Apache Hudi / HUDI-993

Use hoodie.delete.shuffle.parallelism for Delete API


Details

    Description

      While HUDI-328 introduced the Delete API, I noticed that the deduplicateKeys method doesn't apply any parallelism to its RDD operation, whereas deduplicateRecords for upsert does.

      And "hoodie.delete.shuffle.parallelism" doesn't seem to be used.

       

      I found that in certain cases, e.g. when the input RDD has low parallelism but the target table has large files, some Spark stages suffer from that low parallelism, and an upsert with "EmptyHoodieRecordPayload" ends up faster than the Delete API.

      This is because "hoodie.combine.before.upsert" is true by default; if combining were disabled, upsert would suffer from the same issue.
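
      As a concrete illustration of that workaround, this is roughly how the same deletion can be issued through upsert with EmptyHoodieRecordPayload, so that the upsert-side parallelism and combine settings apply. The write-client wiring here is assumed boilerplate (and the HoodieWriteClient package differs between Hudi versions), not code from the issue:

{code:java}
import org.apache.hudi.client.HoodieWriteClient; // package name depends on the Hudi version
import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

class DeleteViaUpsertSketch {
  // Turn the keys to delete into records carrying an empty payload and upsert them.
  // The upsert path then honors hoodie.upsert.shuffle.parallelism and
  // hoodie.combine.before.upsert, which is why it can outperform the Delete API here.
  static void deleteViaUpsert(HoodieWriteClient<EmptyHoodieRecordPayload> client,
                              JavaRDD<HoodieKey> keysToDelete,
                              String instantTime) {
    JavaRDD<HoodieRecord<EmptyHoodieRecordPayload>> tombstones =
        keysToDelete.map(key -> new HoodieRecord<>(key, new EmptyHoodieRecordPayload()));
    client.upsert(tombstones, instantTime);
  }
}
{code}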

      So I wonder whether the input RDD should be repartitioned to "hoodie.delete.shuffle.parallelism", especially when "hoodie.combine.before.delete" is false, so that delete performance is good regardless of that setting.
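
      A minimal sketch of that direction follows; the helper and parameter names are hypothetical, and only the config key hoodie.delete.shuffle.parallelism (from which the parallelism value would be read) comes from the issue:

{code:java}
import org.apache.hudi.common.model.HoodieKey;
import org.apache.spark.api.java.JavaRDD;

class DeleteParallelismSketch {
  // Sketch of the proposal: size the delete-side shuffle from
  // hoodie.delete.shuffle.parallelism instead of inheriting the input partitioning.
  static JavaRDD<HoodieKey> prepareKeys(JavaRDD<HoodieKey> keys,
                                        boolean combineBeforeDelete,
                                        int deleteShuffleParallelism) {
    if (combineBeforeDelete) {
      // Dedupe with an explicit target parallelism.
      return keys.distinct(deleteShuffleParallelism);
    }
    // Even without combining, repartition so downstream stages
    // (index lookup, write) don't run with the input RDD's low parallelism.
    return keys.repartition(deleteShuffleParallelism);
  }
}
{code}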

          People

            Assignee: Unassigned
            Reporter: Dongwook Kwon
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:
