Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- None
Description
While HUDI-328 introduced the Delete API, I noticed that the deduplicateKeys method does not apply any parallelism to its RDD operation, whereas the deduplicateRecords method used by upsert does. Also, "hoodie.delete.shuffle.parallelism" does not appear to be used anywhere.
I found cases where the input RDD has low parallelism but the target table has large files, and the Spark job's performance suffers from that low parallelism. In such cases, an upsert with "EmptyHoodieRecordPayload" is faster than the Delete API, as sketched below.
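For illustration, this is roughly what the faster workaround looks like (a rough sketch against the HoodieWriteClient API of that time; keysToDelete, writeClient, and commitTime are placeholder variables, not names from the codebase):
{code:java}
import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

// Wrap each key to delete in an EmptyHoodieRecordPayload and upsert it,
// so the delete goes through the upsert path, where
// hoodie.combine.before.upsert / deduplicateRecords applies parallelism.
JavaRDD<HoodieRecord> deleteRecords = keysToDelete.map(
    key -> new HoodieRecord<>(key, new EmptyHoodieRecordPayload()));
writeClient.upsert(deleteRecords, commitTime);
{code}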
Note that upsert only avoids the problem because "hoodie.combine.before.upsert" is true by default, which makes deduplicateRecords repartition the input; if it were disabled, upsert would suffer from the same issue. So I suggest that the input RDD be repartitioned to "hoodie.delete.shuffle.parallelism" so that delete performs well regardless of the "hoodie.combine.before.delete" setting.
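Concretely, the delete path could do something like the following (only a sketch of the proposal; the two-argument deduplicateKeys and the shouldCombineBeforeDelete/getDeleteShuffleParallelism accessors are assumed names modeled on the existing upsert path, not the final implementation):
{code:java}
// hoodie.delete.shuffle.parallelism, currently unused
int parallelism = config.getDeleteShuffleParallelism();

JavaRDD<HoodieKey> preparedKeys = config.shouldCombineBeforeDelete()
    // combine enabled: dedupe with explicit parallelism, as deduplicateRecords does
    ? deduplicateKeys(keys, parallelism)
    // combine disabled: still repartition so downstream stages get parallelism
    : keys.repartition(parallelism);
{code}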
Attachments
Issue Links
- is related to: HUDI-328 Add support for Delete api in HoodieWriteClient (Closed)
- links to