[HUDI-860] Ability to do small file handling without need for caching - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: writer-core
Labels: None

Epic Link: Improve data locality during ingestion

Description

As of now, in the upsert path:

1. Hudi builds a workload profile to count the total inserts and updates (updates carry location info).
2. Small-file info is then populated.
3. Buckets are populated from the above info.
4. These buckets are later used when getPartition(Object key) is invoked in UpsertPartitioner.
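The steps above can be sketched in plain Java as follows. This is a simplified illustration, not Hudi's actual UpsertPartitioner: the class UpsertBucketSketch, the Rec type, the smallFileCapacity map, and the maxRecordsPerNewFile parameter are all hypothetical names, partition paths are ignored, and inserts are spread by a plain hash of the record key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the profile -> small files -> buckets -> getPartition flow.
class UpsertBucketSketch {
    // fileId == null marks an insert; otherwise the record updates that file.
    static final class Rec {
        final String key;
        final String fileId;
        Rec(String key, String fileId) { this.key = key; this.fileId = fileId; }
    }

    private final Map<String, Integer> fileBuckets = new LinkedHashMap<>(); // fileId -> bucket
    private final List<long[]> insertBuckets = new ArrayList<>();           // {bucket, cumulative inserts}
    private long totalInserts = 0;
    private int bucketCount = 0;

    UpsertBucketSketch(List<Rec> records, Map<String, Long> smallFileCapacity,
                       long maxRecordsPerNewFile) {
        if (maxRecordsPerNewFile <= 0) {
            throw new IllegalArgumentException("maxRecordsPerNewFile must be positive");
        }
        // Step 1: workload profile. This needs a full pass over the input.
        for (Rec r : records) {
            if (r.fileId == null) {
                totalInserts++;
            } else if (!fileBuckets.containsKey(r.fileId)) {
                fileBuckets.put(r.fileId, bucketCount++);
            }
        }
        // Steps 2 and 3: send inserts to small files first, then to new files.
        long assigned = 0;
        for (Map.Entry<String, Long> sf : smallFileCapacity.entrySet()) {
            long take = Math.min(sf.getValue(), totalInserts - assigned);
            if (take <= 0) {
                continue;
            }
            Integer bucket = fileBuckets.get(sf.getKey());
            if (bucket == null) {
                bucket = bucketCount++;
                fileBuckets.put(sf.getKey(), bucket);
            }
            assigned += take;
            insertBuckets.add(new long[] {bucket, assigned});
        }
        while (assigned < totalInserts) {
            assigned += Math.min(maxRecordsPerNewFile, totalInserts - assigned);
            insertBuckets.add(new long[] {bucketCount++, assigned});
        }
    }

    int numPartitions() { return bucketCount; }

    // Step 4: only valid for records that were part of the profiled input.
    int getPartition(Rec r) {
        if (r.fileId != null) {
            return fileBuckets.get(r.fileId);
        }
        long slot = Math.floorMod((long) r.key.hashCode(), totalInserts);
        for (long[] b : insertBuckets) {
            if (slot < b[1]) {
                return (int) b[0];
            }
        }
        throw new IllegalStateException("no insert bucket for " + r.key);
    }
}
```

Note that the constructor cannot assign a single bucket until it has seen every record, which is why the profile forces a full pass over the input before any write can start.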

In step 1, building the global workload profile requires running an action over the entire JavaRDD<HoodieRecord> from the driver, and Hudi also saves the resulting workload profile.

For large, write-intensive batch jobs (COW table type), the caching this requires incurs additional overhead. This effort explores whether that caching can be avoided.
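The trade-off can be illustrated with a small plain-Java model, under the assumption that the input is lazily evaluated and is traversed once for the profile and once for the write. The names TwoPassSketch, lazyInput, cached, and upsert are hypothetical and do not correspond to Hudi or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model: a lazy input that is recomputed on every traversal.
class TwoPassSketch {
    static int upstreamEvaluations = 0;

    // Stand-in for a lazily evaluated dataset; each get() redoes the upstream work.
    static Supplier<List<String>> lazyInput() {
        return () -> {
            upstreamEvaluations++;
            List<String> out = new ArrayList<>();
            for (int i = 0; i < 5; i++) {
                out.add("key-" + i);
            }
            return out;
        };
    }

    // Stand-in for caching: the first traversal materializes and keeps the data.
    static <T> Supplier<T> cached(Supplier<T> source) {
        Object[] slot = new Object[1];
        boolean[] filled = new boolean[1];
        return () -> {
            if (!filled[0]) {
                slot[0] = source.get();
                filled[0] = true;
            }
            @SuppressWarnings("unchecked")
            T value = (T) slot[0];
            return value;
        };
    }

    // Two passes over the same input: one to profile, one to write.
    static int upsert(Supplier<List<String>> input) {
        int profiled = input.get().size(); // pass 1: workload profile
        int written = input.get().size();  // pass 2: actual write
        return Math.min(profiled, written);
    }
}
```

Without caching, the upstream work runs twice; with caching it runs once, but the whole input must be held in memory or on disk between the two passes. Avoiding the up-front profiling pass would remove the need to choose between the two.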

People

Assignee: Alexey Kudinkin

Reporter: Vinoth Chandar

Votes: 0

Watchers: 5

Dates

Created: 02/May/20 01:53

Updated: 20/Aug/22 23:08