Apache Hudi
HUDI-860 (sub-task of HUDI-1628: [Umbrella] Improve data locality during ingestion)

Ability to do small file handling without need for caching


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: Writer Core
    • Labels: None

    Description

      As of now, in the upsert path:

      • Hudi builds a workloadProfile to determine the total inserts and updates (with location info).
      • Small-file information is then populated.
      • Buckets are then populated from the above information.
      • These buckets are later used when getPartition(Object key) is invoked in UpsertPartitioner.
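The four steps above can be sketched roughly as follows. This is a simplified illustration, not Hudi's actual implementation: the class `UpsertPartitionerSketch`, the record type `Rec`, the `smallFiles` map, the `maxRecordsPerFile` limit, the `new-file-N` naming, and the hash-based insert routing are all assumptions made for the example. Only the overall flow (profile, small files, buckets, `getPartition`) comes from the description.

```java
import java.util.*;

public class UpsertPartitionerSketch {
    // Hypothetical, simplified record: currentFileId is null for an insert.
    record Rec(String key, String partitionPath, String currentFileId) {}

    final List<String> bucketFileIds = new ArrayList<>();          // bucket number -> file id
    final Map<String, Integer> bucketByFileId = new HashMap<>();   // file id -> bucket number
    final Map<String, List<long[]>> insertSlots = new HashMap<>(); // partition -> {bucket, inserts assigned}

    UpsertPartitionerSketch(List<Rec> records,
                            Map<String, Map<String, Long>> smallFiles, // partition -> (file id -> record count)
                            long maxRecordsPerFile) {
        // Step 1: workload profile. This needs a full pass over the incoming records.
        Map<String, Long> insertsPerPartition = new TreeMap<>();
        Set<String> updatedFileIds = new TreeSet<>();
        for (Rec r : records) {
            if (r.currentFileId() == null) insertsPerPartition.merge(r.partitionPath(), 1L, Long::sum);
            else updatedFileIds.add(r.currentFileId());
        }
        // One update bucket per existing file that receives updates.
        for (String fileId : updatedFileIds) bucketFor(fileId);

        // Steps 2 and 3: fill small files first, then open new files for the remaining inserts.
        for (Map.Entry<String, Long> e : insertsPerPartition.entrySet()) {
            long remaining = e.getValue();
            List<long[]> slots = insertSlots.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            Map<String, Long> partitionSmallFiles =
                new TreeMap<>(smallFiles.getOrDefault(e.getKey(), Map.of()));
            for (Map.Entry<String, Long> sf : partitionSmallFiles.entrySet()) {
                long assigned = Math.min(remaining, maxRecordsPerFile - sf.getValue());
                if (assigned <= 0) continue;
                slots.add(new long[] {bucketFor(sf.getKey()), assigned});
                remaining -= assigned;
            }
            while (remaining > 0) {
                long assigned = Math.min(remaining, maxRecordsPerFile);
                slots.add(new long[] {bucketFor("new-file-" + bucketFileIds.size()), assigned});
                remaining -= assigned;
            }
        }
    }

    // Returns the bucket number for a file id, creating the bucket on first use.
    private int bucketFor(String fileId) {
        return bucketByFileId.computeIfAbsent(fileId, f -> {
            bucketFileIds.add(f);
            return bucketFileIds.size() - 1;
        });
    }

    // Step 4: route a record to a bucket number.
    int getPartition(Rec r) {
        if (r.currentFileId() != null) return bucketByFileId.get(r.currentFileId());
        List<long[]> slots = insertSlots.get(r.partitionPath());
        long total = slots.stream().mapToLong(s -> s[1]).sum();
        // Deterministic position in [0, total), spread across slots by their assigned counts.
        long target = Math.floorMod((long) r.key().hashCode(), total);
        for (long[] s : slots) {
            if (target < s[1]) return (int) s[0];
            target -= s[1];
        }
        throw new IllegalStateException("unreachable: slots cover [0, total)");
    }
}
```

Note that the constructor cannot assign a single bucket until it has seen every record, which is why the profile step forces a full pass over the input before any write starts.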

      In the first step, building the global workload profile requires the driver to run an action over the entire JavaRDD<HoodieRecord>, and Hudi also saves the workload profile.

      For large, write-intensive batch jobs (COW table types), caching this incurs additional overhead. This effort explores whether that caching can be avoided.
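The trade-off can be illustrated without Spark. The sketch below is an analogy under stated assumptions, not Hudi or Spark code: `RecomputeSketch`, `lazyInput`, `profile`, `write`, and `run` are hypothetical names, and a `Supplier<Stream<String>>` stands in for a lazily evaluated input. When two passes (profile, then write) run over an uncached lazy input, the source is computed twice. Caching avoids the recomputation but holds every record in memory.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecomputeSketch {
    // Hypothetical lazy input: each get() recomputes the records and bumps the counter once per record.
    static Supplier<Stream<String>> lazyInput(List<String> source, AtomicInteger recordsComputed) {
        return () -> source.stream().peek(r -> recordsComputed.incrementAndGet());
    }

    // Pass 1: stand-in for the workload profile (record count per partition path).
    static Map<String, Long> profile(Supplier<Stream<String>> input) {
        return input.get().collect(Collectors.groupingBy(r -> r.split("/")[0], Collectors.counting()));
    }

    // Pass 2: stand-in for the actual write.
    static List<String> write(Supplier<Stream<String>> input) {
        return input.get().collect(Collectors.toList());
    }

    // Returns how many records were computed from the source across both passes.
    static int run(List<String> source, boolean cache) {
        AtomicInteger recordsComputed = new AtomicInteger();
        Supplier<Stream<String>> input = lazyInput(source, recordsComputed);
        if (cache) {
            // Materialize once; the whole input is now held in memory.
            List<String> materialized = input.get().collect(Collectors.toList());
            input = materialized::stream;
        }
        profile(input);
        write(input);
        return recordsComputed.get();
    }
}
```

With three source records, `run(source, false)` computes six records (two full passes) and `run(source, true)` computes three, at the cost of keeping the materialized list around.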

       

       


          People

            alexey.kudinkin Alexey Kudinkin
            vinoth Vinoth Chandar

            Dates

              Created:
              Updated:
