  1. Apache Hudi
  2. HUDI-1398

Align insert file size for reducing IO

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • None
    • 0.7.0
    • None

    Description

      Currently, if any records are left over, we write the totalUnassignedInserts into new files

      and set the number of records per new bucket as follows:

      recordsPerBucket.add(totalUnassignedInserts / insertBuckets); (https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java, line 188)

      This only computes the average number of records per bucket, so it may create new small files.

      For example:

      totalUnassignedInserts = 250

      insertRecordsPerBucket = 120

      so insertBuckets = 3 (e.g. file_a, file_b, file_c)

      then file_a = file_b = file_c = 83 (250 / 3 with integer division)

      All three files are below insertRecordsPerBucket, so all three will be treated as small files when the next delta is processed.

      We can reduce IO by setting file_a = 120, file_b = 120, file_c = 10 instead, which leaves only one small file.
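
      The arithmetic in the example above can be sketched as follows. This is a minimal illustration, not the actual UpsertPartitioner code: the class name BucketSizing and the methods insertBuckets, averageSplit and alignedSplit are hypothetical, and it assumes insertBuckets is the ceiling of totalUnassignedInserts / insertRecordsPerBucket with insertRecordsPerBucket > 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hudi.
public class BucketSizing {

    // Assumption: number of new buckets = ceil(total / perBucket), perBucket > 0.
    static int insertBuckets(long totalUnassignedInserts, long insertRecordsPerBucket) {
        return (int) ((totalUnassignedInserts + insertRecordsPerBucket - 1) / insertRecordsPerBucket);
    }

    // Behaviour described in the ticket: every new bucket gets the integer average.
    static List<Long> averageSplit(long totalUnassignedInserts, long insertRecordsPerBucket) {
        int buckets = insertBuckets(totalUnassignedInserts, insertRecordsPerBucket);
        List<Long> recordsPerBucket = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            recordsPerBucket.add(totalUnassignedInserts / buckets);
        }
        return recordsPerBucket;
    }

    // Proposed behaviour: fill each bucket to insertRecordsPerBucket,
    // and put whatever is left into the last bucket.
    static List<Long> alignedSplit(long totalUnassignedInserts, long insertRecordsPerBucket) {
        int buckets = insertBuckets(totalUnassignedInserts, insertRecordsPerBucket);
        List<Long> recordsPerBucket = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            boolean last = (i == buckets - 1);
            long remainder = totalUnassignedInserts - (long) (buckets - 1) * insertRecordsPerBucket;
            recordsPerBucket.add(last ? remainder : insertRecordsPerBucket);
        }
        return recordsPerBucket;
    }

    public static void main(String[] args) {
        System.out.println(averageSplit(250, 120)); // [83, 83, 83]
        System.out.println(alignedSplit(250, 120)); // [120, 120, 10]
    }
}
```

      With the aligned split only the last file is below the target size, so at most one of the newly created files would be picked up as a small file later. Note that the integer average in the first variant also drops the remainder (3 * 83 = 249), whereas the aligned split always sums to totalUnassignedInserts.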

          People

            Assignee: Unassigned
            Reporter: henryz steven zhang