  1. Apache Hudi
  2. HUDI-1509

Major performance degradation due to rewriting records with default values

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.6.0, 0.7.0, 0.8.0
    • Fix Version/s: 0.7.0
    • Component/s: None

      Description

      During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI. I traced the issue to the changes in the following commit:

      [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

      I wrote a unit test to reduce the scope of testing as follows:

      1. Take an existing parquet file from production dataset (size=690MB, #records=960K)
      2. Read all the records from this parquet into a JavaRDD
      3. Time the call to HoodieWriteClient.bulkInsertPrepped() (bulkInsertParallelism=1)
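
      The timing in step 3 can be sketched as a small wall-clock harness. This is a minimal illustration, not the actual unit test: the class name BulkInsertTimer, the Timed holder, and the placeholder workload are all hypothetical, and the real test would wrap the HoodieWriteClient.bulkInsertPrepped() call in the lambda instead.

```java
import java.util.function.Supplier;

public class BulkInsertTimer {

    /** Holds the value returned by a timed call and its elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        public double elapsedSeconds() {
            return elapsedNanos / 1e9;
        }
    }

    /** Runs the action once and measures it with the monotonic System.nanoTime() clock. */
    public static <T> Timed<T> time(Supplier<T> action) {
        long start = System.nanoTime();
        T value = action.get();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(value, elapsed);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload. In the real test this lambda would
        // wrap the HoodieWriteClient.bulkInsertPrepped() call from step 3.
        Timed<Long> result = time(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.println("result=" + result.value
                + " elapsedSeconds=" + result.elapsedSeconds());
    }
}
```

      Running the same harness against builds with and without the suspect commit, on the same input file, keeps the comparison limited to the one change.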

      The above scenario is taken directly from our production pipelines, where each executor ingests about a million records and creates a single parquet file in a COW dataset. This is a bulk-insert-only dataset.

      When I reverted the above commit, the time to complete bulkInsertPrepped() dropped from 680 seconds to 380 seconds.

      Schema details: This HUDI dataset uses a large schema with 51 fields in the record.
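
      As a rough reading of the reported numbers, the sketch below works out the overhead implied by the two timings. The class and method names are hypothetical, 960K is treated as exactly 960,000 records, and the per-field figure assumes the extra cost scales evenly across all 51 fields, which the report does not confirm.

```java
import java.util.Locale;

public class RegressionMath {
    static final double SLOW_SECONDS = 680.0;   // with the HUDI-727 commit
    static final double FAST_SECONDS = 380.0;   // with the commit reverted
    static final double RECORDS = 960_000.0;    // "960K", assumed exact
    static final int FIELDS = 51;               // fields in the record schema

    /** Extra wall-clock time attributable to the commit: 300 s. */
    static double overheadSeconds() {
        return SLOW_SECONDS - FAST_SECONDS;
    }

    /** Slowdown relative to the reverted build, in percent (about 79%). */
    static double slowdownPercent() {
        return overheadSeconds() / FAST_SECONDS * 100.0;
    }

    /** Extra time per record, in microseconds (312.5). */
    static double overheadMicrosPerRecord() {
        return overheadSeconds() / RECORDS * 1e6;
    }

    /** Extra time per field, assuming an even spread over all fields (about 6.1). */
    static double overheadMicrosPerField() {
        return overheadMicrosPerRecord() / FIELDS;
    }

    /** Write throughput in records per second for a given run time. */
    static double recordsPerSecond(double seconds) {
        return RECORDS / seconds;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "overhead=%.0f s, slowdown=%.1f%%, per record=%.1f us, per field=%.2f us",
                overheadSeconds(), slowdownPercent(),
                overheadMicrosPerRecord(), overheadMicrosPerField()));
        System.out.println(String.format(Locale.ROOT,
                "throughput: %.0f rec/s with commit, %.0f rec/s reverted",
                recordsPerSecond(SLOW_SECONDS), recordsPerSecond(FAST_SECONDS)));
    }
}
```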

              People

              • Assignee: Nishith Agarwal (nishith29)
              • Reporter: Prashant Wason (pwason)
              • Votes: 1
              • Watchers: 6
