HUDI-1509: Major performance degradation due to rewriting records with default values


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.6.0, 0.7.0, 0.8.0
    • Fix Version/s: 0.7.0
    • Component/s: None

    Description

      During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI and traced it to the changes in the following commit:

      [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema
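
      For context, that commit puts every incoming record through a field-by-field rewrite against the writer schema so that Avro default values are carried over. Below is a minimal sketch of that kind of rewrite, not the actual Hudi code; the class and method names are illustrative, and the Avro calls assume the 1.9+ API:

      {code:java}
      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;

      public class RewriteSketch {
        // Copies each field of `record` into a new record under `newSchema`,
        // falling back to the field's declared Avro default when the incoming
        // record carries no value. This per-record, per-field work is the
        // cost being measured below.
        public static GenericRecord rewriteWithDefaults(GenericRecord record, Schema newSchema) {
          GenericRecord rewritten = new GenericData.Record(newSchema);
          for (Schema.Field field : newSchema.getFields()) {
            Object value = record.get(field.name());
            if (value != null) {
              rewritten.put(field.name(), value);
            } else if (field.hasDefaultValue()) {
              // GenericData resolves the JSON-declared default into a runtime value
              rewritten.put(field.name(), GenericData.get().getDefaultValue(field));
            }
          }
          return rewritten;
        }
      }
      {code}

      With a 51-field schema, this loop runs once per field for each of the ~960K records in the test below, which is consistent with the regression described there.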

      To reduce the scope of testing, I wrote a unit test that does the following (a sketch of the timing harness appears after the list):

      1. Take an existing parquet file from a production dataset (size = 690 MB, #records = 960K).
      2. Read all the records from this parquet file into a JavaRDD.
      3. Time the call to HoodieWriteClient.bulkInsertPrepped() (bulkInsertParallelism=1).
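
      A minimal sketch of steps 2-3, assuming the rows have already been converted to prepped HoodieRecords and the write client is built with bulkInsertParallelism=1; the setup is illustrative and the exact method signature varies across releases:

      {code:java}
      import java.util.concurrent.TimeUnit;
      import org.apache.hudi.client.HoodieWriteClient;
      import org.apache.hudi.client.WriteStatus;
      import org.apache.hudi.common.model.HoodieRecord;
      import org.apache.hudi.common.util.Option;
      import org.apache.spark.api.java.JavaRDD;

      // Illustrative harness, not the actual unit test. `records` are the
      // prepped records from step 2; `writeClient` is assumed configured
      // with bulkInsertParallelism=1.
      @SuppressWarnings({"rawtypes", "unchecked"})
      static long timeBulkInsertPrepped(HoodieWriteClient writeClient, JavaRDD<HoodieRecord> records) {
        String instantTime = writeClient.startCommit();

        long start = System.nanoTime();
        JavaRDD<WriteStatus> statuses =
            writeClient.bulkInsertPreppedRecords(records, instantTime, Option.empty());
        // Spark is lazy: run an action so the timing covers the actual write.
        long errors = statuses.filter(WriteStatus::hasErrors).count();
        long elapsedSec = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);

        System.out.println("bulkInsertPrepped took " + elapsedSec + "s (errors=" + errors + ")");
        return elapsedSec;
      }
      {code}

      Forcing evaluation with an action matters here; without it, the timing would exclude the actual write.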

      The above scenario is taken directly from our production pipelines, where each executor ingests about a million records, creating a single parquet file in a COW dataset. This dataset is bulk-insert only.

      The time to complete the bulk insert prepped call dropped from 680 seconds to 380 seconds when I reverted the above commit. That is roughly 300 extra seconds for ~960K records, or about 0.3 ms of additional overhead per record.

      Schema details: this HUDI dataset uses a large schema with 51 fields per record.
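
      To reproduce the effect of schema width in isolation, a self-contained micro-benchmark along the following lines can drive the rewrite sketch above over a synthetic 51-field schema (all names hypothetical; the numbers will not match the production run):

      {code:java}
      import org.apache.avro.Schema;
      import org.apache.avro.SchemaBuilder;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;

      // Hypothetical micro-benchmark: rewrites one synthetic record 960K
      // times through rewriteWithDefaults() above, at the production schema
      // width of 51 fields, to expose how the per-field copy scales.
      public class RewriteBench {
        public static void main(String[] args) {
          // Build a synthetic 51-field schema where every field has a default.
          SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record("Wide").fields();
          for (int i = 0; i < 51; i++) {
            fields = fields.name("f" + i).type().stringType().stringDefault("");
          }
          Schema schema = fields.endRecord();

          // One fully populated record, reused for every iteration.
          GenericRecord rec = new GenericData.Record(schema);
          for (Schema.Field f : schema.getFields()) {
            rec.put(f.name(), "value-" + f.pos());
          }

          long start = System.nanoTime();
          for (int i = 0; i < 960_000; i++) {
            RewriteSketch.rewriteWithDefaults(rec, schema);
          }
          System.out.printf("960K rewrites of a 51-field record: %d ms%n",
              (System.nanoTime() - start) / 1_000_000);
        }
      }
      {code}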

            People

              Assignee: Nishith Agarwal (nishith29)
              Reporter: Prashant Wason (pwason)
              Votes: 1
              Watchers: 6
