  1. Apache Hudi
  2. HUDI-7229

Enable partial updates for CDC work payload


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 1.1.0
    • None

    Description

      OLTP workloads on upstream databases often update, delete, or insert different columns of a table on each operation. Currently, Hudi supports partial updates only when the same columns are mutated in a given write to Hudi (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, and we compare the different possible implementations.
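
      To make the idea concrete, the following is a minimal sketch of how log entries that carry only the changed columns could be folded onto a base record at read time. It is an illustration under stated assumptions, not Hudi's actual log format or merger: the entry layout (`op`, `columns`) and the function name `merge_partial_updates` are hypothetical, and entries are assumed to arrive already ordered.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def merge_partial_updates(base: Optional[Record],
                          log_entries: List[Record]) -> Optional[Record]:
    """Fold ordered log entries onto a base record.

    Hypothetical entry layout: {"op": "insert" | "update" | "delete",
    "columns": {...}}. An update carries only the columns that changed.
    """
    current = dict(base) if base is not None else None
    for entry in log_entries:
        op = entry["op"]
        if op == "delete":
            current = None
        elif op == "insert":
            current = dict(entry["columns"])
        elif op == "update":
            # Only the columns present in the entry are overwritten;
            # an update for a record that does not exist is ignored.
            if current is not None:
                current.update(entry["columns"])
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return current


base_record = {"id": 1, "name": "alice", "email": "a@example.com", "balance": 10}

# Two updates that touch different columns, as an OLTP workload might.
log = [
    {"op": "update", "columns": {"email": "alice@example.com"}},
    {"op": "update", "columns": {"balance": 25}},
]

merged = merge_partial_updates(base_record, log)
```

      Each entry here touches a different column, which is the case the description says is not supported today. A real design would also need to handle ordering fields, schema evolution, and nulls that mean "set to null" rather than "unchanged".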

      Goals

      1. Enable partial update functionality for all existing and potential future CDC workloads without major modification or duplication.
      2. Achieve performance parity with current full-record updates and with partial updates across the same set of columns.
      3. Reduce storage costs by storing only the changed columns.
      4. Reduce computation costs by scanning and processing less data.
      5. Do not affect the scalability of the existing ingestion system. The number of files generated for partial updates should not increase dramatically.
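
      As a rough illustration of goals 3 and 4, the sketch below derives a changed-columns-only record from a CDC before/after image pair and compares its encoded size with the full after image. The helper names (`changed_columns`, `encoded_size`) and the JSON encoding are assumptions made for the example only; Hudi's log blocks use their own formats, and columns that disappear between images are not handled here.

```python
import json
from typing import Any, Dict, List

Record = Dict[str, Any]


def changed_columns(before: Record, after: Record,
                    key_fields: List[str]) -> Record:
    """Return the key fields plus every column whose value differs."""
    partial = {k: after[k] for k in key_fields}
    for col, value in after.items():
        if col not in before or before[col] != value:
            partial[col] = value
    return partial


def encoded_size(record: Record) -> int:
    """Size in bytes of a JSON encoding, used only as a stand-in metric."""
    return len(json.dumps(record, sort_keys=True).encode("utf-8"))


before_image = {"id": 1, "name": "alice", "email": "a@example.com", "balance": 10}
after_image = {"id": 1, "name": "alice", "email": "a@example.com", "balance": 25}

partial_record = changed_columns(before_image, after_image, key_fields=["id"])
full_size = encoded_size(after_image)
partial_size = encoded_size(partial_record)
```

      The partial record keeps the record key so it can be matched to its base record later. The actual savings depend on table width, on how many columns each operation touches, and on the real encoding.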

       


          People

            vinoth Vinoth Chandar
            linliu Lin Liu

            Dates

              Created:
              Updated:

              Agile

                Completed Sprints:
                Sprint 2024-03-25 ended 26/Apr/24
                Sprint 2024-04-26 ended 05/Jun/24