  1. Apache Hudi
  2. HUDI-7229

Enable partial updates for CDC work payload


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 1.1.0
    • None

    Description

      OLTP workloads on upstream databases often update, delete, or insert different columns of a table on each operation. Currently, Hudi can only support partial updates when the same set of columns is mutated in a given write to Hudi (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different possible implementations.
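
      To make the idea concrete, here is a minimal sketch of per-record partial updates, where each change event may touch a different set of columns. It is an illustration only and does not reflect Hudi's log format or any Hudi API; the record layout and the helper names `encode_partial_update` and `merge_partial_updates` are hypothetical. Deletes are not modeled.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def encode_partial_update(key_field: str, before: Record, after: Record) -> Record:
    """Keep the record key plus only the columns whose values changed.

    Hypothetical helper: stands in for whatever would write a log entry.
    """
    changed = {
        col: val
        for col, val in after.items()
        if col not in before or before[col] != val
    }
    # The key is always retained so the entry can be matched to its base record.
    changed[key_field] = after[key_field]
    return changed


def merge_partial_updates(base: Record, log_entries: List[Record]) -> Record:
    """Replay partial log entries over the base record, oldest first."""
    merged = dict(base)
    for entry in log_entries:
        # Columns absent from the entry keep their previous value.
        merged.update(entry)
    return merged


# Two successive CDC events that mutate different columns of the same row.
base = {"id": 1, "name": "alice", "email": "a@example.com", "balance": 100}
after_1 = {**base, "balance": 80}
after_2 = {**after_1, "email": "alice@example.com"}

log = [
    encode_partial_update("id", base, after_1),
    encode_partial_update("id", after_1, after_2),
]
merged = merge_partial_updates(base, log)
```

      In this sketch each log entry carries two fields instead of four, and replaying the entries in order reconstructs the latest full record.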

      Goals

      1. Enable partial update functionality for all existing and potential future CDC workloads without major modification or duplication.
      2. Achieve performance parity with current full-record updates and with partial updates across the same set of columns.
      3. Reduce storage costs by storing only the changed columns.
      4. Reduce computation costs by scanning and processing less data.
      5. Do not affect the scalability of the existing ingestion system; the number of files generated for partial updates should not increase dramatically.

       


            People

              vinoth Vinoth Chandar
              linliu Lin Liu