Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Description
OLTP workloads on upstream databases often update, delete, or insert different columns of a table on each operation. Currently, Hudi supports partial updates only when the same columns are mutated in a given write to Hudi (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different ways it could be implemented.
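To make the idea concrete, here is a minimal sketch in plain Python of a log that stores only the changed columns and is merged back onto a base record at read time. It does not use any Hudi API: the names `diff_columns` and `merge_partial_updates`, the dictionary record layout, and the sample data are all hypothetical and chosen only for illustration. A real log block would also need to carry the record key and ordering metadata, which this sketch leaves out.

```python
from typing import Any, Dict, List

# Hypothetical record layout: a row is a plain column-name -> value mapping.
Record = Dict[str, Any]


def diff_columns(before: Record, after: Record) -> Record:
    """Return only the columns whose values differ between two row images."""
    return {
        col: val
        for col, val in after.items()
        if col not in before or before[col] != val
    }


def merge_partial_updates(base: Record, log_entries: List[Record]) -> Record:
    """Apply partial-update log entries to a base record, oldest first."""
    merged = dict(base)  # copy so the base record is left untouched
    for entry in log_entries:
        merged.update(entry)  # later entries overwrite earlier column values
    return merged


# Two upstream operations that each touch a different column.
base = {"id": 1, "name": "alice", "city": "Paris", "balance": 100}
after_first = {**base, "balance": 80}
after_second = {**after_first, "city": "Berlin"}

# The log keeps one small entry per operation instead of a full row image.
log = [diff_columns(base, after_first), diff_columns(after_first, after_second)]

# Reading the record merges the base with the log entries in order.
current = merge_partial_updates(base, log)
```

In this sketch each log entry holds a single column rather than all four, which is the storage saving the proposal is after; the cost moves to the read path, which has to merge entries in order.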
Goals
- Enable partial update functionality for all existing and potential future CDC workloads without huge modification or duplication.
- Achieve performance parity with current full-record updates or partial updates across the same set of columns.
- Reduce storage costs by storing only the changed columns.
- Reduce computation costs by scanning and processing less data.
- Do not affect the scalability of the existing ingestion system. The number of files generated for partial updates should not increase dramatically.