HUDI-7538: Consolidate the CDC Formats (changelog format, RFC-51)


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: storage-management
    • Labels: None

    Description

      For the sake of more consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW for Spark/Flink).

       

      Format Name | CDC Source Required | Resource Cost (writer) | Resource Cost (reader) | Friendly to Streaming
      CDC | No | low/high | low/high (based on the logging mode we choose) | No (the Debezium-style output is not what Flink needs, e.g.)
      Changelog | Yes | low | low | Yes

      This proposal is to converge on "CDC" as the path going forward, with the following changes incorporated to support existing users/usage of changelog. The CDC format is more generalized in the database world. It offers advantages such as not requiring further downstream processing to, say, stitch together +U and -U to update a downstream table. For example, if a field that changed is a key in a downstream table, we need both +U and -U to compute the updates.

       

      (A) Introduce a new "changelog" output mode for CDC queries, which generates the I, +U, -U, D records that changelog consumers need (this can be constructed easily by processing the output of a CDC query as follows; see the sketch after the list)

      • when before is `null`, emit I
      • when after is `null`, emit D
      • when both are non-null, emit two records +U and -U
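
      A minimal sketch of that post-processing step, assuming a simplified view of a CDC record as a before/after pair; the CdcRecord and ChangelogRow types here are hypothetical placeholders, not existing Hudi classes:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class CdcToChangelog {

  // Hypothetical view of one RFC-51 CDC record: a 'before' and 'after' image.
  public record CdcRecord(Object before, Object after) {}

  // Hypothetical changelog row: an operation flag plus the row image it refers to.
  public record ChangelogRow(String op, Object row) {}

  public static List<ChangelogRow> toChangelog(CdcRecord cdc) {
    List<ChangelogRow> out = new ArrayList<>();
    if (cdc.before() == null) {
      // No before image -> the record was inserted.
      out.add(new ChangelogRow("I", cdc.after()));
    } else if (cdc.after() == null) {
      // No after image -> the record was deleted.
      out.add(new ChangelogRow("D", cdc.before()));
    } else {
      // Both images present -> emit the retraction first, then the new value.
      out.add(new ChangelogRow("-U", cdc.before()));
      out.add(new ChangelogRow("+U", cdc.after()));
    }
    return out;
  }
}
{code}

      In Flink terms, I, -U, +U and D correspond to RowKind.INSERT, UPDATE_BEFORE, UPDATE_AFTER and DELETE.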

      (B) New writes in 1.0 will ONLY produce the .cdc change log format, and stop publishing the _hoodie_operation field (see the write sketch after the list below).

      1. This means anyone querying this field using a snapshot query will break.
      2. We will bring this back in 1.1 etc., based on user feedback, as a hidden field in the FlinkCatalog.
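
      For illustration, a minimal Spark write sketch with CDC logging enabled; the table name, key/ordering fields and paths are hypothetical, and it assumes the hudi-spark bundle is on the classpath:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CdcWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-cdc-write")
        .master("local[2]")
        .getOrCreate();

    Dataset<Row> df = spark.read().json("/tmp/input");  // hypothetical input path

    df.write().format("hudi")
        .option("hoodie.table.name", "cdc_demo")                      // hypothetical table name
        .option("hoodie.datasource.write.recordkey.field", "uuid")    // hypothetical record key
        .option("hoodie.datasource.write.precombine.field", "ts")     // hypothetical ordering field
        .option("hoodie.table.cdc.enabled", "true")                   // persist changes as .cdc log files
        .mode(SaveMode.Append)
        .save("/tmp/hudi/cdc_demo");
  }
}
{code}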

      (C) To support backwards compatibility, we fall back to reading `_hoodie_operation` in 0.X tables.

      For CDC reads, we first use the CDC log if it is available for that file slice. If not, and the base file schema already has _hoodie_operation, we fall back to reading _hoodie_operation from the base file if mode=OP_KEY_ONLY. Throw an error for other modes.
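
      A rough sketch of that fallback decision, assuming simplified FileSlice/CdcReadMode placeholders rather than the actual Hudi reader API:

{code:java}
public class CdcReadPlanner {

  // Illustrative stand-in for the supplemental logging / read modes.
  enum CdcReadMode { OP_KEY_ONLY, DATA_BEFORE, DATA_BEFORE_AFTER }

  // Illustrative stand-in for the per-file-slice metadata the reader would consult.
  interface FileSlice {
    boolean hasCdcLog();                  // is a .cdc log present for this slice?
    boolean baseFileHasOperationField();  // does the base file schema contain _hoodie_operation?
  }

  static String planCdcRead(FileSlice slice, CdcReadMode mode) {
    if (slice.hasCdcLog()) {
      // Preferred path: read the changes from the .cdc log.
      return "read .cdc log";
    }
    if (slice.baseFileHasOperationField() && mode == CdcReadMode.OP_KEY_ONLY) {
      // 0.x fallback: derive the op from the persisted _hoodie_operation column.
      return "read _hoodie_operation from base file";
    }
    throw new IllegalStateException("CDC read not supported without a .cdc log for mode " + mode);
  }
}
{code}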

      (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that have `_hoodie_operation` published.

      This is already completed for Spark, so the others should be easy to do.
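
      For reference, a minimal Spark snapshot read that selects the field; the table path is hypothetical, and _hoodie_operation only appears if the writer published it:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-snapshot-read")
        .master("local[2]")
        .getOrCreate();

    // Plain snapshot read of the table written above (hypothetical path).
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi/cdc_demo");
    df.select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_operation").show(false);
  }
}
{code}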

       

      (E) We need to complete a review of the CDC schema

      ts - should it be the completion time or the instant time?
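
      For context, the Debezium-style shape of a CDC row that this review would cover, sketched below; the exact field names and types are an assumption here, not the finalized schema:

{code:java}
// Rough sketch of an RFC-51 style CDC row (not the finalized Hudi schema).
public record HudiCdcRow(
    String op,      // change type: I / U / D
    String ts,      // the open question above: completion time vs. instant time
    String before,  // JSON image of the row before the change (null for inserts)
    String after) { // JSON image of the row after the change (null for deletes)
}
{code}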

       

       

          People

            Assignee: Vinoth Chandar
            Reporter: Vinoth Chandar
            Danny Chen, Ethan Guo