Details
Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Description
For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW for Spark/Flink).
Format Name | CDC Source Required | Resource Cost (writer) | Resource Cost (reader) | Friendly to Streaming |
CDC | No | low/high | low/high (based on the logging mode we choose) | No (the Debezium-style output is not what Flink needs, for example) |
Changelog | Yes | low | low | Yes |
This proposal is to converge onto "CDC" as the path going forward, with the following changes to be incorporated to support existing users/usage of changelog. The CDC format is more generalized in the database world. It offers advantages like not requiring further downstream processing to, say, stitch together +U and -U in order to update a downstream table. For example, if a field that changed is a key in a downstream table, we need both +U and -U to compute the updates.
(A) Introduce a new "changelog" output mode for CDC queries, which generates the I, +U, -U, D format that changelog needs. This can be constructed easily by processing the output of a CDC query as follows:
- when before is `null`, emit I
- when after is `null`, emit D
- when both are non-null, emit two records +U and -U
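The three rules in (A) can be sketched as a small mapping function. This is only an illustration, not Hudi code: the function name `cdc_to_changelog`, the dict-based row representation, and the choice to emit -U (carrying the before image) ahead of +U (carrying the after image) are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a row image is a plain dict, and a changelog
# record is an (operation, row) pair. Neither is an actual Hudi type.
ChangelogRecord = Tuple[str, dict]


def cdc_to_changelog(before: Optional[dict], after: Optional[dict]) -> List[ChangelogRecord]:
    """Map one CDC record (before/after images) to changelog-style records."""
    if before is None and after is None:
        # Not covered by the proposal; treated as invalid input in this sketch.
        raise ValueError("CDC record has neither a before nor an after image")
    if before is None:
        # No before image: the row was inserted.
        return [("I", after)]
    if after is None:
        # No after image: the row was deleted.
        return [("D", before)]
    # Both images present: an update becomes two records.
    # Assumption: -U carries the old image and is emitted first, +U the new one.
    return [("-U", before), ("+U", after)]
```

For instance, `cdc_to_changelog({"id": 1, "v": "a"}, {"id": 1, "v": "b"})` yields one -U record with the old row and one +U record with the new row, which is what a downstream consumer keyed on `v` would need.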
(B) New writes in 1.0 will ONLY produce the .cdc changelog format and stop publishing to the `_hoodie_operation` field.
- This means anyone querying this field using a snapshot query will break.
- We will bring this back in 1.1 or later, based on user feedback, as a hidden field in the FlinkCatalog.
(C) To support backwards compatibility, we fall back to reading `_hoodie_operation` in 0.X tables.
For CDC reads, we first use the CDC log if it is available for that file slice. If not, and the base file schema already has `_hoodie_operation`, we fall back to reading `_hoodie_operation` from the base file if mode=OP_KEY_ONLY, and throw an error for other modes.
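The read-path decision in (C) could look roughly like the sketch below. It is a hedged illustration only: `choose_cdc_read_path`, its boolean parameters, and the returned labels are hypothetical names, and the behavior when neither a CDC log nor `_hoodie_operation` exists is an assumption, since the proposal does not cover that case.

```python
def choose_cdc_read_path(has_cdc_log: bool, base_has_operation_field: bool, mode: str) -> str:
    """Pick where a CDC read gets its change information for one file slice."""
    if has_cdc_log:
        # Preferred path: the CDC log written for this file slice.
        return "cdc_log"
    if base_has_operation_field:
        if mode == "OP_KEY_ONLY":
            # 0.X fallback: derive the operation from _hoodie_operation in the base file.
            return "hoodie_operation"
        # Other modes need data that _hoodie_operation alone cannot supply.
        raise ValueError(f"cannot fall back to _hoodie_operation for mode={mode}")
    # Assumption: with neither source available, the read fails.
    raise ValueError("no CDC log and no _hoodie_operation field for this file slice")
```

Keeping the decision in one place per file slice means a table with a mix of 0.X and 1.0 file slices can be read without the caller knowing which version wrote each slice.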
(D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that have `_hoodie_operation` published.
This is already completed for Spark, so the others should be easy to do.
(E) We need to complete a review of the CDC schema
ts - should be completion time or instant time?