HUDI-7538: Consolidate the CDC Formats (changelog format, RFC-51)


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: storage-management
    • Labels: None

    Description

      For the sake of more consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW for Spark/Flink).

       

      Format Name | CDC Source Required | Resource Cost (writer) | Resource Cost (reader) | Friendly to Streaming
      CDC | No | low/high | low/high (based on the logging mode we choose) | No (the Debezium-style output is not what Flink needs, e.g.)
      Changelog | Yes | low | low | Yes

      This proposal is to converge on "CDC" as the path going forward, with the following changes incorporated to support existing users/usage of changelog. The CDC format is more generalized in the database world. It offers advantages such as not requiring further downstream processing to, say, stitch together +U and -U to update a downstream table. For example, if a field that changed is a key in a downstream table, we need both +U and -U to compute the updates.

       

      (A) Introduce a new "changelog" output mode for CDC queries, which generates the I, +U, -U, D records that changelog consumers need (this can be constructed easily by processing the output of a CDC query as follows; see the sketch after the list)

      • when before is `null`, emit I
      • when after is `null`, emit D
      • when both are non-null, emit two records +U and -U
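
      A minimal sketch of that post-processing step, assuming a simplified view of a CDC record as a before/after pair; the CdcRecord and ChangelogRow types here are hypothetical placeholders, not existing Hudi classes:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class CdcToChangelog {

  // Hypothetical view of one RFC-51 CDC record: a 'before' and 'after' image.
  public record CdcRecord(Object before, Object after) {}

  // Hypothetical changelog row: an operation flag plus the row image it refers to.
  public record ChangelogRow(String op, Object row) {}

  public static List<ChangelogRow> toChangelog(CdcRecord cdc) {
    List<ChangelogRow> out = new ArrayList<>();
    if (cdc.before() == null) {
      // No before image -> the record was inserted.
      out.add(new ChangelogRow("I", cdc.after()));
    } else if (cdc.after() == null) {
      // No after image -> the record was deleted.
      out.add(new ChangelogRow("D", cdc.before()));
    } else {
      // Both images present -> emit the retraction first, then the new value.
      out.add(new ChangelogRow("-U", cdc.before()));
      out.add(new ChangelogRow("+U", cdc.after()));
    }
    return out;
  }
}
{code}

      In Flink terms, I, -U, +U and D correspond to RowKind.INSERT, UPDATE_BEFORE, UPDATE_AFTER and DELETE.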

      (B) New writes in 1.0 will ONLY produce the .cdc change log format, and stop publishing the _hoodie_operation field (see the write sketch after the list below).

      1. This means anyone querying this field using a snapshot query will break.
      2. We will bring this back in 1.1 etc., based on user feedback, as a hidden field in the FlinkCatalog.
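
      For illustration, a minimal Spark write sketch with CDC logging enabled; the table name, key/ordering fields and paths are hypothetical, and it assumes the hudi-spark bundle is on the classpath:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CdcWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-cdc-write")
        .master("local[2]")
        .getOrCreate();

    Dataset<Row> df = spark.read().json("/tmp/input");  // hypothetical input path

    df.write().format("hudi")
        .option("hoodie.table.name", "cdc_demo")                      // hypothetical table name
        .option("hoodie.datasource.write.recordkey.field", "uuid")    // hypothetical record key
        .option("hoodie.datasource.write.precombine.field", "ts")     // hypothetical ordering field
        .option("hoodie.table.cdc.enabled", "true")                   // persist changes as .cdc log files
        .mode(SaveMode.Append)
        .save("/tmp/hudi/cdc_demo");
  }
}
{code}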

      (C) To support backwards compatibility, we fall back to reading `_hoodie_operation` in 0.X tables.

      For CDC reads, we first use the CDC log if it is available for that file slice. If not, and the base file schema already has _hoodie_operation, we fall back to reading _hoodie_operation from the base file if mode=OP_KEY_ONLY. Throw an error for other modes.
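
      A rough sketch of that fallback decision, assuming simplified FileSlice/CdcReadMode placeholders rather than the actual Hudi reader API:

{code:java}
public class CdcReadPlanner {

  // Illustrative stand-in for the supplemental logging / read modes.
  enum CdcReadMode { OP_KEY_ONLY, DATA_BEFORE, DATA_BEFORE_AFTER }

  // Illustrative stand-in for the per-file-slice metadata the reader would consult.
  interface FileSlice {
    boolean hasCdcLog();                  // is a .cdc log present for this slice?
    boolean baseFileHasOperationField();  // does the base file schema contain _hoodie_operation?
  }

  static String planCdcRead(FileSlice slice, CdcReadMode mode) {
    if (slice.hasCdcLog()) {
      // Preferred path: read the changes from the .cdc log.
      return "read .cdc log";
    }
    if (slice.baseFileHasOperationField() && mode == CdcReadMode.OP_KEY_ONLY) {
      // 0.x fallback: derive the op from the persisted _hoodie_operation column.
      return "read _hoodie_operation from base file";
    }
    throw new IllegalStateException("CDC read not supported without a .cdc log for mode " + mode);
  }
}
{code}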

      (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that have `_hoodie_operation` published.

      This is already completed for Spark, so the others should be easy to do.
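
      For reference, a minimal Spark snapshot read that selects the field; the table path is hypothetical, and _hoodie_operation only appears if the writer published it:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-snapshot-read")
        .master("local[2]")
        .getOrCreate();

    // Plain snapshot read of the table written above (hypothetical path).
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi/cdc_demo");
    df.select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_operation").show(false);
  }
}
{code}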

       

      (E) We need to complete a review of the CDC schema

      ts - should it be the completion time or the instant time?
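
      For context, the Debezium-style shape of a CDC row that this review would cover, sketched below; the exact field names and types are an assumption here, not the finalized schema:

{code:java}
// Rough sketch of an RFC-51 style CDC row (not the finalized Hudi schema).
public record HudiCdcRow(
    String op,      // change type: I / U / D
    String ts,      // the open question above: completion time vs. instant time
    String before,  // JSON image of the row before the change (null for inserts)
    String after) { // JSON image of the row after the change (null for deletes)
}
{code}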

       

       

          People

            Assignee: Vinoth Chandar
            Reporter: Vinoth Chandar
            Danny Chen, Ethan Guo