Apache Hudi / HUDI-5095

Flink: Store a special watermark (flag) to identify the current progress of data writing

Details

    Description

      In some cases we need a flag to measure the progress of data writing. I think a reasonable approach is to store the watermark as an attribute of the Hudi commit metadata.

      In one of our scenarios, Flink writes data to a Hudi table in real time, and we then use this Hudi table to support batch computation, so we need a flag to evaluate whether a partition's data is complete.

      For example, job1 is scheduled every hour. At 2022-01-19 02:01:00, job1 starts checking whether partition 20220119/01 of hudi_table1 is complete (Flink writes data to hudi_table1 in real time). Once the watermark property in hudi_table1's commit metadata is later than 2022-01-19 02:05:00 (allowing for 5 minutes of out-of-order data), we consider partition 20220119/01 complete and can safely run Hive or Flink SQL for batch computation (basically insert table2 select xx from hudi_table1...).
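
      The completeness check described above could look roughly like the following sketch. It is only an illustration under stated assumptions: the metadata key "watermark", the watermark string format "yyyy-MM-dd HH:mm:ss", and the class and method names are hypothetical and not part of Hudi. A plain Map stands in for the attributes read from the latest commit metadata.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;

// Hypothetical helper, not a Hudi class: decides whether an hourly partition
// can be treated as complete, given the watermark stored with the latest commit.
final class PartitionCompleteness {
    // Assumed attribute name; the issue does not specify one.
    static final String WATERMARK_KEY = "watermark";

    // Assumed watermark encoding, matching the timestamps used in the example.
    private static final DateTimeFormatter WATERMARK_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private PartitionCompleteness() {}

    // "20220119/01" covers 01:00:00 to 02:00:00, so its end is 2022-01-19T02:00.
    static LocalDateTime partitionEnd(String partition) {
        String[] parts = partition.split("/");
        LocalDate day = LocalDate.parse(parts[0], DateTimeFormatter.BASIC_ISO_DATE);
        int hour = Integer.parseInt(parts[1]);
        return day.atTime(hour, 0).plusHours(1);
    }

    // Complete once the watermark is strictly later than
    // partition end + allowed out-of-order delay.
    static boolean isComplete(Map<String, String> commitAttributes,
                              String partition,
                              Duration allowedLateness) {
        String raw = commitAttributes.get(WATERMARK_KEY);
        if (raw == null) {
            return false; // no watermark recorded yet, so keep waiting
        }
        LocalDateTime watermark = LocalDateTime.parse(raw, WATERMARK_FORMAT);
        LocalDateTime threshold = partitionEnd(partition).plus(allowedLateness);
        return watermark.isAfter(threshold);
    }
}
```

      In the example, job1 would call isComplete with partition "20220119/01" and a 5-minute delay on each check, and start the batch SQL only after it returns true.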

            People

              MengYue yuemeng
              x1q1j1 Forward Xu