Flink / FLINK-29589

Data Loss in Sink GlobalCommitter during Task Manager recovery

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0

    Description

      Flink's Sink architecture with a global committer appears to be vulnerable to data loss during Task Manager recovery. An entire checkpoint can be lost by the GlobalCommitter, resulting in data loss.

      The issue was observed in the Delta Sink connector on a real 1.14.x cluster and was reproduced using the test utility classes from Flink 1.14.6.

      Scenario:

      1. A streaming source emits a constant number of events per checkpoint (20 events per commit for 5 commits in total, which gives 100 records).
      2. The sink has parallelism > 1 and uses both Committer and GlobalCommitter elements.
      3. The committers process committables for checkpointId 2.
      4. The GlobalCommitter throws an exception (injected deliberately) during checkpointId 2 (the third commit) while processing data from checkpoint 1. The global committer is expected to lag one commit behind the rest of the pipeline.
      5. The Task Manager recovers and the source resumes sending data.
      6. The streaming source ends.
      7. 20 records (one checkpoint) are missing.
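
      The record accounting in the scenario above can be sketched as follows. This is only an illustration: `CheckpointAccounting` and its methods are hypothetical names that are not part of Flink or of the original test, and the sketch assumes checkpoint ids 0 to 4 (as implied by "checkpointId 2 (third commit)") and that checkpoint 2 is the one that goes missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Flink or the original test.
final class CheckpointAccounting {
    static final int RECORDS_PER_CHECKPOINT = 20; // from the scenario
    static final int TOTAL_CHECKPOINTS = 5;       // from the scenario

    /** Returns the checkpoint ids in [0, TOTAL_CHECKPOINTS) that were never globally committed. */
    static List<Integer> missingCheckpoints(Set<Integer> globallyCommitted) {
        List<Integer> missing = new ArrayList<>();
        for (int id = 0; id < TOTAL_CHECKPOINTS; id++) {
            if (!globallyCommitted.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    /** Returns the number of records lost, given the globally committed checkpoint ids. */
    static int missingRecords(Set<Integer> globallyCommitted) {
        return missingCheckpoints(globallyCommitted).size() * RECORDS_PER_CHECKPOINT;
    }

    public static void main(String[] args) {
        // Assumption: checkpoint 2 never reached the global committer.
        Set<Integer> committed = new HashSet<>(Arrays.asList(0, 1, 3, 4));
        System.out.println("missing checkpoints: " + missingCheckpoints(committed));
        System.out.println("missing records: " + missingRecords(committed));
    }
}
```

      A check of this kind, comparing committed checkpoints against the expected range, makes it easy to see that exactly one checkpoint's worth of records is absent rather than a random subset.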

      What happens is that during recovery the committers perform a "retry" on the committables for checkpointId 2. However, the committables reprocessed by the "retry" task are not emitted downstream to the global committer.

      The issue can be reproduced with a JUnit test built with Flink's TestSink.
      The test was implemented here and is based on other tests from the `SinkITCase.java` class.
      The test reproduces the issue in more than 90% of runs.

      I believe the problem is somewhere around the SinkOperator::notifyCheckpointComplete method. There, a retry async task is scheduled, but its result is never emitted downstream the way it is for the regular flow one line above.
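
      The suspected behavior can be illustrated with a toy model. This is a sketch under stated assumptions, not Flink's actual code: `ToySinkOperator`, `RetryDemo`, and all their members are hypothetical, the retry runs synchronously instead of as an async task, and every commit is assumed to succeed. The `emitRetried` flag switches between the behavior described in this report (retried committables are dropped) and the behavior one would expect (they are forwarded to the global committer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical classes that do not mirror Flink's real SinkOperator.
final class ToySinkOperator {
    private final boolean emitRetried;
    private final List<String> pending = new ArrayList<>();    // committables from the regular flow
    private final List<String> retryQueue = new ArrayList<>(); // committables restored after recovery
    final List<String> downstream = new ArrayList<>();         // what the global committer receives

    ToySinkOperator(boolean emitRetried) {
        this.emitRetried = emitRetried;
    }

    void addCommittable(String committable) {
        pending.add(committable);
    }

    /** Simulates state restore: these committables must be committed again. */
    void restoreForRetry(List<String> committables) {
        retryQueue.addAll(committables);
    }

    void notifyCheckpointComplete() {
        // Regular flow: commit, then emit the result downstream.
        downstream.addAll(commit(pending));
        pending.clear();

        // Retry flow: commit again; the result is forwarded only if emitRetried is set.
        List<String> retried = commit(retryQueue);
        retryQueue.clear();
        if (emitRetried) {
            downstream.addAll(retried);
        }
    }

    private static List<String> commit(List<String> committables) {
        return new ArrayList<>(committables); // assumption: every commit succeeds
    }
}

final class RetryDemo {
    /** Returns how many committables reach the global committer after one recovery. */
    static int globallyVisible(boolean emitRetried) {
        ToySinkOperator op = new ToySinkOperator(emitRetried);
        op.restoreForRetry(Arrays.asList("cp2-subtask0", "cp2-subtask1"));
        op.addCommittable("cp3-subtask0");
        op.addCommittable("cp3-subtask1");
        op.notifyCheckpointComplete();
        return op.downstream.size();
    }

    public static void main(String[] args) {
        System.out.println("retry result dropped:   " + globallyVisible(false)); // 2
        System.out.println("retry result forwarded: " + globallyVisible(true));  // 4
    }
}
```

      In this model the two committables for checkpoint 2 are committed by the retry but never become visible to the global committer unless the retry result is emitted, which matches the missing checkpoint observed in the test.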

          People

            Assignee: Unassigned
            Reporter: Krzysztof Chmielewski (KristoffSC)