Flink / FLINK-5667

Possible state data loss when task fails while checkpointing

      Description

      Flink can lose state data when a Task fails while a checkpoint is being drawn. The scenario is as follows:

      Flink has finished the synchronous part of the checkpoint and starts the asynchronous part by creating an AsyncCheckpointRunnable and submitting it to an Executor. This runnable is also registered at the closeable registry. If the Task fails before the AsyncCheckpointRunnable has completed, the runnable is closed because it is registered at the closeable registry. Closing it discards all state handles and cancels all runnable state futures. However, closing does not stop the runnable from sending an acknowledgement message to the CheckpointCoordinator.
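
      The sequence above can be illustrated with a minimal sketch. It is a simplified model and not Flink's code: StateHandle, Coordinator, AsyncRunnable and simulate() are hypothetical stand-ins, and the concurrent interleaving is replayed sequentially (close first, then run) so that the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the scenario; these are not Flink's classes.
final class AsyncCheckpointRace {

    /** Stand-in for a state handle produced by the synchronous checkpoint part. */
    static final class StateHandle {
        private boolean discarded;

        void discard() { discarded = true; }

        boolean isDiscarded() { return discarded; }
    }

    /** Stand-in for the CheckpointCoordinator; it only records incoming acknowledgements. */
    static final class Coordinator {
        final List<String> acks = new ArrayList<>();

        void acknowledge(long checkpointId, StateHandle handle) {
            String state = handle.isDiscarded() ? "state discarded" : "state intact";
            acks.add("checkpoint " + checkpointId + ": " + state);
        }
    }

    /** Stand-in for the asynchronous runnable that is registered at the closeable registry. */
    static final class AsyncRunnable implements Runnable, AutoCloseable {
        private final long checkpointId;
        private final StateHandle handle;
        private final Coordinator coordinator;

        AsyncRunnable(long checkpointId, StateHandle handle, Coordinator coordinator) {
            this.checkpointId = checkpointId;
            this.handle = handle;
            this.coordinator = coordinator;
        }

        @Override
        public void close() {
            // Closing discards the state but leaves no trace that run() would check.
            handle.discard();
        }

        @Override
        public void run() {
            // No check for a preceding close(): the acknowledgement is sent regardless.
            coordinator.acknowledge(checkpointId, handle);
        }
    }

    /** Replays the problematic order: the task fails (close) before the runnable finishes (run). */
    static List<String> simulate() {
        Coordinator coordinator = new Coordinator();
        AsyncRunnable runnable = new AsyncRunnable(1L, new StateHandle(), coordinator);
        runnable.close(); // task failure closes everything in the registry
        runnable.run();   // the already submitted runnable still completes
        return coordinator.acks;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

      In this model the coordinator receives an acknowledgement for checkpoint 1 even though its state handle has already been discarded.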

      If this message completes the pending checkpoint, the checkpoint is turned into a CompletedCheckpoint that is faulty, because some of its data has already been deleted. Depending on Flink's configuration, completing it causes older completed checkpoints to be discarded, which results in state data loss.
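
      Why a single faulty checkpoint can lead to data loss can be sketched with a small retention model. This is an illustration under assumptions, not Flink's implementation: Store, CompletedCheckpoint and the maxRetained limit are hypothetical names, and the store simply discards the oldest entries once the limit is exceeded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical retention model; the class and field names are not taken from Flink.
final class RetainedCheckpoints {

    /** A completed checkpoint; stateIntact is false if its data was deleted before completion. */
    static final class CompletedCheckpoint {
        final long id;
        final boolean stateIntact;
        boolean discarded;

        CompletedCheckpoint(long id, boolean stateIntact) {
            this.id = id;
            this.stateIntact = stateIntact;
        }

        boolean isRestorable() { return stateIntact && !discarded; }
    }

    /** Keeps at most maxRetained completed checkpoints and discards older ones. */
    static final class Store {
        private final int maxRetained;
        private final Deque<CompletedCheckpoint> retained = new ArrayDeque<>();

        Store(int maxRetained) { this.maxRetained = maxRetained; }

        void add(CompletedCheckpoint checkpoint) {
            retained.addLast(checkpoint);
            while (retained.size() > maxRetained) {
                // The oldest checkpoint is dropped without checking the newcomer's health.
                retained.removeFirst().discarded = true;
            }
        }

        boolean hasRestorableCheckpoint() {
            return retained.stream().anyMatch(CompletedCheckpoint::isRestorable);
        }
    }

    /** Adds a healthy checkpoint followed by a faulty one and reports whether recovery is possible. */
    static boolean restorableAfterFaultyCompletion(int maxRetained) {
        Store store = new Store(maxRetained);
        store.add(new CompletedCheckpoint(1L, true));  // healthy checkpoint
        store.add(new CompletedCheckpoint(2L, false)); // faulty checkpoint from the failed task
        return store.hasRestorableCheckpoint();
    }

    public static void main(String[] args) {
        System.out.println("retain 1: restorable = " + restorableAfterFaultyCompletion(1));
        System.out.println("retain 2: restorable = " + restorableAfterFaultyCompletion(2));
    }
}
```

      With a limit of one retained checkpoint, the faulty checkpoint displaces the last healthy one and nothing restorable remains; with a limit of two, the healthy checkpoint survives in this model.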

            People

            • Assignee: Till Rohrmann (till.rohrmann)
            • Reporter: Till Rohrmann (till.rohrmann)