
Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager


    Details

    • Release Note:
      FLINK-8871 aborts checkpoints more eagerly and adds a new method, CheckpointListener#notifyCheckpointAborted(long checkpointId).
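
      As a rough illustration of how user code might react to the new callback, the sketch below discards resources staged for a checkpoint once that checkpoint is reported as aborted. The CheckpointListener interface here is a local stand-in so the example compiles without Flink on the classpath; it is not Flink's actual interface. StagingListener, stage, stagedCount and committedCount are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AbortAwareListenerSketch {

    /** Local stand-in for the listener interface; an assumption, not Flink's real type. */
    interface CheckpointListener {
        void notifyCheckpointComplete(long checkpointId);

        /** Callback named in the release note; modelled here as a no-op default. */
        default void notifyCheckpointAborted(long checkpointId) {}
    }

    /** Hypothetical listener that stages one resource per checkpoint. */
    static class StagingListener implements CheckpointListener {
        private final Map<Long, String> staged = new HashMap<>();
        private final List<String> committed = new ArrayList<>();

        /** Remembers a resource that belongs to a pending checkpoint. */
        void stage(long checkpointId, String resource) {
            staged.put(checkpointId, resource);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // The checkpoint succeeded: promote its staged resource.
            String resource = staged.remove(checkpointId);
            if (resource != null) {
                committed.add(resource);
            }
        }

        @Override
        public void notifyCheckpointAborted(long checkpointId) {
            // The checkpoint was aborted: drop its staged resource right away
            // instead of keeping it until some later checkpoint completes.
            staged.remove(checkpointId);
        }

        int stagedCount() {
            return staged.size();
        }

        int committedCount() {
            return committed.size();
        }
    }
}
```

      Without an abort notification, a listener like this could only clean up when a later checkpoint completes, which is the delay the more eager abort is meant to avoid.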

      Description

      Flink currently has no feedback mechanism from the job manager / checkpoint coordinator to the tasks for failing a checkpoint. As a result, snapshots running on the tasks are not stopped even when their owning checkpoint has already been cancelled. Two cases where this applies are checkpoint timeouts, and local checkpoint failures on a task combined with a configuration that does not fail tasks on checkpoint failure. Note that these running snapshots no longer count toward the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.

      Not stopping the task's snapshot thread can lead to a problematic situation in which the next checkpoint has already started while the abandoned thread from a previous checkpoint is still running. This can cascade: many parallel checkpoints slow down checkpointing and make timeouts even more likely.

      A possible solution is to introduce a cancelCheckpoint method as the counterpart to the triggerCheckpoint method in the task manager gateway, invoked by the checkpoint coordinator as part of cancelling the checkpoint.
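
      The proposal can be sketched roughly as follows. This is a minimal, self-contained model and not Flink's actual gateway code: TaskGateway, its thread pool and the cancelStopsSnapshot helper are hypothetical, and only the method names triggerCheckpoint and cancelCheckpoint come from the description above. It assumes the snapshot work responds to thread interruption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class CancelCheckpointSketch {

    /** Hypothetical task-side gateway that tracks one snapshot per checkpoint id. */
    static class TaskGateway {
        private final ExecutorService snapshotThreads = Executors.newCachedThreadPool();
        private final Map<Long, Future<?>> runningSnapshots = new ConcurrentHashMap<>();

        /** Starts an asynchronous snapshot for the given checkpoint. */
        void triggerCheckpoint(long checkpointId, Runnable snapshotWork) {
            runningSnapshots.put(checkpointId, snapshotThreads.submit(snapshotWork));
        }

        /**
         * Proposed counterpart: interrupts the snapshot of a cancelled checkpoint.
         * Returns true only if a snapshot was tracked and could still be cancelled.
         */
        boolean cancelCheckpoint(long checkpointId) {
            Future<?> snapshot = runningSnapshots.remove(checkpointId);
            return snapshot != null && snapshot.cancel(true);
        }

        void shutdown() {
            snapshotThreads.shutdownNow();
        }
    }

    /** Returns true if cancelling checkpoint 1 interrupted its snapshot thread. */
    static boolean cancelStopsSnapshot() {
        TaskGateway gateway = new TaskGateway();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            gateway.triggerCheckpoint(1L, () -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for slow snapshot I/O
                } catch (InterruptedException e) {
                    interrupted.countDown(); // the snapshot thread saw the cancel
                }
            });
            // Wait until the snapshot is running, then cancel it as the
            // coordinator would after a timeout or a declined checkpoint.
            return started.await(5, TimeUnit.SECONDS)
                    && gateway.cancelCheckpoint(1L)
                    && interrupted.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            gateway.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("snapshot stopped: " + cancelStopsSnapshot());
    }
}
```

      Keying the running snapshots by checkpoint id means a cancel call only touches the abandoned snapshot and leaves snapshots of newer checkpoints running.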

              People

              • Assignee:
                yunta Yun Tang
                Reporter:
                srichter Stefan Richter
              • Votes:
                1
                Watchers:
                22

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m