[FLINK-8871] Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Implemented
Affects Version/s: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
Fix Version/s: 1.11.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Release Note:

FLINK-8871 aborts checkpoints more eagerly and adds a new method, notifyCheckpointAborted(long checkpointId), to the CheckpointListener interface.
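To illustrate how an operator might react to the new callback, here is a minimal sketch. It does not depend on Flink: `CheckpointListenerSketch` is a local stand-in that only mirrors the two callback names, and giving the abort callback an empty default body is an assumption made here. `StagedFileTracker` and all of its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-in that mirrors the callback names from the release note.
// Assumption: in a real job you would implement Flink's CheckpointListener instead.
interface CheckpointListenerSketch {
    void notifyCheckpointComplete(long checkpointId) throws Exception;

    // Assumed to be optional for implementers, hence the empty default body.
    default void notifyCheckpointAborted(long checkpointId) throws Exception {}
}

// Hypothetical helper: stages temporary files per checkpoint and either
// commits or discards them once the outcome of that checkpoint is known.
class StagedFileTracker implements CheckpointListenerSketch {
    private final Map<Long, List<String>> staged = new HashMap<>();
    private final List<String> committed = new ArrayList<>();
    private final List<String> discarded = new ArrayList<>();

    // Remember a file that belongs to a checkpoint still in flight.
    void stage(long checkpointId, String file) {
        staged.computeIfAbsent(checkpointId, id -> new ArrayList<>()).add(file);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        List<String> files = staged.remove(checkpointId);
        if (files != null) {
            committed.addAll(files);
        }
    }

    @Override
    public void notifyCheckpointAborted(long checkpointId) {
        // Without an abort notification these files would linger until restart.
        List<String> files = staged.remove(checkpointId);
        if (files != null) {
            discarded.addAll(files);
        }
    }

    List<String> committed() { return committed; }
    List<String> discarded() { return discarded; }
    int pendingCheckpoints() { return staged.size(); }
}
```

In this sketch an aborted checkpoint releases its staged files immediately, while files of other in-flight checkpoints stay untouched.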

Description

Flink currently has no feedback mechanism from the job manager / checkpoint coordinator to the tasks for failing a checkpoint. As a result, snapshots running on the tasks are not stopped even when their owning checkpoint has already been cancelled. Two cases where this applies are checkpoint timeouts, and local checkpoint failures on a task combined with a configuration that does not fail tasks on checkpoint failure. Note that these running snapshots no longer count toward the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.

Not stopping the task's snapshot thread can lead to a problematic situation in which the next checkpoint has already started while the abandoned snapshot thread from a previous checkpoint is still running. This scenario can cascade: many parallel checkpoints slow down checkpointing and make timeouts even more likely.

A possible solution is to introduce a cancelCheckpoint method as the counterpart to the triggerCheckpoint method in the task manager gateway, invoked by the checkpoint coordinator as part of cancelling the checkpoint.
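The proposal can be sketched roughly as follows. This is not Flink's actual gateway or the implementation that was eventually merged: `CheckpointGatewaySketch`, `TaskSideCheckpointsSketch`, and the method signatures are hypothetical, and the sketch assumes each snapshot runs as an interruptible task on a thread pool owned by the task side.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical gateway: cancelCheckpoint is the counterpart to triggerCheckpoint.
interface CheckpointGatewaySketch {
    void triggerCheckpoint(long checkpointId, Runnable snapshotWork);

    // Returns true if a running snapshot for this checkpoint was cancelled.
    boolean cancelCheckpoint(long checkpointId);
}

// Hypothetical task-side bookkeeping of running snapshot threads.
class TaskSideCheckpointsSketch implements CheckpointGatewaySketch {
    // Daemon threads so an abandoned snapshot never keeps the JVM alive.
    private final ExecutorService snapshotPool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "snapshot-sketch");
        t.setDaemon(true);
        return t;
    });
    private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

    @Override
    public void triggerCheckpoint(long checkpointId, Runnable snapshotWork) {
        FutureTask<Void> task = new FutureTask<>(() -> {
            try {
                snapshotWork.run();
            } finally {
                running.remove(checkpointId); // finished or interrupted
            }
        }, null);
        // Register before starting so a fast cancel always finds the task.
        running.put(checkpointId, task);
        snapshotPool.execute(task);
    }

    @Override
    public boolean cancelCheckpoint(long checkpointId) {
        Future<?> task = running.remove(checkpointId);
        // cancel(true) interrupts the snapshot thread if it has already started.
        return task != null && task.cancel(true);
    }

    int runningSnapshots() { return running.size(); }

    void shutdown() { snapshotPool.shutdownNow(); }
}
```

With this shape, the coordinator would call `cancelCheckpoint(id)` on every task when a checkpoint times out or is declined, so the abandoned snapshot stops competing with the next checkpoint. The snapshot work itself has to respond to interruption for the cancel to take effect.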

Issue Links

blocks
- FLINK-15507 Activate local recovery for RocksDB backends by default (Open)

causes
- FLINK-18238 RemoteChannelThroughputBenchmark deadlocks (Closed)

is duplicated by
- FLINK-13808 Checkpoints expired by timeout may leak RocksDB files (Reopened)
- FLINK-9375 Introduce AbortCheckpoint message from JM to TMs (Closed)
- FLINK-10966 Optimize the release blocking logic in BarrierBuffer (Closed)
- FLINK-12058 Cancel checkpoint operations belonging to a discarded/aborted checkpoint (Closed)

relates to
- FLINK-10966 Optimize the release blocking logic in BarrierBuffer (Closed)

requires
- FLINK-14652 Refactor checkpointing related parts into one place on task side (Closed)

links to
- Checkpoint cancellation design doc
- GitHub Pull Request #8693

People

Assignee: Yun Tang

Reporter: Stefan Richter

Votes: 1

Watchers: 21

Dates

Created: 05/Mar/18 17:02

Updated: 16/Oct/20 10:53

Resolved: 21/May/20 10:02

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m