[FLINK-22088] CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.12.2, 1.13.0
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:
- auto-unassigned
- stale-assigned

Description

Currently, when a job fails over, CheckpointCoordinatorDeActivator#jobStatusChanges calls stopCheckpointScheduler, which tries to cancel all pending checkpoints and also sets periodicScheduling to false.

If a checkpoint has just started triggering at this moment, it may have acquired all the executions to trigger before the failover and then go on triggering. Ideally, it would be aborted in createPendingCheckpoint -> preCheckGlobalState. However, because the check and the creation of the pending checkpoint happen in two different lock scopes, CheckpointCoordinator#stopCheckpointScheduler can run between the two scopes.
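The interleaving can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not Flink's actual CheckpointCoordinator: the class TwoLockScopeRace, the schedulerStopped flag, the recheck parameter, and the simulate helper are all assumptions made for illustration. Only the names preCheckGlobalState, createPendingCheckpoint, and stopCheckpointScheduler are borrowed from the description above. The failover is placed between the two lock scopes deterministically, so no threads are needed to reproduce the effect.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the race; NOT Flink's real CheckpointCoordinator. */
final class TwoLockScopeRace {
    private final Object lock = new Object();
    // Assumed stand-in for the state that stopCheckpointScheduler resets.
    private boolean schedulerStopped = false;
    private final List<Long> pendingCheckpoints = new ArrayList<>();

    /** First lock scope: the global pre-check. */
    boolean preCheckGlobalState() {
        synchronized (lock) {
            return !schedulerStopped;
        }
    }

    /** Second lock scope: registers the pending checkpoint, optionally re-checking first. */
    boolean createPendingCheckpoint(long checkpointId, boolean recheck) {
        synchronized (lock) {
            if (recheck && schedulerStopped) {
                return false; // aborted: the scheduler was stopped in the meantime
            }
            pendingCheckpoints.add(checkpointId);
            return true;
        }
    }

    /** Models the failover path: cancel all pending checkpoints and stop scheduling. */
    void stopCheckpointScheduler() {
        synchronized (lock) {
            schedulerStopped = true;
            pendingCheckpoints.clear();
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }

    /** Replays the problematic order: pre-check, then failover, then pending creation. */
    static int simulate(boolean recheck) {
        TwoLockScopeRace coordinator = new TwoLockScopeRace();
        if (!coordinator.preCheckGlobalState()) {
            return coordinator.pendingCount();
        }
        coordinator.stopCheckpointScheduler(); // failover lands between the two scopes
        coordinator.createPendingCheckpoint(1L, recheck);
        return coordinator.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("pending without re-check: " + simulate(false)); // prints 1
        System.out.println("pending with re-check: " + simulate(true));     // prints 0
    }
}
```

In this model, a pending checkpoint survives the failover unless the state is re-checked inside the second lock scope. Whether a re-check of this shape fits the real coordinator is a separate question; the sketch only shows why two separate scopes leave a window.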

We could tighten this check. However, since the executions would eventually fail to trigger the checkpoint anyway, this should not affect the correctness of the job. Besides, even with a tighter check, there could still be cases where triggering on an execution fails because of a concurrent failover.

Issue Links

relates to: FLINK-22003 UnalignedCheckpointITCase fail (Resolved)

People

Assignee: Unassigned
Reporter: Yun Gao
Votes: 0
Watchers: 4

Dates

Created: 01/Apr/21 10:05
Updated: 01/Nov/21 10:58