Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Version: 1.9.0
Description
Because checkpoint decline messages are handled asynchronously in LegacyScheduler#declineCheckpoint, a message can be handled before the job status transition takes effect. In that case receiveDeclineMessage grabs the lock in CheckpointCoordinator before pendingCheckpoints has been cleared by stopCheckpointScheduler (which is triggered by the job status listener CheckpointCoordinatorDeActivator). If the job or its tasks restart quickly enough, the FailJobCallback in CheckpointFailureManager might then unexpectedly fail the job again, as observed in FLINK-13527.
To resolve the issue, we need to add a safeguard when failing the job: pass the ExecutionAttemptID through to the failure callback and check it against the current executions to make sure the to-be-failed execution is still running, so that we do not fail the newly restarted one by accident.
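The proposed safeguard can be illustrated with a minimal, self-contained sketch. It does not use Flink's real classes: `GuardedFailJobSketch`, `startAttempt`, `restart`, and `failJobDueToCheckpoint` are hypothetical names, and a `UUID` stands in for `ExecutionAttemptID`. The only point is that a failure request carrying a stale attempt ID is ignored once a new attempt has replaced it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical stand-in for the scheduler side; not Flink's actual API. */
public class GuardedFailJobSketch {

    private enum State { RUNNING }

    // Attempt IDs of the executions that currently exist (assumed bookkeeping).
    private final Map<UUID, State> currentExecutions = new HashMap<>();
    private int jobFailures = 0;

    /** Registers a new running execution attempt and returns its ID. */
    public synchronized UUID startAttempt() {
        UUID attemptId = UUID.randomUUID();
        currentExecutions.put(attemptId, State.RUNNING);
        return attemptId;
    }

    /** Simulates a quick restart: the old attempt disappears, a new one starts. */
    public synchronized UUID restart(UUID oldAttemptId) {
        currentExecutions.remove(oldAttemptId);
        return startAttempt();
    }

    /**
     * Fails the job only if the attempt that declined the checkpoint is still
     * a current, running execution. Returns false for stale attempt IDs.
     */
    public synchronized boolean failJobDueToCheckpoint(UUID failedAttemptId) {
        if (currentExecutions.get(failedAttemptId) != State.RUNNING) {
            return false; // late decline message from an old attempt: ignore
        }
        jobFailures++;
        return true;
    }

    public synchronized int getJobFailures() {
        return jobFailures;
    }

    public static void main(String[] args) {
        GuardedFailJobSketch job = new GuardedFailJobSketch();
        UUID first = job.startAttempt();
        UUID second = job.restart(first);
        // A decline message for the first attempt arrives after the restart.
        System.out.println("stale attempt fails job: " + job.failJobDueToCheckpoint(first));
        System.out.println("current attempt fails job: " + job.failJobDueToCheckpoint(second));
    }
}
```

In this sketch the late decline for the first attempt is dropped, while a decline for the current attempt still fails the job. How the actual fix looks up the current executions is not described in this ticket.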
Issue Links
- causes: FLINK-13527 Instable KafkaProducerExactlyOnceITCase due to CheckpointFailureManager (Closed)
- is caused by: FLINK-13695 Integrate checkpoint notifications into StreamTask's lifecycle (Open)
- is duplicated by: FLINK-13527 Instable KafkaProducerExactlyOnceITCase due to CheckpointFailureManager (Closed)
- links to