[FLINK-13593] Prevent failing the wrong execution attempt in CheckpointFailureManager - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.9.0
Fix Version/s: 1.9.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

Because LegacyScheduler#declineCheckpoint handles checkpoint decline messages asynchronously, a message may be handled before the job status transition. In that case receiveDeclineMessage grabs the lock in CheckpointCoordinator before pendingCheckpoints is cleared by stopCheckpointScheduler (which is triggered by the job status listener CheckpointCoordinatorDeActivator). If the job/tasks then restart quickly enough, the FailJobCallback in CheckpointFailureManager might unexpectedly fail the job again, as observed in FLINK-13527.

To resolve the issue, we need to add a safeguard when failing the job: pass the ExecutionAttemptID through and check it against the current executions to make sure the to-be-failed execution is still running, so that we won't fail the newly restarted one by accident.
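The proposed safeguard can be sketched roughly as follows. This is a minimal illustration and not the actual Flink change: the class AttemptGuardSketch, its methods, and the State enum are hypothetical names, and a plain UUID stands in for Flink's ExecutionAttemptID. The point is only that a decline message carrying a stale attempt ID is ignored instead of failing the restarted job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Illustrative sketch only; not Flink code. All names here are hypothetical,
 * and UUID stands in for Flink's ExecutionAttemptID.
 */
class AttemptGuardSketch {

    enum State { RUNNING, FAILED }

    // Attempts the scheduler currently knows about, keyed by attempt ID.
    private final Map<UUID, State> currentExecutions = new HashMap<>();
    private int jobFailures = 0;

    /** Registers a new running attempt and returns its ID. */
    synchronized UUID startAttempt() {
        UUID id = UUID.randomUUID();
        currentExecutions.put(id, State.RUNNING);
        return id;
    }

    /** Simulates a restart: the old attempt goes away and a fresh one starts. */
    synchronized UUID restart(UUID oldAttempt) {
        currentExecutions.remove(oldAttempt);
        return startAttempt();
    }

    /**
     * Fails the job only if the attempt that declined the checkpoint is still
     * a current, running execution. Returns true if the job was failed.
     */
    synchronized boolean failJobDueToTaskFailure(UUID attempt, Throwable cause) {
        if (currentExecutions.get(attempt) != State.RUNNING) {
            // Stale decline from a previous attempt: ignore it.
            return false;
        }
        currentExecutions.put(attempt, State.FAILED);
        jobFailures++;
        return true;
    }

    synchronized int getJobFailures() {
        return jobFailures;
    }
}
```

In this sketch, a decline that arrives for the old attempt after restart(oldAttempt) returns false and leaves the failure count untouched, while a decline for the new attempt fails the job once.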

Issue Links

- causes: FLINK-13527 Instable KafkaProducerExactlyOnceITCase due to CheckpointFailureManager (Closed)
- is caused by: FLINK-13695 Integrate checkpoint notifications into StreamTask's lifecycle (Open)
- is duplicated by: FLINK-13527 Instable KafkaProducerExactlyOnceITCase due to CheckpointFailureManager (Closed)
- links to: GitHub Pull Request #9364

People

Assignee: Yu Li
Reporter: Yu Li
Votes: 0
Watchers: 4

Dates

Created: 06/Aug/19 05:55
Updated: 12/Aug/19 09:59
Resolved: 09/Aug/19 12:50

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m