Flink / FLINK-13593

Prevent failing the wrong execution attempt in CheckpointFailureManager

Details

    Description

      Because the checkpoint decline message is handled asynchronously in LegacyScheduler#declineCheckpoint, the message may be processed before the job status transition takes effect. In that case, receiveDeclineMessage grabs the lock in CheckpointCoordinator before pendingCheckpoints has been cleared by stopCheckpointScheduler (which is triggered by the job status listener CheckpointCoordinatorDeActivator). If the job or its tasks restart quickly enough, the FailJobCallback in CheckpointFailureManager might then unexpectedly fail the job a second time, as observed in FLINK-13527.

      To resolve the issue, we need to add a safeguard when failing the job: pass the ExecutionAttemptID through and check it against the current executions to make sure the execution to be failed is still running, so that we do not fail the newly restarted one by accident.
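
      The proposed safeguard can be illustrated with a minimal, self-contained sketch. It does not use Flink's actual classes: AttemptRegistry, startAttempt, restart, and failIfStillRunning are hypothetical names, and a UUID stands in for ExecutionAttemptID. The only point shown is that a failure request carrying a stale attempt ID is ignored instead of failing the restarted attempt.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; these are NOT Flink classes.
class GuardedFailJobSketch {

    enum State { RUNNING, FAILED }

    /** Hypothetical registry of the current execution attempts, keyed by attempt ID. */
    static final class AttemptRegistry {
        private final Map<UUID, State> current = new ConcurrentHashMap<>();

        /** Registers a new running attempt and returns its ID (stand-in for ExecutionAttemptID). */
        UUID startAttempt() {
            UUID id = UUID.randomUUID();
            current.put(id, State.RUNNING);
            return id;
        }

        /** Simulates a restart: the old attempt is dropped and a fresh one starts. */
        UUID restart(UUID oldAttempt) {
            current.remove(oldAttempt);
            return startAttempt();
        }

        /**
         * Fails the given attempt only if it is still among the current
         * executions and still RUNNING. Returns false for stale IDs.
         */
        boolean failIfStillRunning(UUID attemptId) {
            // Atomic compare-and-set on the map entry: RUNNING -> FAILED.
            return current.replace(attemptId, State.RUNNING, State.FAILED);
        }

        /** Returns the state of an attempt, or null if it is no longer current. */
        State stateOf(UUID attemptId) {
            return current.get(attemptId);
        }
    }

    public static void main(String[] args) {
        AttemptRegistry registry = new AttemptRegistry();
        UUID first = registry.startAttempt();

        // The job restarts before the late decline message is processed.
        UUID second = registry.restart(first);

        // The late message still carries the ID of the first attempt.
        boolean failedStale = registry.failIfStillRunning(first);

        System.out.println("stale attempt failed: " + failedStale);
        System.out.println("new attempt state: " + registry.stateOf(second));
    }
}
```

      In this sketch the check and the state change happen in a single atomic map operation, so a restart cannot slip in between them. How the real fix synchronizes the check with the scheduler is not described in this issue.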

            People

              liyu Yu Li
