Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- 1.15.0
- None
- None
Description
Unlike the AdaptiveScheduler, the DefaultScheduler fails fatally when an error occurs while cleaning up the checkpoint-related resources. This contradicts our new approach of retrying the cleanup of job-related data (see FLINK-25433). Instead, the DefaultScheduler should return a future that is completed exceptionally with that error, which enables the DefaultResourceCleaner to trigger a retry.
Neither scheduler implementation currently exposes errors that occur while shutting down the CompletedCheckpointStore or the CheckpointIDCounter. This needs to be addressed as well.
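The requested behavior could look roughly like the following sketch. It is not Flink code: `CleanupSketch`, `CheckpointResources`, `closeAsync` and `retry` are hypothetical names made up for illustration, and the immediate, bounded retry only stands in for whatever retry strategy the cleaner actually applies. The point is that a shutdown error ends up inside the returned future instead of being escalated as a fatal error, so a caller can decide to try again.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CleanupSketch {

    /** Hypothetical stand-in for the checkpoint-related resources of a job. */
    public interface CheckpointResources {
        void shutdown() throws Exception;
    }

    /**
     * Shuts the resources down and reports the outcome through the returned
     * future. A failure completes the future exceptionally instead of being
     * treated as fatal.
     */
    public static CompletableFuture<Void> closeAsync(CheckpointResources resources) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        try {
            resources.shutdown();
            result.complete(null);
        } catch (Exception e) {
            result.completeExceptionally(e);
        }
        return result;
    }

    /**
     * Hypothetical retry helper: re-runs the cleanup until it succeeds or
     * maxAttempts is exhausted. No backoff is applied in this sketch.
     */
    public static CompletableFuture<Void> retry(
            Supplier<CompletableFuture<Void>> cleanup, int maxAttempts) {
        CompletableFuture<CompletableFuture<Void>> next =
                cleanup.get().handle((ignored, error) -> {
                    if (error == null) {
                        return CompletableFuture.<Void>completedFuture(null);
                    }
                    if (maxAttempts <= 1) {
                        // Out of attempts: pass the last error on to the caller.
                        CompletableFuture<Void> failed = new CompletableFuture<>();
                        failed.completeExceptionally(error);
                        return failed;
                    }
                    return retry(cleanup, maxAttempts - 1);
                });
        return next.thenCompose(f -> f);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fails on the first two shutdown attempts, then succeeds.
        CheckpointResources flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("transient failure");
            }
        };
        retry(() -> closeAsync(flaky), 5).join();
        System.out.println("cleanup attempts: " + calls.get());
    }
}
```

If `closeAsync` rethrew the exception or triggered a fatal-error handler instead, `retry` would never get to see the failure, which is the situation this issue describes for the DefaultScheduler.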
Issue Links
- Discovered while testing
  - FLINK-25974 Make cancellation of jobs depend on the JobResultStore (Resolved)
- is blocked by
  - FLINK-26741 CheckpointIDCounter.shutdown should expose errors asynchronously (Resolved)
  - FLINK-26742 DefaultCompletedCheckpointStore.shutdown does not clean the checkpoints atomically (Closed)
- is related to
  - FLINK-25433 Integrate retry strategy for cleanup stage (Closed)
- relates to
  - FLINK-27355 JobManagerRunnerRegistry.localCleanupAsync does not call the JobManagerRunner.close method repeatedly (Open)