[FLINK-26114] DefaultScheduler fails fatally in case of an error when shutting down the checkpoint-related resources - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.15.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None

Description

In contrast to the AdaptiveScheduler, the DefaultScheduler fails fatally if an error occurs while cleaning up the checkpoint-related resources. This contradicts our new approach of retrying the cleanup of job-related data (see FLINK-25433). Instead, we want the DefaultScheduler to return a future that is completed exceptionally with the error, which enables the DefaultResourceCleaner to trigger a retry.
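
The proposed behavior can be illustrated with a minimal, self-contained sketch. It does not use Flink's actual classes: `CleanupRetrySketch`, `shutDownCheckpointResources`, `retryCleanup`, and the fixed attempt count are hypothetical names and simplifications chosen for illustration. The point is only that a failure reported through the returned future, rather than through a fatal error, leaves the caller free to retry.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class CleanupRetrySketch {

    /** Helper for Java 8, which has no CompletableFuture.failedFuture. */
    static CompletableFuture<Void> failed(Throwable error) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(error);
        return future;
    }

    /**
     * Hypothetical stand-in for the scheduler shutting down its
     * checkpoint-related resources. Errors are reported through the
     * returned future instead of being escalated as fatal.
     */
    static CompletableFuture<Void> shutDownCheckpointResources(Runnable shutdownAction) {
        try {
            shutdownAction.run();
            return CompletableFuture.completedFuture(null);
        } catch (Exception e) {
            return failed(e);
        }
    }

    /**
     * Hypothetical retry loop standing in for a resource cleaner: it
     * re-invokes the cleanup until it succeeds or the attempts run out.
     */
    static CompletableFuture<Void> retryCleanup(
            Supplier<CompletableFuture<Void>> cleanup, int attemptsLeft) {
        return cleanup.get()
                .<CompletableFuture<Void>>handle((ignored, error) -> {
                    if (error == null) {
                        return CompletableFuture.<Void>completedFuture(null);
                    }
                    if (attemptsLeft <= 1) {
                        return failed(error); // give up and surface the last error
                    }
                    return retryCleanup(cleanup, attemptsLeft - 1);
                })
                .thenCompose(next -> next);
    }
}
```

A caller would wrap the shutdown in a supplier, for example `retryCleanup(() -> shutDownCheckpointResources(action), 3)`. A real retry strategy would presumably add delays between attempts; the sketch leaves that out.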

Neither scheduler implementation currently exposes errors that occur during shutdown of the CompletedCheckpointStore or CheckpointIDCounter. This needs to be addressed as well.
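
One way to expose such shutdown errors is sketched below. The `Shutdownable` interface and the `shutdownBoth` method are hypothetical and do not mirror Flink's real CompletedCheckpointStore or CheckpointIDCounter signatures. The sketch attempts both shutdowns even if the first one fails, and reports any failure through the returned future instead of swallowing it.

```java
import java.util.concurrent.CompletableFuture;

final class CheckpointShutdownSketch {

    /** Hypothetical minimal view of a checkpoint component; not Flink's interface. */
    interface Shutdownable {
        void shutdown() throws Exception;
    }

    /**
     * Shuts down both components. A failure in one does not prevent the
     * other from being shut down; the first error is propagated and any
     * later error is attached to it as a suppressed exception.
     */
    static CompletableFuture<Void> shutdownBoth(Shutdownable store, Shutdownable idCounter) {
        Exception firstError = null;
        for (Shutdownable component : new Shutdownable[] {store, idCounter}) {
            try {
                component.shutdown();
            } catch (Exception e) {
                if (firstError == null) {
                    firstError = e;
                } else {
                    firstError.addSuppressed(e);
                }
            }
        }
        CompletableFuture<Void> result = new CompletableFuture<>();
        if (firstError == null) {
            result.complete(null);
        } else {
            result.completeExceptionally(firstError); // expose the error to the caller
        }
        return result;
    }
}
```

Collecting errors rather than stopping at the first one is a design choice of this sketch: it avoids leaving the second component untouched when the first shutdown fails.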

Issue Links

Discovered while testing:
FLINK-25974 Make cancellation of jobs depend on the JobResultStore (Resolved)

is blocked by:
FLINK-26741 CheckpointIDCounter.shutdown should expose errors asynchronously (Resolved)
FLINK-26742 DefaultCompletedCheckpointStore.shutdown does not clean the checkpoints atomically (Closed)

is related to:
FLINK-25433 Integrate retry strategy for cleanup stage (Closed)

relates to:
FLINK-27355 JobManagerRunnerRegistry.localCleanupAsync does not call the JobManagerRunner.close method repeatedly (Open)

People

Assignee: Atri Sharma

Reporter: Matthias Pohl

Votes: 0

Watchers: 7

Dates

Created: 14/Feb/22 08:08

Updated: 11/May/22 09:50