[FLINK-20222] The CheckpointCoordinator should reset the OperatorCoordinators when fail before the first checkpoint. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.11.3, 1.12.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

Right now, if a job fails before the first successful checkpoint, the CheckpointCoordinator does not reset the OperatorCoordinator state. This may leave the OperatorCoordinators in an inconsistent state.

The CheckpointCoordinator should also reset the OperatorCoordinator state in this case, just like it does for the master hooks. It essentially means "reset to no checkpoint". There are two options for the fix:

- Add a reset() method to the OperatorCoordinator.
- Call resetToCheckpoint(null) on the OperatorCoordinator.
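
The second option can be illustrated with a minimal sketch. It does not use Flink's actual classes: `SimpleCoordinator`, `CountingCoordinator`, and `restoreAfterFailure` are hypothetical names invented for illustration, and the real OperatorCoordinator interface may have a different signature. The only idea taken from the description is that a null checkpoint means "reset to no checkpoint".

```java
import java.nio.charset.StandardCharsets;

public class ResetSketch {

    /** Hypothetical, simplified stand-in for a coordinator; NOT Flink's real interface. */
    interface SimpleCoordinator {
        byte[] checkpoint();

        /** A null argument means "reset to no checkpoint", i.e. back to the initial state. */
        void resetToCheckpoint(byte[] checkpointData);
    }

    /** Toy coordinator whose only state is a counter of assigned splits. */
    static final class CountingCoordinator implements SimpleCoordinator {
        private int assignedSplits = 0;

        void assignSplit() {
            assignedSplits++;
        }

        int assignedSplits() {
            return assignedSplits;
        }

        @Override
        public byte[] checkpoint() {
            return Integer.toString(assignedSplits).getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void resetToCheckpoint(byte[] checkpointData) {
            // Without the null branch, a failure before the first checkpoint
            // would leave the pre-failure counter in place.
            assignedSplits = (checkpointData == null)
                    ? 0
                    : Integer.parseInt(new String(checkpointData, StandardCharsets.UTF_8));
        }
    }

    /**
     * Hypothetical restore path: always resets the coordinator, passing null
     * when no checkpoint has completed yet.
     */
    static void restoreAfterFailure(SimpleCoordinator coordinator, byte[] latestCheckpointOrNull) {
        coordinator.resetToCheckpoint(latestCheckpointOrNull);
    }

    public static void main(String[] args) {
        CountingCoordinator coordinator = new CountingCoordinator();
        coordinator.assignSplit();
        coordinator.assignSplit();

        // The job fails before any checkpoint completed: reset to "no checkpoint".
        restoreAfterFailure(coordinator, null);
        System.out.println("splits after reset: " + coordinator.assignedSplits());
    }
}
```

Under these assumptions, the first option would differ only in exposing a separate reset() method instead of overloading the meaning of a null argument.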

Issue Links

blocks

FLINK-20157 SourceCoordinatorProvider kills JobManager with IllegalStateException on job submission

Closed

links to

GitHub Pull Request #14186

People

Assignee: Stephan Ewen

Reporter: Jiangjie Qin

Votes: 0

Watchers: 3

Dates

Created: 18/Nov/20 13:37

Updated: 12/Dec/20 11:57

Resolved: 24/Nov/20 13:13