Details
-
Improvement
-
Status: Open
-
Not a Priority
-
Resolution: Unresolved
-
1.12.0
-
None
Description
As a follow up for FLINK-20290 we should assert that we resume from the latest checkpoint when doing a regional failover in the SourceCoordinators in order to avoid losing input splits (see FLINK-20427). If the assumption does not hold, then we should fail the job globally so that we reset the master state to a consistent view of the state. Such a behaviour can act as a safety net in case that Flink ever tries to recover from not the latest available checkpoint.
One idea how to solve it is to remember the latest completed checkpoint id somewhere along the way to the SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment and failing when the restored checkpoint id is smaller.
Attachments
Issue Links
- is related to
-
FLINK-20290 Duplicated output in FileSource continuous ITCase with TM failover
- Closed