[FLINK-20465] Fail globally when not resuming from the latest checkpoint in regional failover - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.12.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

As a follow up for ~~FLINK-20290~~ we should assert that we resume from the latest checkpoint when doing a regional failover in the SourceCoordinators in order to avoid losing input splits (see ~~FLINK-20427~~). If the assumption does not hold, then we should fail the job globally so that we reset the master state to a consistent view of the state. Such a behaviour can act as a safety net in case that Flink ever tries to recover from not the latest available checkpoint.

One idea how to solve it is to remember the latest completed checkpoint id somewhere along the way to the SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment and failing when the restored checkpoint id is smaller.

cc sewen, jqin

Attachments

Issue Links

is related to

FLINK-20290 Duplicated output in FileSource continuous ITCase with TM failover

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Till Rohrmann

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 03/Dec/20 09:06

Updated:: 19/Dec/21 10:39