FLINK-20465

Fail globally when not resuming from the latest checkpoint in regional failover

Details

    Description

      As a follow-up to FLINK-20290, we should assert in the SourceCoordinators that a regional failover resumes from the latest checkpoint, in order to avoid losing input splits (see FLINK-20427). If this assumption does not hold, we should fail the job globally so that the master state is reset to a consistent view. This behaviour acts as a safety net in case Flink ever tries to recover from a checkpoint other than the latest available one.

      One idea for solving this is to remember the latest completed checkpoint ID somewhere along the way to SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment and to fail when the restored checkpoint ID is smaller.
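
      A minimal sketch of that idea, not the actual Flink implementation: the class LatestCheckpointGuard and its methods are hypothetical names introduced only for illustration. It assumes that the coordinator is notified of every completed checkpoint and that the caller escalates the thrown exception into a global job failure.

```java
/**
 * Hypothetical helper (not part of Flink): remembers the latest completed
 * checkpoint ID and rejects a restore from any older checkpoint.
 */
final class LatestCheckpointGuard {

    /** Sentinel meaning that no checkpoint has completed yet. */
    private static final long NO_CHECKPOINT = -1L;

    private long latestCompletedCheckpointId = NO_CHECKPOINT;

    /** Records a completed checkpoint; out-of-order notifications never lower the value. */
    void onCheckpointComplete(long checkpointId) {
        latestCompletedCheckpointId = Math.max(latestCompletedCheckpointId, checkpointId);
    }

    /**
     * Throws if a regional failover tries to restore from a checkpoint older
     * than the latest completed one. The caller is assumed to turn this
     * exception into a global failover.
     */
    void assertRestoringFromLatest(long restoredCheckpointId) {
        if (restoredCheckpointId < latestCompletedCheckpointId) {
            throw new IllegalStateException(
                    "Restoring from checkpoint " + restoredCheckpointId
                            + " but the latest completed checkpoint is "
                            + latestCompletedCheckpointId
                            + "; failing globally to avoid losing input splits.");
        }
    }

    long getLatestCompletedCheckpointId() {
        return latestCompletedCheckpointId;
    }
}
```

      In this sketch the check would run just before the uncheckpointed split assignments are handed back, so that a stale restore is detected before any splits are dropped.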

      cc sewen, jqin

            People

              Assignee: Unassigned
              Reporter: Till Rohrmann (trohrmann)
              Votes: 0
              Watchers: 4
