Flink / FLINK-13452

Pipelined region failover strategy does not recover Job if checkpoint cannot be read


Details

    Description

      The job does not recover if a checkpoint cannot be read and jobmanager.execution.failover-strategy is set to "region".

      Analysis

      The RestartCallback created by AdaptedRestartPipelinedRegionStrategyNG throws a RuntimeException if no checkpoints could be read. When the restart is invoked in a separate thread pool, the exception is swallowed. See:

      https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119

      https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65
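The swallowing described in the analysis can be reproduced with the plain JDK, independent of Flink. The sketch below is only an illustration of the general mechanism, not the code at the links above: `SwallowedRestartDemo`, `restartCallback`, `scheduleUnchecked`, and `scheduleChecked` are hypothetical names, and a `ScheduledExecutorService` is assumed as a stand-in for the thread pool that runs the restart. A task that throws inside a scheduled executor has its exception stored in the returned future, so a caller that never inspects the future sees nothing.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class SwallowedRestartDemo {

    // Hypothetical stand-in for a restart callback that finds no readable checkpoint.
    static void restartCallback() {
        throw new RuntimeException("checkpoint could not be read");
    }

    // Fire-and-forget scheduling: the exception ends up inside the returned future
    // and is neither logged nor rethrown unless somebody calls get() on it.
    static ScheduledFuture<?> scheduleUnchecked(ScheduledExecutorService executor) {
        Runnable task = SwallowedRestartDemo::restartCallback;
        return executor.schedule(task, 10, TimeUnit.MILLISECONDS);
    }

    // Defensive variant: catch the failure inside the task and pass it to a handler,
    // which could then escalate (for example, by failing the whole job).
    static ScheduledFuture<?> scheduleChecked(
            ScheduledExecutorService executor, Consumer<Throwable> onFailure) {
        Runnable task = () -> {
            try {
                restartCallback();
            } catch (Throwable t) {
                onFailure.accept(t);
            }
        };
        return executor.schedule(task, 10, TimeUnit.MILLISECONDS);
    }

    // Returns the exception captured by the future of the unchecked variant, or null.
    static Throwable hiddenFailure() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduleUnchecked(executor).get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    // Returns the exception that reached the handler of the checked variant, or null.
    static Throwable reportedFailure() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicReference<Throwable> seen = new AtomicReference<>();
        try {
            scheduleChecked(executor, seen::set).get();
        } catch (ExecutionException | InterruptedException e) {
            // Not expected here: the task catches its own failure.
        } finally {
            executor.shutdownNow();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("hidden in future: " + hiddenFailure());
        System.out.println("seen by handler:  " + reportedFailure());
    }
}
```

In the unchecked variant the failure only becomes visible because `hiddenFailure` explicitly calls `get()`. A caller that drops the future, as a fire-and-forget restart would, never learns that the callback failed, which matches the symptom reported here.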

      Expected behavior

      • Job should restart

       


          People

            Yun Tang (yunta)
            Gary Yao (gjy)
            Votes: 0
            Watchers: 5

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h
