Flink / FLINK-13452

Pipelined region failover strategy does not recover Job if checkpoint cannot be read

Details

    Description

      The job does not recover if a checkpoint cannot be read and jobmanager.execution.failover-strategy is set to "region".

      Analysis

      The RestartCallback created by AdaptedRestartPipelinedRegionStrategyNG throws a RuntimeException if no checkpoint could be read. When the restart is invoked in a separate thread pool, that exception is swallowed, so the failure is never surfaced and the job is not restarted. See:

      https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119

      https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65
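
      The swallowing described in the analysis can be reproduced with plain java.util.concurrent, independent of Flink. The sketch below is illustrative only and is not the Flink code: the class SwallowedExceptionDemo, the method triggerFullRecovery, and the exception message are hypothetical stand-ins for the restart callback. It assumes the callback is handed to an executor and the returned Future is never inspected, which is one common way for such an exception to disappear.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SwallowedExceptionDemo {

    /** Hypothetical stand-in for a restart callback that fails because no checkpoint can be read. */
    static void triggerFullRecovery() {
        throw new RuntimeException("Failed to restore from checkpoint");
    }

    /**
     * Runs the callback on a separate thread pool and returns the failure that
     * the caller gets to see: null when the Future is ignored, or the original
     * exception when the Future is inspected.
     */
    static Throwable restartOnExecutor(boolean inspectFuture) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Runnable restartCallback = SwallowedExceptionDemo::triggerFullRecovery;
            Future<?> future = executor.submit(restartCallback);
            if (!inspectFuture) {
                // Fire-and-forget: the exception is captured inside the Future
                // and nobody ever looks at it, so it is effectively swallowed.
                return null;
            }
            try {
                future.get();
                return null;
            } catch (ExecutionException e) {
                // Inspecting the Future surfaces the original exception as the cause.
                return e.getCause();
            }
        } finally {
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("ignored future   -> " + restartOnExecutor(false));
        System.out.println("inspected future -> " + restartOnExecutor(true));
    }
}
```

      In the first call the caller observes no failure at all even though the callback threw; in the second, the RuntimeException becomes visible because the Future is checked. Any fix along these lines (inspecting the result, or catching the exception inside the callback and failing the job explicitly) is a design choice for the actual patch and is not prescribed by this sketch.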

      Expected behavior

      • Job should restart

       

      Attachments

        1. jobmanager.log (2.21 MB, Gary Yao)

            People

              yunta Yun Tang
              gjy Gary Yao
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h