Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4256

Fine-grained recovery

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.9.0
    • Component/s: Runtime / Coordination
    • Labels:
      None

      Description

      When a task fails during execution, Flink currently resets the entire execution graph and triggers complete re-execution from the last completed checkpoint. This is more expensive than just re-executing the failed tasks.

      In many cases, more fine-grained recovery is possible.

      The full description and design is in the corresponding FLIP.

      https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

      The detail desgin for version1 is https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#

        Attachments

          Issue Links

          1.
          Remove Serializabiliy of ExecutionGraph Sub-task Closed Stephan Ewen  
          2.
          Fix misleading ScheduleMode names Sub-task Closed Stephan Ewen  
          3.
          Initialize TaskExecutions directly with their starting state Sub-task Closed Stephan Ewen  
          4.
          The implementation of FailoverRegion. Sub-task Closed shuai.xu  
          5.
          The implementation of RestartPipelinedRegionStrategy Sub-task Closed shuai.xu  
          6.
          Implement a new RestartStrategy that works for the FailoverRegion. Sub-task Closed shuai.xu  
          7.
          ExecutionGraph use FailoverCoordinator to manage the failover of execution vertexes Sub-task Closed shuai.xu  
          8.
          Introduce a TerminationFuture for Execution Sub-task Closed Stephan Ewen  
          9.
          Introduce the abstract PartitionException for downstream task failure Sub-task Resolved zhijiang
          10.
          Batch Job: InputSplit Fault tolerant for DataSourceTask Sub-task Closed ryantaocer
          11.
          Backtrack failover regions if intermediate results are unavailable Sub-task Resolved Zhu Zhu
          12.
          Resetting ExecutionVertex in region failover may cause inconsistency of IntermediateResult status Sub-task Resolved Zhu Zhu
          13.
          Implement a region failover strategy based on new FailoverStrategy interfaces Sub-task Closed Zhu Zhu
          14.
          Introduce PartitionConnectionException for unreachable producer Sub-task Resolved zhijiang
          15.
          Implement ExecutionGraph to FailoverTopology Adapter Sub-task Closed Zhu Zhu
          16.
          Add an adapter of region failover NG for legacy scheduler Sub-task Closed Zhu Zhu
          17.
          IT test for fine-grained recovery (user code failures) Sub-task Closed Andrey Zagrebin
          18.
          Leverage JM side partition state to improve region failover experience Sub-task Closed Zhu Zhu
          19.
          IT test for fine-grained recovery (task executor failures) Sub-task Closed Andrey Zagrebin
          20.
          Run HA dataset E2E test with new RestartPipelinedRegionStrategy Sub-task Resolved Gary Yao
          21.
          Add documentation for AdaptedRestartPipelinedRegionStrategyNG Sub-task Resolved Zhu Zhu
          22.
          Set jobmanager.execution.failover-strategy to region in default flink-conf.yaml Sub-task Closed Chesnay Schepler

            Activity

              People

              • Assignee:
                StephanEwen Stephan Ewen
                Reporter:
                StephanEwen Stephan Ewen
              • Votes:
                0 Vote for this issue
                Watchers:
                42 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - 168h Original Estimate - 168h
                  168h
                  Remaining:
                  Time Spent - 5h 10m Remaining Estimate - 167.5h
                  167.5h
                  Logged:
                  Time Spent - 5h 10m Remaining Estimate - 167.5h
                  5h 10m