Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: JobManager
    • Labels: None

      Description

      When a task fails during execution, Flink currently resets the entire execution graph and triggers complete re-execution from the last completed checkpoint. This is more expensive than just re-executing the failed tasks.

      In many cases, more fine-grained recovery is possible.

      The full description and design is in the corresponding FLIP.

      https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

      The detailed design for version 1 is at https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#

          Activity

          Wenlong Lyu added a comment - edited

          Hi Stephan, we have implemented a similar solution before, but simple backward and forward tracking cannot work well in the following situation:

          Assuming the job graph has job vertices A/B/C, where A is connected to C with a forward strategy and B is connected to C with an all-to-all strategy: when a task of A fails, only one task of C will be added to the restart node set.

          I suggest treating the job graph as an undirected graph and dividing it into maximal connected sub-graphs when the job graph is submitted. When a task fails over, the whole related sub-graph is restarted.

          Besides, when the job graph is large, extracting the related nodes for a given node can be time-costly, and this is done repeatedly in a long-running job; using pre-computed sub-graphs avoids that problem.
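          The suggested pre-computation could be sketched with a union-find over the job graph treated as undirected (an illustrative sketch only; `SubGraphPartitioner` and its methods are hypothetical names, not Flink APIs):

          ```java
          import java.util.*;

          // Sketch: partition a job graph (viewed as undirected) into maximal
          // connected sub-graphs at submission time, so that a task failover
          // can restart the whole related sub-graph without graph traversal.
          public class SubGraphPartitioner {
              private final Map<String, String> parent = new HashMap<>();

              private String find(String v) {
                  parent.putIfAbsent(v, v);
                  String p = parent.get(v);
                  if (!p.equals(v)) {
                      p = find(p);
                      parent.put(v, p); // path compression
                  }
                  return p;
              }

              // Record an edge between two job vertices, ignoring direction.
              public void connect(String a, String b) {
                  parent.put(find(a), find(b));
              }

              // All vertices sharing a root form one sub-graph that restarts together.
              public Map<String, Set<String>> subGraphs() {
                  Map<String, Set<String>> groups = new HashMap<>();
                  for (String v : new ArrayList<>(parent.keySet())) {
                      groups.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
                  }
                  return groups;
              }
          }
          ```

          For the A/B/C example above, `connect("A", "C")` and `connect("B", "C")` produce a single sub-graph {A, B, C}, so a failure in A restarts B and C as well.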

          zhijiang added a comment -

          As a further improvement, if task C fails, the downstream tasks such as D and E should not be restarted. We have already done some work related to this.
          I am interested in the idea of caching the intermediate result, which could avoid restarting the upstream tasks of the failed one. Is that the PIPELINED_PERSISTENT type of result partition? I would like to hear the further plan for it.

          Stephan Ewen added a comment -

          Wenlong Lyu True, one has to find the entire "connected component" for the restart. That component is, however, dynamic, so I would not pre-compute it:

          • We may introduce best-effort caching, which means that in some cases the program must backtrack further, in others less.
          • Downstream cancelling is only necessary if input has already been supplied to the downstream task. Especially in batch, this is often not the case, which reduces the set of tasks to examine for cancelling.

          We can make this quite a bit more efficient, in my opinion, by operating at the ExecutionJobVertex level in many cases, rather than on each individual vertex.
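          The two points above can be sketched as a dynamic restart-set computation, decided per failure rather than pre-computed (illustrative only; the helper predicates `resultCached` and `consumedInput`, and the class name, are assumptions, not Flink code):

          ```java
          import java.util.*;
          import java.util.function.Predicate;

          // Sketch: compute the restart set dynamically. Backtracking stops at
          // producers whose intermediate result is still cached; downstream
          // tasks are only included if they already consumed (now invalid) input.
          public class RestartSet {
              public static Set<String> compute(
                      String failed,
                      Map<String, List<String>> producers,  // task -> its upstream tasks
                      Map<String, List<String>> consumers,  // task -> its downstream tasks
                      Predicate<String> resultCached,       // producer result still available?
                      Predicate<String> consumedInput) {    // consumer already received data?
                  Set<String> toRestart = new HashSet<>();
                  Deque<String> queue = new ArrayDeque<>();
                  queue.add(failed);
                  while (!queue.isEmpty()) {
                      String task = queue.poll();
                      if (!toRestart.add(task)) continue;
                      // Backtrack only where the input result must be re-produced.
                      for (String up : producers.getOrDefault(task, List.of())) {
                          if (!resultCached.test(up)) queue.add(up);
                      }
                      // Cancel downstream only if it already consumed data from this task.
                      for (String down : consumers.getOrDefault(task, List.of())) {
                          if (consumedInput.test(down)) queue.add(down);
                      }
                  }
                  return toRestart;
              }
          }
          ```

          With all results cached and no input yet consumed downstream, only the failed task itself restarts; without caching, the traversal grows toward the full connected component.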

          Stephan Ewen added a comment -

          zhijiang Preventing downstream restarts would be a follow-up optimization.
          In order not to make this issue more complicated than it already is, I would first solve this, and then approach that as a separate follow-up.

          Wenlong Lyu added a comment -

          Thanks for explaining; you are right about pre-computing. I still have another concern: I think it is quite a special case for a job to be splittable at the ExecutionJobVertex level; in practice this may only happen in batch job graphs with blocking edges.

          ASF GitHub Bot added a comment -

          GitHub user shuai-xu opened a pull request:

          https://github.com/apache/flink/pull/3539

          FLINK-4256 Flip1: fine gained recovery

          This is an informal PR for the implementation of FLIP-1 version 1.
          It enables that when a task fails, only the minimal pipelined connected executions are restarted instead of the whole execution graph.
          Main changes:
          1. ExecutionGraph no longer manages failover; it only records the number of finished JobVertexes and turns to FINISHED when all vertexes finish (maybe later FailoverCoordinator will take over this). Its state can now only be CREATED, RUNNING, FAILED, FINISHED, or SUSPENDED.
          2. FailoverCoordinator now manages failover. It generates several FailoverRegions when the EG is attached, and it listens for execution failures. When an execution fails, it finds the FailoverRegion that completes the failover.
          3. When the JM needs the EG to be cancelled or failed, the EG also notifies the FailoverCoordinator; the FailoverCoordinator notifies all FailoverRegions to cancel their executions, and when all executions are cancelled, the FailoverCoordinator notifies the EG to become CANCELED or FAILED.
          4. FailoverCoordinator has several states: RUNNING, FAILING, CANCELLING, FAILED, CANCELED.
          5. A FailoverRegion contains the minimal pipelined connected executions and manages their failover.
          6. FailoverRegion has the states CREATED, RUNNING, CANCELLING, CANCELLED.
          7. One FailoverRegion may be the succeeding or preceding region of others. When a preceding region fails over, all of its succeeding regions should fail over too. The succeeding regions should just reset their executions and wait for the preceding region to start them when it finishes. The preceding region should wait for its succeeding regions to be CREATED and then schedule again.
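          The region states listed in point 6 can be sketched as a small state machine (the transitions here are assumptions inferred from the description, including the reset back to CREATED from point 7; this is illustrative, not code from the PR):

          ```java
          // Sketch of the FailoverRegion lifecycle described above.
          // Transitions are assumed: CREATED -> RUNNING on schedule,
          // RUNNING -> CANCELLING on failure or cancellation,
          // CANCELLING -> CANCELLED once all executions stop,
          // CANCELLED -> CREATED when the region is reset for a restart.
          public enum FailoverRegionState {
              CREATED, RUNNING, CANCELLING, CANCELLED;

              public boolean canTransitionTo(FailoverRegionState next) {
                  switch (this) {
                      case CREATED:    return next == RUNNING;
                      case RUNNING:    return next == CANCELLING;
                      case CANCELLING: return next == CANCELLED;
                      case CANCELLED:  return next == CREATED;
                      default:         return false;
                  }
              }
          }
          ```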

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/shuai-xu/flink jira-4256

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3539.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3539


          commit 363f1536838064edbdd5f39e41f3f19f6c511fc4
          Author: shuai.xus <shuai.xus@alibaba-inc.com>
          Date: 2017-03-15T03:36:11Z

          FLINK-4256 Flip1: fine gained recovery


          Eron Wright added a comment -

          Is the scope of FLINK-4256 covering batch jobs only? There are various TODOs related to the interplay between local failover and checkpointing, for example. I am wondering whether additional subtasks should be opened here or elsewhere.

          Sihua Zhou added a comment -

          Partial recovery seems to be more useful for batch jobs; for streaming jobs, is it also fine to place `local recovery` under this umbrella? Stephan Ewen


            People

            • Assignee: Stephan Ewen
            • Reporter: Stephan Ewen
            • Votes: 0
            • Watchers: 23
