Details
- Improvement
- Status: Closed
- Major
- Resolution: Fixed
- 1.1.0
- None
Description
When a task fails during execution, Flink currently resets the entire execution graph and triggers complete re-execution from the last completed checkpoint. This is more expensive than just re-executing the failed tasks.
In many cases, more fine-grained recovery is possible.
The full description and design are in the corresponding FLIP.
The detailed design for version 1 is at https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#
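To illustrate the cost difference described above, here is a minimal sketch in plain Java. It does not use Flink's API. It assumes, purely for illustration, that tasks connected by pipelined data exchanges must restart together, while a blocking exchange lets the downstream task be left alone. The class `RestartScopeSketch`, its methods, and the task names are hypothetical and are not taken from this issue or from the design document.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration only; not Flink code. */
public class RestartScopeSketch {

    /** Registers a task with no pipelined neighbours yet. */
    public static void addTask(Map<String, Set<String>> graph, String task) {
        graph.computeIfAbsent(task, k -> new TreeSet<>());
    }

    /** Records an (assumed) pipelined connection between two tasks, in both directions. */
    public static void linkPipelined(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    /** Full restart: every task in the job is reset. */
    public static Set<String> fullRestart(Map<String, Set<String>> graph) {
        return new TreeSet<>(graph.keySet());
    }

    /** Fine-grained restart: only tasks reachable from the failed task via pipelined links. */
    public static Set<String> fineGrainedRestart(Map<String, Set<String>> graph, String failed) {
        Set<String> toRestart = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(failed);
        while (!queue.isEmpty()) {
            String task = queue.poll();
            if (toRestart.add(task)) {
                queue.addAll(graph.getOrDefault(task, Collections.emptySet()));
            }
        }
        return toRestart;
    }

    /** Two independent source-to-map pipelines; the sink is assumed to read via blocking exchanges. */
    public static Map<String, Set<String>> exampleGraph() {
        Map<String, Set<String>> graph = new TreeMap<>();
        linkPipelined(graph, "source1", "map1");
        linkPipelined(graph, "source2", "map2");
        addTask(graph, "sink");
        return graph;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = exampleGraph();
        System.out.println("full restart:         " + fullRestart(graph));
        System.out.println("fine-grained restart: " + fineGrainedRestart(graph, "map1"));
    }
}
```

In this toy graph, a failure of `map1` resets all five tasks under a full restart, but only `map1` and `source1` under the fine-grained variant. The actual strategy and its constraints are defined in the FLIP and the design document linked above.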
Issue Links
- is blocked by
  - FLINK-4322 Unify CheckpointCoordinator and SavepointCoordinator (Closed)
- relates to
  - FLINK-13371 Release partitions in JM if producer restarts (Closed)
  - FLINK-12069 Add proper lifecycle management for intermediate result partitions (Closed)
  - FLINK-10288 Failover Strategy improvement (Closed)
- links to