[FLINK-10288] Failover Strategy improvement - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Abandoned
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:
None

Description

Flink pays significant efforts to make Streaming Job fault tolerant. The checkpoint mechanism and exactly once semantics make Flink different than other systems. However, there are still some cases not been handled very well. Those cases can apply to both Streaming and Batch scenarios, and its orthogonal with current fault tolerant mechanism. Here is a summary of those cases:

Some failures are non-recoverable, such as a user error: DividebyZeroException. We shouldn't try to restart the task, as it will never succeed. The DivideByZeroException is just a simple case, those errors sometime are not easy to reproduce or predict, as it might be only triggered by specific input data, we shouldn’t retry for all user code exceptions.
There is no limit for task retry today, unless a SuppressRestartException was encountered, a task will keep on retrying until it succeeds. As mentioned above, we shouldn’t retry for some cases at all, and for the Exceptions we can retry, such as a network exception, should we have a retry limit? We need retry for any transient issue, but we also need to set a limit to avoid infinite retry and resource wasting. For Batch and Streaming workload, we might need different strategies.
There are some exceptions due to hardware issues, such as disk/network malfunction. when a task/TaskManager fail on this, we’d better detect and avoid to schedule to that machine next time.
If a task read from a blocking result partition, when its input is not available, we can ‘revoke’ the produce task, set the task fail and rerun the upstream task to regenerate data. the revoke can propagate up through the chain. In Spark, revoke is naturally support by lineage.

To make fault tolerance easier, we need to keep deterministic behavior as much as possible. For user code, it’s not easy to control. However, for system related code, we can fix it. For example, we should at least make sure the different attempt of a same task to have the same inputs (we have a bug in current codebase (DataSourceTask) that cannot guarantee this). Note that this is track by [Flink-10205]

Details see this proposal:

https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing

Attachments

Issue Links

is related to

FLINK-4256 Fine-grained recovery

Closed

Sub-Tasks

1.	Classify Exceptions to different category for apply different failover strategy	Closed	JIN SUN
2.	Batch Job Failover Strategy	Closed	ryantaocer
3.	Enable Per-job level failover strategy.	Closed	ryantaocer
4.	Support task revocation	Closed	ryantaocer
5.	Update documents of new batch job failover strategy	Closed	ryantaocer
6.	Node health manager	Closed	ryantaocer

Activity

People

Assignee:: ryantaocer

Reporter:: JIN SUN

Votes:: 1 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 06/Sep/18 07:17

Updated:: 09/Oct/20 12:43

Resolved:: 09/Oct/20 12:43