[FLINK-8042] Retry individual failover-strategy for some time first before reverting to full job restart - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Won't Fix
Affects Version/s: 1.3.2
Fix Version/s: 1.5.0
Component/s: Runtime / Coordination, Runtime / State Backends
Labels: None

Description

Suppose we lose a TaskManager node. Flink attempts fine-grained recovery, but if that attempt fails because the replacement TaskManager node did not come back in time, Flink reverts to a full job restart.

Stephan and Till suggested that Flink can/should keep retrying fine-grained recovery for some time before giving up and reverting to a full job restart.
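
The proposed behavior could be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not Flink code: `recover_with_fallback`, `try_local_recovery`, `full_restart`, `retry_window_s`, and `retry_interval_s` are all hypothetical names introduced here, and the retry window and interval are assumed to be configurable values.

```python
import time
from typing import Callable


def recover_with_fallback(
    try_local_recovery: Callable[[], bool],  # hypothetical: one fine-grained recovery attempt
    full_restart: Callable[[], None],        # hypothetical: restart the whole job
    retry_window_s: float,                   # assumed setting: how long to keep retrying
    retry_interval_s: float,                 # assumed setting: pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry fine-grained recovery until the window expires, then fall back."""
    deadline = clock() + retry_window_s
    while True:
        if try_local_recovery():
            return "fine-grained"
        # Stop retrying if the next attempt would start after the deadline.
        if clock() + retry_interval_s > deadline:
            break
        sleep(retry_interval_s)
    # Fine-grained recovery did not succeed within the window.
    full_restart()
    return "full-restart"
```

The clock and sleep functions are injectable only so the sketch can be exercised without real waiting. A real implementation would also have to decide whether the fallback counts toward the job's restart budget.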

Issue Links

relates to

FLINK-8043 change fullRestarts (for fine grained recovery) from gauge to counter

Closed

People

Assignee: Unassigned

Reporter: Steven Zhen Wu

Votes: 0

Watchers: 6

Dates

Created: 10/Nov/17 01:29

Updated: 02/Oct/19 17:43

Resolved: 12/Mar/18 14:19