[FLINK-21884] Reduce TaskManager failure detection time - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.14.0, 1.13.2
Fix Version/s: 1.15.0
Component/s: Runtime / Coordination
Labels:
- reactive

Description

In Flink 1.13 (and older versions), TaskManager failures stall the processing for a significant amount of time, even though the system gets indications for the failure almost immediately through network connection losses.

This is due to the high default heartbeat timeout of 50 seconds [1], which is chosen to accommodate GC pauses, transient network disruptions, or generally slow environments (with a shorter timeout, we would risk unregistering a healthy TaskManager).
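To make the cost of this timeout concrete, here is a minimal sketch of timeout-based failure detection in plain Java. It is not Flink's heartbeat implementation: the class `HeartbeatMonitor` and its methods are hypothetical names chosen for illustration, and only the 50-second value comes from the description above.

```java
import java.time.Duration;

/** Hypothetical timeout-based failure detector; not Flink's actual heartbeat code. */
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    public HeartbeatMonitor(Duration timeout, long nowMillis) {
        this.timeoutMillis = timeout.toMillis();
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Records a heartbeat received at the given time. */
    public void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** The peer counts as dead only once a full timeout has passed without a heartbeat. */
    public boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(Duration.ofSeconds(50), 0L);
        monitor.onHeartbeat(10_000L); // last heartbeat before the TaskManager goes away
        // One second later the connection loss may already be visible in the logs,
        // but a detector that relies only on the timeout still treats the peer as alive.
        System.out.println(monitor.isTimedOut(11_000L));
        // Only 50 seconds after the last heartbeat is the peer declared dead.
        System.out.println(monitor.isTimedOut(60_000L));
    }
}
```

A detector that relies only on the timeout cannot react earlier than this, no matter how quickly other signals such as connection losses arrive.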

Such a high timeout can disrupt processing (periods without any processing, high latencies, buildup of consumer lag, etc.). In Reactive Mode (FLINK-10407), the issue surfaces on scale-down events, where the loss of a TaskManager is immediately visible in the logs, but the job is stuck in "FAILING" for quite a while until the TaskManager is actually deregistered. (Note that this issue is less critical in an autoscaling setup, because Flink can control the scale-down events and trigger them proactively.)

On the attached metrics dashboard, one can see that the job has significant throughput drops / consumer lags during scale down (and also CPU usage spikes on processing the queued events, leading to incorrect scale up events again).

One idea to solve this problem is to:

- Score TaskManagers based on certain signals (# exceptions reported, exception types (connection losses, akka failures), failure frequencies, ...) and blacklist them accordingly.
- Introduce a best-effort TaskManager unregistration mechanism: When a TaskManager receives a SIGTERM, it sends a final message to the JobManager saying "goodbye", and the JobManager can immediately remove the TM from its bookkeeping.
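The two ideas above could be sketched roughly as follows. This is an illustration only, not the change that resolved this issue: `TaskManagerTracker`, the `Signal` enum, its weights, and the threshold are all hypothetical and do not exist in Flink.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical JobManager-side bookkeeping; none of these names exist in Flink. */
public class TaskManagerTracker {
    /** Illustrative failure signals with made-up weights. */
    public enum Signal {
        CONNECTION_LOSS(3), RPC_FAILURE(2), OTHER_EXCEPTION(1);

        final int weight;

        Signal(int weight) {
            this.weight = weight;
        }
    }

    private final int blockThreshold;
    private final Set<String> registered = new HashSet<>();
    private final Map<String, Integer> scores = new HashMap<>();

    public TaskManagerTracker(int blockThreshold) {
        this.blockThreshold = blockThreshold;
    }

    public void register(String tmId) {
        registered.add(tmId);
        scores.put(tmId, 0);
    }

    /** Idea 1: accumulate a score per TaskManager from reported failure signals. */
    public void reportSignal(String tmId, Signal signal) {
        if (registered.contains(tmId)) {
            scores.merge(tmId, signal.weight, Integer::sum);
        }
    }

    /** A TaskManager is blocked once its accumulated score reaches the threshold. */
    public boolean isBlocked(String tmId) {
        return scores.getOrDefault(tmId, 0) >= blockThreshold;
    }

    /** Idea 2: a best-effort "goodbye" removes the TaskManager without waiting for a timeout. */
    public void onGoodbye(String tmId) {
        registered.remove(tmId);
        scores.remove(tmId);
    }

    public boolean isRegistered(String tmId) {
        return registered.contains(tmId);
    }

    public static void main(String[] args) {
        TaskManagerTracker tracker = new TaskManagerTracker(5);
        tracker.register("tm-1");
        tracker.register("tm-2");
        tracker.reportSignal("tm-1", Signal.CONNECTION_LOSS);
        tracker.reportSignal("tm-1", Signal.RPC_FAILURE);
        System.out.println("tm-1 blocked: " + tracker.isBlocked("tm-1"));
        tracker.onGoodbye("tm-2");
        System.out.println("tm-2 registered: " + tracker.isRegistered("tm-2"));
    }
}
```

On the TaskManager side, the "goodbye" could be sent from a JVM shutdown hook (`Runtime.getRuntime().addShutdownHook(...)`), which runs on SIGTERM but not on SIGKILL. That is why the mechanism can only be best-effort and the heartbeat timeout is still needed as a fallback.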

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout

Attachments

image-2021-03-19-20-10-40-324.png
19/Mar/21 19:10
342 kB
Robert Metzger

Issue Links

is related to: FLINK-10407 FLIP-159: Reactive mode (Closed)

People

Assignee: Unassigned

Reporter: Robert Metzger

Votes: 0

Watchers: 10

Dates

Created: 19/Mar/21 19:14

Updated: 22/Feb/22 10:16

Resolved: 22/Feb/22 10:16