[FLINK-10226] Latency metrics can choke job-manager - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: Runtime / Metrics
Labels: None

Description

With Flink 1.5.0, my Apache Beam job could not run unless I turned off the latencyTracking feature. The job generated a huge number of latency metrics and histogram aggregates, and updating them kept the job manager so busy that the cluster fell apart.

This was discussed on the mailing list:

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-cluster-crashing-going-from-1-4-0-gt-1-5-3-td23941.html

The purpose of this ticket is to reason about how to improve this, and on which end. I am currently not sure what the root cause is:
a) The Beam-to-Flink translation generates too many "noise operators"
b) Flink does not handle latencyTracking well for large jobs
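
To get a feel for how both suspected causes could compound, here is a rough back-of-the-envelope sketch. It assumes, as a simplification that is not confirmed in this ticket, that one latency histogram is kept per (source subtask, operator subtask) pair. The class name, method name, and the job sizes are hypothetical and purely illustrative.

```java
// Rough estimate only. ASSUMPTION: one latency histogram per
// (source subtask, operator subtask) pair. Names and numbers are hypothetical.
class LatencyMetricEstimate {

    // sources: number of source operators in the job graph
    // operators: number of operators that record latency
    // parallelism: subtasks per operator (assumed uniform across the job)
    static long histogramCount(int sources, int operators, int parallelism) {
        long sourceSubtasks = (long) sources * parallelism;
        long operatorSubtasks = (long) operators * parallelism;
        return sourceSubtasks * operatorSubtasks;
    }

    public static void main(String[] args) {
        // A translated job with many small operators (hypothetical sizes).
        System.out.println(histogramCount(5, 200, 10));
    }
}
```

Under this assumption the count grows with the number of operators and with the square of the parallelism, so extra operators from a translation layer (a) and per-pair tracking in Flink (b) would multiply each other rather than add up.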

Issue Links

relates to

FLINK-10484 New latency tracking metrics format causes metrics cardinality explosion (Closed)
FLINK-10246 Harden and separate MetricQueryService (Closed)

People

Assignee: Unassigned
Reporter: Jozef Vilcek
Votes: 0
Watchers: 6

Dates

Created: 27/Aug/18 12:34
Updated: 09/May/19 08:10
Resolved: 09/May/19 08:10