[FLINK-10521] Faulty Histogram stops Prometheus metrics from being reported - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.6.1
Fix Version/s: None
Component/s: Runtime / Metrics
Labels:
None
Environment:

Flink 1.6.1 cluster with one taskmanager and one jobmanager, plus Prometheus and Grafana, all started in a local Docker environment.

See sample project at: https://github.com/florianschmidt1994/flink-fault-tolerance-baseline


Description

In my setup I am using the Prometheus reporter and a custom histogram metric implementation. After a while the histogram starts throwing exceptions (because it is rather poorly implemented). This causes all metrics on the taskmanager where the histogram is running to stop being reported. The Prometheus logs show that requests to taskmanager:9249/metrics return an empty response while a metric is faulty.

Expected:

A faulty metric implementation causes only that specific metric to stop being reported.

Actual:

A faulty metric causes all metrics on that taskmanager to stop being reported.
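
To illustrate the expected behaviour, here is a minimal sketch of per-metric isolation during a scrape. It is not the code of Flink's Prometheus reporter: the class `IsolatedScrape`, the method `render`, the metric names, and the simplified "name value" output format are all hypothetical and exist only for this example. The idea is that each metric is read inside its own try/catch, so an exception from one metric drops only that metric from the response.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of Flink or the Prometheus client.
class IsolatedScrape {

    /**
     * Renders every metric as a "name value" line. A metric whose supplier
     * throws is skipped, and the remaining metrics are still rendered.
     */
    static String render(Map<String, Supplier<Number>> metrics) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Supplier<Number>> entry : metrics.entrySet()) {
            try {
                Number value = entry.getValue().get();
                out.append(entry.getKey()).append(' ').append(value).append('\n');
            } catch (RuntimeException ex) {
                // Isolate the failure: log it and continue with the next metric.
                System.err.println("Skipping metric " + entry.getKey() + ": " + ex);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Supplier<Number>> metrics = new LinkedHashMap<>();
        metrics.put("records_in", () -> 42);
        // Stands in for the faulty custom histogram described above.
        metrics.put("faulty_histogram_p99", () -> {
            throw new IllegalStateException("empty sample");
        });
        metrics.put("records_out", () -> 40);

        // The healthy metrics are still part of the response.
        System.out.print(render(metrics));
    }
}
```

With this structure the response contains `records_in` and `records_out`, while the faulty histogram is only logged to stderr. The "Actual" behaviour above corresponds to the exception escaping the loop, so no metric is written at all.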

Attachments

taskmanager.log
11/Oct/18 08:50
375 kB
Florian Schmidt
prometheus.log
11/Oct/18 08:50
56 kB
Florian Schmidt
Screenshot 2018-10-10 at 11.32.59.png
10/Oct/18 09:33
36 kB
Florian Schmidt

People

Assignee: Unassigned

Reporter: Florian Schmidt

Votes: 0

Watchers: 3

Dates

Created: 10/Oct/18 09:24

Updated: 13/Mar/20 10:29

Resolved: 13/Mar/20 10:29