Details
Bug
Status: Closed
Major
Resolution: Won't Fix
1.6.1
None
None
Flink 1.6.1 cluster with one TaskManager and one JobManager, plus Prometheus and Grafana, all running in a local Docker environment.
See sample project at: https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
Description
My setup uses the Prometheus reporter together with a custom histogram metric. After a while the histogram starts throwing exceptions (its implementation is rather poor). Once that happens, all metrics on the TaskManager that hosts the histogram stop being reported. The Prometheus logs show that requests to taskmanager:9249/metrics return an empty response while a metric is faulty.
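Since the ticket was closed as Won't Fix, one possible user-side workaround is to keep the faulty metric from throwing in the first place. The following is only a minimal sketch in plain Java: `GuardedValue` is a hypothetical helper, not a Flink class, and it assumes the metric's read path can be expressed as a `Supplier`. It returns the last successfully read value whenever the wrapped supplier throws.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Flink): wraps a value source that may throw
 * and falls back to the last value that was read successfully.
 */
public final class GuardedValue<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private volatile T lastGood;

    public GuardedValue(Supplier<T> delegate, T initial) {
        this.delegate = delegate;
        this.lastGood = initial;
    }

    @Override
    public T get() {
        try {
            // Remember the value so it can be reused if a later read fails.
            lastGood = delegate.get();
        } catch (RuntimeException e) {
            // Swallow the failure; the caller sees the previous good value instead.
        }
        return lastGood;
    }
}
```

A custom histogram or gauge could delegate its read methods to such a wrapper, so that an internal failure surfaces as a stale value rather than an exception inside the reporter.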
Expected:
A faulty metric implementation causes only that specific metric to stop being reported.
Actual:
A faulty metric causes all metrics on that TaskManager to stop being reported.
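The expected behavior can be illustrated with a small sketch of per-metric isolation during a scrape. This is not the Flink or Prometheus client implementation: `IsolatedScrape` and its `collect` method are hypothetical names, and metrics are modeled simply as named `Supplier<Number>` instances. Each metric is read inside its own try/catch, so one that throws is skipped while the others are still reported.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of a scrape that isolates failures per metric. */
public final class IsolatedScrape {

    private IsolatedScrape() {}

    /**
     * Reads every metric independently. A metric whose supplier throws is
     * left out of the result; all other metrics are still collected.
     */
    public static Map<String, Number> collect(Map<String, Supplier<Number>> metrics) {
        Map<String, Number> samples = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Number>> entry : metrics.entrySet()) {
            try {
                samples.put(entry.getKey(), entry.getValue().get());
            } catch (RuntimeException e) {
                // Only this metric is dropped; the loop continues with the next one.
            }
        }
        return samples;
    }
}
```

With three registered metrics of which the second throws, `collect` returns the first and third. The behavior described under "Actual" corresponds to the same loop without the per-entry try/catch, where a single exception aborts the whole response.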