  Flink / FLINK-22664

Task metrics are not properly unregistered during region failover


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.0, 1.12.0
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels: None

    Description

      In the current implementation of AbstractPrometheusReporter, metrics with the same scopedMetricName share the same metric Collector. At the same time, a HashMap named collectorsWithCountByMetricName is maintained to record the reference count of each Collector; a Collector is only unregistered once its reference count drops to 0.
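
      A simplified sketch of that reference-counting scheme, for illustration only (this is not Flink's actual AbstractPrometheusReporter code, and the Collector class below is just a stand-in for io.prometheus.client.Collector):

      import java.util.AbstractMap;
      import java.util.HashMap;
      import java.util.Map;

      // Illustrative only: one shared Collector plus a reference count per scopedMetricName.
      public class RefCountedCollectors {

          static class Collector {
              final String scopedMetricName;

              Collector(String scopedMetricName) {
                  this.scopedMetricName = scopedMetricName;
              }
          }

          // Mirrors the collectorsWithCountByMetricName map described above.
          private final Map<String, AbstractMap.SimpleEntry<Collector, Integer>>
                  collectorsWithCountByMetricName = new HashMap<>();

          void notifyOfAddedMetric(String scopedMetricName) {
              AbstractMap.SimpleEntry<Collector, Integer> entry =
                      collectorsWithCountByMetricName.get(scopedMetricName);
              if (entry == null) {
                  // First metric with this name: register a new Collector with count 1.
                  collectorsWithCountByMetricName.put(
                          scopedMetricName,
                          new AbstractMap.SimpleEntry<>(new Collector(scopedMetricName), 1));
              } else {
                  // Another metric shares the name: reuse the Collector, bump the count.
                  entry.setValue(entry.getValue() + 1);
              }
          }

          void notifyOfRemovedMetric(String scopedMetricName) {
              AbstractMap.SimpleEntry<Collector, Integer> entry =
                      collectorsWithCountByMetricName.get(scopedMetricName);
              if (entry == null) {
                  return;
              }
              if (entry.getValue() == 1) {
                  // Last reference gone: only now would the Collector be unregistered.
                  collectorsWithCountByMetricName.remove(scopedMetricName);
              } else {
                  entry.setValue(entry.getValue() - 1);
              }
          }
      }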

      Suppose we have a Flink job with a single chained operator, and the execution failover strategy is set to region.
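
      For reference, the scenario assumes the region failover strategy is configured in flink-conf.yaml:

      jobmanager.execution.failover-strategy: region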

      The following figure (see the attached screenshots) compares the number of metrics after region failovers when this job runs on 2 TaskManagers with 1 slot per TM versus on 1 TaskManager with 2 slots per TM.

      Each inflection point on the graph represents a region failover. For a TaskManager running multiple tasks (slots), the number of metrics increases after each region failover.

      This is a case I deliberately constructed to illustrate the problem: during each region failover the TaskManager only needs to restart some of its tasks, so the reference count of a task's metric Collector never drops to 0 and the Collector is never unregistered.
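
      To make that sequence concrete, a hedged walkthrough using the sketch above (the metric name is illustrative, and the method names mimic but are not Flink's exact reporter API):

      RefCountedCollectors reporter = new RefCountedCollectors();
      String name = "flink_taskmanager_job_task_operator_numRecordsIn"; // illustrative

      // Two tasks in the same TaskManager register a metric with the same scoped name -> count = 2.
      reporter.notifyOfAddedMetric(name);
      reporter.notifyOfAddedMetric(name);

      // A region failover restarts only one of the two tasks:
      reporter.notifyOfRemovedMetric(name); // count 2 -> 1, never reaches 0
      reporter.notifyOfAddedMetric(name);   // count 1 -> 2 again

      // The shared Collector is never unregistered, so the series it exposes
      // keep accumulating across failovers.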

      This problem has put a lot of pressure on our Prometheus; please see whether there is a good solution.

       

      Attachments

        1. Screen Shot 2021-05-14 at 5.40.22 PM.png
          365 kB
          Guokuai Huang
        2. Screen Shot 2021-05-14 at 2.51.04 PM.png
          27 kB
          Guokuai Huang


          People

            Assignee: Unassigned
            Reporter: Guokuai Huang (guokuai.huang)
            Votes: 0
            Watchers: 3
