FLINK-22664: Task metrics are not properly unregistered during region failover



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.0, 1.12.0
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels:


      In the current implementation of AbstractPrometheusReporter, metrics with the same scopedMetricName share a single metric Collector. A HashMap named collectorsWithCountByMetricName tracks the reference count of each Collector, and a Collector is unregistered only when its reference count drops to 0.

      Suppose we have a Flink job with a single chained operator, and the execution failover-strategy is set to region.

      The following figure compares the number of metrics after region failover when this job runs on 2 TaskManagers with 1 slot each versus 1 TaskManager with 2 slots.

      Each inflection point on the graph represents a region failover. For a TaskManager running multiple tasks (slots), the number of metrics increases after each region failover.

      I deliberately constructed this case to illustrate the problem. During each region failover, the TaskManager restarts only some of its tasks, so the reference count of the tasks' metric Collector never drops to 0 and the Collector is never unregistered.
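
      The reference counting described above can be modelled with a minimal sketch. This is a simplified stand-in, not Flink's actual code: the class RefCountedCollectors, its methods, and the metric name used in main are hypothetical and only mirror the role of collectorsWithCountByMetricName. It shows that when two tasks on one TaskManager share a Collector, restarting only one of them never brings the count to 0, whereas with one task per TaskManager the Collector is unregistered on every failover.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of one Collector per scoped metric name with a reference count. */
class RefCountedCollectors {
    // scopedMetricName -> number of live metrics currently sharing that Collector
    private final Map<String, Integer> countByName = new HashMap<>();

    /** A task registers a metric; the first registration creates the shared entry. */
    void add(String scopedMetricName) {
        countByName.merge(scopedMetricName, 1, Integer::sum);
    }

    /** A task removes a metric; returns true only if the Collector was unregistered. */
    boolean remove(String scopedMetricName) {
        Integer count = countByName.get(scopedMetricName);
        if (count == null) {
            return false; // nothing registered under this name
        }
        if (count == 1) {
            countByName.remove(scopedMetricName); // last reference: unregister
            return true;
        }
        countByName.put(scopedMetricName, count - 1); // other tasks still use it
        return false;
    }

    boolean isRegistered(String scopedMetricName) {
        return countByName.containsKey(scopedMetricName);
    }

    int count(String scopedMetricName) {
        return countByName.getOrDefault(scopedMetricName, 0);
    }

    public static void main(String[] args) {
        String name = "flink_taskmanager_job_task_numRecordsIn"; // illustrative name only

        // 1 TaskManager with 2 slots: both tasks share one Collector.
        RefCountedCollectors twoSlots = new RefCountedCollectors();
        twoSlots.add(name);
        twoSlots.add(name);
        boolean unregisteredShared = twoSlots.remove(name); // failed task removes its metric
        twoSlots.add(name);                                 // restarted task registers again
        System.out.println("2 slots/TM, unregistered during failover: " + unregisteredShared);

        // 1 slot per TaskManager: the only task holds the only reference.
        RefCountedCollectors oneSlot = new RefCountedCollectors();
        oneSlot.add(name);
        boolean unregisteredSingle = oneSlot.remove(name);
        oneSlot.add(name);
        System.out.println("1 slot/TM, unregistered during failover: " + unregisteredSingle);
    }
}
```

      In this model the two-slot case keeps its count between 1 and 2 across the failover, so the unregister branch is never taken.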

      This problem puts a lot of pressure on our Prometheus instance. Please advise whether there is a good solution.



        1. Screen Shot 2021-05-14 at 5.40.22 PM.png
          365 kB
          Guokuai Huang
        2. Screen Shot 2021-05-14 at 2.51.04 PM.png
          27 kB
          Guokuai Huang



            • Assignee:
              guokuai.huang Guokuai Huang


              • Created: