  Flink / FLINK-22664

Task metrics are not properly unregistered during region failover


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.0, 1.12.0
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels: None

    Description

      In the current implementation of AbstractPrometheusReporter, metrics with the same scopedMetricName share the same metric Collector. At the same time, a HashMap named collectorsWithCountByMetricName is maintained to record the reference count of each Collector; a Collector is only unregistered once its reference count drops to 0.
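
      A simplified sketch of that reference-counting scheme, for illustration only (this is not Flink's actual AbstractPrometheusReporter code, and the Collector class below is just a stand-in for io.prometheus.client.Collector):

      import java.util.AbstractMap;
      import java.util.HashMap;
      import java.util.Map;

      // Illustrative only: one shared Collector plus a reference count per scopedMetricName.
      public class RefCountedCollectors {

          static class Collector {
              final String scopedMetricName;

              Collector(String scopedMetricName) {
                  this.scopedMetricName = scopedMetricName;
              }
          }

          // Mirrors the collectorsWithCountByMetricName map described above.
          private final Map<String, AbstractMap.SimpleEntry<Collector, Integer>>
                  collectorsWithCountByMetricName = new HashMap<>();

          void notifyOfAddedMetric(String scopedMetricName) {
              AbstractMap.SimpleEntry<Collector, Integer> entry =
                      collectorsWithCountByMetricName.get(scopedMetricName);
              if (entry == null) {
                  // First metric with this name: register a new Collector with count 1.
                  collectorsWithCountByMetricName.put(
                          scopedMetricName,
                          new AbstractMap.SimpleEntry<>(new Collector(scopedMetricName), 1));
              } else {
                  // Another metric shares the name: reuse the Collector, bump the count.
                  entry.setValue(entry.getValue() + 1);
              }
          }

          void notifyOfRemovedMetric(String scopedMetricName) {
              AbstractMap.SimpleEntry<Collector, Integer> entry =
                      collectorsWithCountByMetricName.get(scopedMetricName);
              if (entry == null) {
                  return;
              }
              if (entry.getValue() == 1) {
                  // Last reference gone: only now would the Collector be unregistered.
                  collectorsWithCountByMetricName.remove(scopedMetricName);
              } else {
                  entry.setValue(entry.getValue() - 1);
              }
          }
      }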

      Suppose we have a Flink job with a single chained operator, and the execution failover strategy is set to region.
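
      For reference, the scenario assumes the region failover strategy is configured in flink-conf.yaml:

      jobmanager.execution.failover-strategy: region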

      The following figure (see the attached screenshots) compares the number of metrics after region failovers when this job runs on 2 TaskManagers with 1 slot per TM versus on 1 TaskManager with 2 slots per TM.

      Each inflection point on the graph represents a region failover. For a TaskManager running multiple tasks (slots), the number of metrics increases after each region failover.

      This is a case I deliberately constructed to illustrate the problem: during each region failover the TaskManager only needs to restart some of its tasks, so the reference count of a task's metric Collector never drops to 0 and the Collector is never unregistered.
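
      To make that sequence concrete, a hedged walkthrough using the sketch above (the metric name is illustrative, and the method names mimic but are not Flink's exact reporter API):

      RefCountedCollectors reporter = new RefCountedCollectors();
      String name = "flink_taskmanager_job_task_operator_numRecordsIn"; // illustrative

      // Two tasks in the same TaskManager register a metric with the same scoped name -> count = 2.
      reporter.notifyOfAddedMetric(name);
      reporter.notifyOfAddedMetric(name);

      // A region failover restarts only one of the two tasks:
      reporter.notifyOfRemovedMetric(name); // count 2 -> 1, never reaches 0
      reporter.notifyOfAddedMetric(name);   // count 1 -> 2 again

      // The shared Collector is never unregistered, so the series it exposes
      // keep accumulating across failovers.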

      This problem has put a lot of pressure on our Prometheus; please see whether there is a good solution.

       

      Attachments

        1. Screen Shot 2021-05-14 at 5.40.22 PM.png
          365 kB
          Guokuai Huang
        2. Screen Shot 2021-05-14 at 2.51.04 PM.png
          27 kB
          Guokuai Huang


          People

            Assignee: Unassigned
            Reporter: Guokuai Huang (guokuai.huang)
            Votes: 0
            Watchers: 3
