Kafka / KAFKA-13572

Negative value for 'Preferred Replica Imbalance' metric

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 3.3.0, 3.2.1
    • Component/s: None
    • Labels: None

    Description

      A negative value (-822) has been observed for the metric kafka_controller_kafkacontroller_preferredreplicaimbalancecount. Please see the attached screenshot and the output below:

      $ curl -s http://localhost:9101/metrics | fgrep 'kafka_controller_kafkacontroller_preferredreplicaimbalancecount'
      # HELP kafka_controller_kafkacontroller_preferredreplicaimbalancecount Attribute exposed for management (kafka.controller<type=KafkaController, name=PreferredReplicaImbalanceCount><>Value)
      # TYPE kafka_controller_kafkacontroller_preferredreplicaimbalancecount gauge
      kafka_controller_kafkacontroller_preferredreplicaimbalancecount -822.0
      

      The issue appeared after an operation in which the number of partitions was increased for some topics, and some topics were deleted and re-created in order to decrease their number of partitions.

      I ran the following command to check whether there are any partitions where the preferred leader (the first broker in the Replicas list) is not the current leader:

      % grep ".*Topic:.*Partition:.*Leader:.*Replicas:.*Isr:.*Offline:.*" kafka-topics_describe.out | awk '{print $6 " " $8}' | cut -d "," -f1 | awk '{print $0, ($1==$2?_:"NOT") "MATCHED"}'|grep NOT | wc -l
           0
      

      but could not find any such instances.
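
      The same check can be sketched in Python, which also reports which partitions mismatch instead of only counting them. This is a minimal sketch: the field layout of the describe output is assumed from the grep pattern above, and the function name non_preferred_leaders is made up for illustration.

```python
import re

# Matches per-partition lines of `kafka-topics --describe` output. The layout
# (Topic / Partition / Leader / Replicas) is assumed from the grep pattern above.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\w+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def non_preferred_leaders(describe_output):
    """Return (topic, partition, leader, preferred) for every partition whose
    current leader is not the first broker in its replica list."""
    mismatches = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary lines and blank lines have no Leader field
        preferred = match["replicas"].split(",")[0]
        if match["leader"] != preferred:
            mismatches.append(
                (match["topic"], int(match["partition"]), match["leader"], preferred)
            )
    return mismatches


if __name__ == "__main__":
    # File name taken from the command above.
    with open("kafka-topics_describe.out") as handle:
        for row in non_preferred_leaders(handle.read()):
            print(row)
```

      An empty result corresponds to the count of 0 reported above.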

      leader.imbalance.per.broker.percentage=2 is set on all the brokers in the cluster, which means that each broker is allowed an imbalance of up to 2% for preferred leaders. This is a valid value, so the setting should not contribute to a negative metric.
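
      To illustrate why the threshold cannot produce a negative number, here is a simplified model of a per-broker imbalance check. It is a sketch of my understanding of how the percentage is applied, not the controller's code; the function names are hypothetical, and the strict "greater than" comparison is an assumption.

```python
from collections import defaultdict


def imbalance_ratio_per_broker(partitions):
    """partitions: iterable of (leader, replicas) pairs, where replicas[0] is
    the preferred leader. Returns, for each broker, the fraction of the
    partitions it is preferred for that are currently led by another broker."""
    preferred_total = defaultdict(int)
    led_elsewhere = defaultdict(int)
    for leader, replicas in partitions:
        preferred = replicas[0]
        preferred_total[preferred] += 1
        if leader != preferred:
            led_elsewhere[preferred] += 1
    return {
        broker: led_elsewhere[broker] / total
        for broker, total in preferred_total.items()
    }


def brokers_over_threshold(partitions, percentage):
    """Brokers whose imbalance ratio exceeds `percentage` (e.g. 2 for 2%).
    The strict comparison is an assumption of this sketch."""
    ratios = imbalance_ratio_per_broker(partitions)
    return sorted(b for b, ratio in ratios.items() if ratio > percentage / 100)
```

      Every ratio in this model is a count divided by a positive count, so it always lies between 0 and 1 regardless of the configured percentage.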

      The metric seems to be decremented in the code here, but it is not clear when it can become negative (i.e. decremented more often than incremented), since there are no comments or DEBUG/TRACE-level logs in that code. One thing is certain, though: there is either no imbalance (0) or some imbalance (> 0), so it does not make sense for the metric to be < 0.

      FWIW, no other anomalies besides this have been detected.

      Considering these metrics get actively monitored, we should look at adding DEBUG/TRACE logging around the addition/subtraction of these metrics (and elsewhere where appropriate) to identify any potential issues.
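
      As a rough illustration of the kind of logging meant here, the toy counter below logs every adjustment at DEBUG level and can be compared against a full recount. It is purely hypothetical and not Kafka's implementation; the class, its methods, and the forget flag (which deliberately injects a stale-state bug) are invented for this sketch, and the actual cause in the controller is not established in this report.

```python
import logging

log = logging.getLogger("imbalance_counter_sketch")


class ImbalanceCounter:
    """Toy incrementally maintained counter (hypothetical, not Kafka's code)."""

    def __init__(self):
        self.count = 0
        self.assignments = {}  # partition -> (leader, preferred leader)

    @staticmethod
    def _imbalanced(state):
        leader, preferred = state
        return leader != preferred

    def _adjust(self, delta, partition, reason):
        self.count += delta
        log.debug("count %+d -> %d (%s, %s)", delta, self.count, partition, reason)
        if self.count < 0:
            log.warning("count went negative after %s on %s", reason, partition)

    def update(self, partition, leader, preferred):
        old = self.assignments.get(partition)
        new = (leader, preferred)
        if old is not None and self._imbalanced(old):
            self._adjust(-1, partition, "old state replaced")
        if self._imbalanced(new):
            self._adjust(+1, partition, "new state imbalanced")
        self.assignments[partition] = new

    def remove(self, partition, forget=True):
        # forget=False simulates a removal path that decrements the counter
        # but leaves stale state behind, so a later removal decrements again.
        state = self.assignments.get(partition)
        if state is not None and self._imbalanced(state):
            self._adjust(-1, partition, "partition removed")
        if forget:
            self.assignments.pop(partition, None)

    def recount(self):
        """Recompute the value from scratch; it can never be negative."""
        return sum(1 for s in self.assignments.values() if self._imbalanced(s))
```

      In this model, removing the same imbalanced partition twice with stale state in between drives count to -1 while recount() still returns 0. A mismatch of that kind, together with the DEBUG lines, would point at the code path that decrements twice.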

            People

              Assignee: Unassigned
              Reporter: Siddharth Ahuja (sahuja)