Kafka / KAFKA-13572

Negative value for 'Preferred Replica Imbalance' metric


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 3.3.0, 3.2.1
    • Component/s: None
    • Labels: None

    Description

      A negative value (-822) has been observed for the metric kafka_controller_kafkacontroller_preferredreplicaimbalancecount. See the attached screenshot and the output below:

      $ curl -s http://localhost:9101/metrics | fgrep 'kafka_controller_kafkacontroller_preferredreplicaimbalancecount'
      # HELP kafka_controller_kafkacontroller_preferredreplicaimbalancecount Attribute exposed for management (kafka.controller<type=KafkaController, name=PreferredReplicaImbalanceCount><>Value)
      # TYPE kafka_controller_kafkacontroller_preferredreplicaimbalancecount gauge
      kafka_controller_kafkacontroller_preferredreplicaimbalancecount -822.0
      

      The issue appeared after an operation in which the number of partitions was increased for some topics, and other topics were deleted and re-created in order to decrease their number of partitions.

      I ran the following command to check whether there are any partitions where the preferred leader (the first broker in the Replicas list) is not the current Leader:

      % grep ".*Topic:.*Partition:.*Leader:.*Replicas:.*Isr:.*Offline:.*" kafka-topics_describe.out | awk '{print $6 " " $8}' | cut -d "," -f1 | awk '{print $0, ($1==$2?_:"NOT") "MATCHED"}'|grep NOT | wc -l
           0
      

      but could not find any such instances.
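
The same check can be written as a short script, which is easier to audit than the pipeline above. This is only a minimal sketch, assuming the describe output uses the usual `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...` layout; the function name `count_non_preferred_leaders` is illustrative and not part of any Kafka tooling.

```python
import re

# Matches one partition line of the describe output (layout is an assumption).
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\w+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def count_non_preferred_leaders(describe_output: str) -> int:
    """Count partitions whose current leader is not the first listed replica."""
    mismatches = 0
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # skip topic summary lines and blank lines
        preferred = match.group("replicas").split(",")[0]
        if match.group("leader") != preferred:
            mismatches += 1
    return mismatches
```

Calling it on the contents of kafka-topics_describe.out should return the same number as the `wc -l` at the end of the pipeline.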

      leader.imbalance.per.broker.percentage=2 is set on all brokers in the cluster, which means that an imbalance of up to 2% is allowed for preferred leaders. This is a valid value, so the setting should not contribute to a negative metric.
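
For context, the per-broker threshold is compared against a ratio rather than a raw count: of the partitions for which a broker is the preferred leader, what fraction is currently led by some other broker. The sketch below illustrates that calculation under simplifying assumptions (it ignores partitions that are being deleted or reassigned). The data model and the names `imbalance_ratio` and `needs_rebalance` are hypothetical, not Kafka APIs.

```python
from typing import Dict, List, Tuple

# (topic, partition) -> (current leader, ordered replica list); illustrative model.
Assignment = Dict[Tuple[str, int], Tuple[int, List[int]]]


def imbalance_ratio(assignment: Assignment, broker: int) -> float:
    """Fraction of the broker's preferred partitions that it does not lead."""
    preferred = [
        leader
        for leader, replicas in assignment.values()
        if replicas and replicas[0] == broker
    ]
    if not preferred:
        return 0.0
    not_led = sum(1 for leader in preferred if leader != broker)
    return not_led / len(preferred)


def needs_rebalance(assignment: Assignment, broker: int, percentage: int = 2) -> bool:
    """True when the ratio exceeds the configured percentage (2 in this report)."""
    return imbalance_ratio(assignment, broker) > percentage / 100
```

The ratio is always between 0 and 1, so a threshold of 2 can change when a rebalance is triggered but cannot by itself push a count below zero.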

      The metric appears to be decremented in the code here; however, it is not clear when it can become negative (i.e. decremented more often than it is incremented), given the absence of comments or DEBUG/TRACE-level logs in the code. One thing is certain: there is either no imbalance (0) or some imbalance (> 0), so a value < 0 does not make sense.

      FWIW, no other anomalies besides this have been detected.

      Considering these metrics get actively monitored, we should look at adding DEBUG/TRACE logging around the addition/subtraction of these metrics (and elsewhere where appropriate) to identify any potential issues.
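
The kind of logging proposed here could look roughly like the following. This is a hypothetical sketch of an incrementally maintained gauge, not the controller's actual code: the class `ImbalanceGauge`, its `adjust` method, and the log messages are invented for illustration. It logs every adjustment at DEBUG level and warns as soon as the value drops below zero, which would point at the decrement that has no matching increment.

```python
import logging

logger = logging.getLogger("imbalance_gauge")


class ImbalanceGauge:
    """Hypothetical counter that records why each adjustment happened."""

    def __init__(self) -> None:
        self.value = 0

    def adjust(self, delta: int, partition: str, reason: str) -> int:
        previous = self.value
        self.value += delta
        logger.debug(
            "imbalance count %d -> %d (delta=%+d, partition=%s, reason=%s)",
            previous, self.value, delta, partition, reason,
        )
        if self.value < 0:
            # A count of imbalanced partitions can never legitimately be negative.
            logger.warning(
                "imbalance count went negative (%d) after %s on %s",
                self.value, reason, partition,
            )
        return self.value
```

With logging like this in place, a decrement for a partition that was never counted as imbalanced would show up as a warning naming that partition and the operation that caused it.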

          People

            Assignee: Unassigned
            Reporter: Siddharth Ahuja (sahuja)