KAFKA-1100: metrics shouldn't have generation/timestamp specific names

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.2.1
    • Component/s: None
    • Labels: None

      Description

      I've noticed that there are several metrics that seem useful for monitoring over time, but which contain generational timestamps in the metric name.

      We use the yammer metrics libraries to send metrics data from a background thread every 10 seconds (to kafka, actually), and the data eventually ends up in a metrics database (graphite, opentsdb). The metrics then get graphed via a UI, and we can see metrics going way back, etc.

      Unfortunately, many of the metrics coming from kafka seem to have metric names that change any time the server or consumer is restarted, which makes it hard to easily create graphs over long periods of time (spanning app restarts).

      For example:

      names like:
      kafka.consumer.FetchRequestAndResponseMetrics....square-1371718712833-e9bb4d10-0-508818741-AllBrokersFetchRequestRateAndTimeMs

      or:
      kafka.consumer.ZookeeperConsumerConnector...topicName.....square-1373476779391-78aa2e83-0-FetchQueueSize

      In our staging environment, our servers are on regular auto-deploy cycles (they restart every few hours), so it's simply not usable longitudinally to have metric names constantly changing like this.

      Is there something that can easily be done? Is it really necessary to have so much cryptic info in the metric name?

          Activity

          Manikumar Reddy added a comment -

          This got fixed in KAFKA-1481. Hence closing the issue.

          Vladimir Tretyakov added a comment - edited

          Right Otis Gospodnetic, names will be:

          kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchRequestRateAndTimeMs,clientId=af_servers,allBrokers=true
          
          kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchRequestRateAndTimeMs,clientId=af_servers,brokerHost=wawanawna,brokerPort=9092
          
          kafka.consumer:type=ZookeeperConsumerConnector,name=FetchQueueSize,clientId=af_servers,topic=spm_topic,threadId=0
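Since the new-style names above are standard JMX ObjectNames, the key/value properties can be parsed programmatically rather than by string-splitting. A minimal sketch (no live MBean server is queried here; the names are copied from the comment above):

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class NewStyleNames {
    // Extract a single property from a new-style Kafka metric name.
    public static String clientIdOf(String name) throws MalformedObjectNameException {
        return new ObjectName(name).getKeyProperty("clientId");
    }

    public static void main(String[] args) throws Exception {
        ObjectName n = new ObjectName(
            "kafka.consumer:type=ZookeeperConsumerConnector,name=FetchQueueSize,"
            + "clientId=af_servers,topic=spm_topic,threadId=0");
        System.out.println(n.getDomain());                 // kafka.consumer
        System.out.println(n.getKeyProperty("clientId"));  // af_servers
        System.out.println(n.getKeyProperty("topic"));     // spm_topic
    }
}
```

Because every generational detail now lives in a named property rather than in the name itself, dropping or aggregating over a property is trivial.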
          
          Otis Gospodnetic added a comment -

          I didn't check Kafka MBeans after the latest KAFKA-1481 patch, but I think they no longer contain generation/timestamps in them, right Vladimir Tretyakov? If that's true, then maybe KAFKA-1481 will fix this issue, too?

          Otis Gospodnetic added a comment -

          Uhuh, long time!
          There are other similar issues with metrics/beans, like http://search-hadoop.com/m/4TaT4lonIW&subj=How+to+parse+some+of+JMX+Bean+s+names which looks almost like a blocker for anyone trying to implement a Kafka monitoring tool that everyone could use (i.e. even those who have servers, topics, etc. with dashes in their names).

          Jun Rao added a comment -

          That's probably 3-4 months away.

          Jason Rosenberg added a comment -

          what's the timeline for 0.9?

          Jun Rao added a comment -

          We have started developing the new consumer (for 0.9). Perhaps we should just fix the metric issue in the new consumer.

          Otis Gospodnetic added a comment - edited

          Jun Rao Any chance we could set Fix Version to 0.8.2 for this one?

          Otis Gospodnetic added a comment - edited

          I'm interested, too, so we can add Kafka 0.8 metrics support to SPM for Kafka.

          Could this go in 0.8.2 by any chance?

          Jason Rosenberg added a comment -

          status?

          Jason Rosenberg added a comment -

          can we think about this for 0.8.1?

          Jun Rao added a comment -

          Yes, I think this is too late for 0.8 final.

          Otis Gospodnetic added a comment -

          This sounds like something that would be good to have in 0.8 final, though it may be too late for that?

          Swapnil Ghike added a comment -

          That makes sense, Joel. We could also use the clientId to differentiate between two consumerConnectors that start up on the same host with the same group.

          Joel Koshy added a comment -

          That's a good point - we don't need it to be that way. The metric names that you referred to are derived from the consumer's registration in zookeeper. There are a couple of cleanup tasks we need to do for mbeans especially wrt consumers:

          • The names need not include timestamps. The reason we have timestamps and a hash in there is if you were to bring up two consumers under the same group on the same host at nearly the same time their registration would collide in zookeeper. Realistically this is something that only happens in system tests so it should be fine to drop the timestamp and hash for metrics registration.
          • Metrics are not de-registered on a rebalance/shutdown. I think there is already a jira for the shutdown case, but I'm compiling a list of other shortcomings and will file an umbrella jira to cover most of these issues.
          • I think the deregistration issues affect replica fetchers as well (need to check). i.e., if a broker transitions from a follower to leader for a partition, the follower metrics for that partition need to be de-registered.
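The first bullet above amounts to keeping only the stable identifiers in the metric scope. A tiny before/after illustration (the helper names here are hypothetical, not Kafka code):

```java
// Hypothetical sketch of the naming cleanup described above: the consumer's
// zookeeper registration id embeds host, startup timestamp, and a hash to
// avoid collisions, but the metric name only needs the parts that survive a
// restart (group and clientId).
public class MetricScope {
    // old style: scope derived from the full zookeeper registration id
    static String registrationId(String host, long startupMs, String hash) {
        return host + "-" + startupMs + "-" + hash;
    }

    // proposed style: only stable identifiers, so the name is the same
    // across restarts and can be graphed as one series
    static String metricScope(String group, String clientId) {
        return group + "." + clientId;
    }

    public static void main(String[] args) {
        System.out.println(registrationId("square", 1371718712833L, "e9bb4d10"));
        System.out.println(metricScope("spm-group", "af_servers"));
    }
}
```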
          Jason Rosenberg added a comment -

          Hi Swapnil,

          Unfortunately, we aren't using mbeans, but using the yammer MetricsRegistry and a reporter class based on com.yammer.metrics.reporting.AbstractPollingReporter.

          This creates a background thread that wakes up every 10 seconds, takes all the metrics in the registry (which is essentially all the yammer metric mbeans), and sends them as messages over kafka. We then have time-series DBs which store these. Unfortunately, the time-series DBs are not sophisticated enough to allow wild-card querying....

          Is there a fundamental reason for having those cryptic metric names?
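The reporter pattern Jason describes can be sketched roughly as below: a scheduled thread snapshots every metric in a registry and hands each reading to a sink (in his setup, a Kafka producer). A plain map of gauges stands in for the yammer MetricsRegistry, and the sink interface is an assumption for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

// Minimal polling-reporter sketch (not the yammer AbstractPollingReporter API).
public class PollingReporter {
    private final Map<String, DoubleSupplier> registry = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void register(String name, DoubleSupplier gauge) {
        registry.put(name, gauge);
    }

    // Snapshot every registered metric and pass "name=value" lines to the sink.
    public void pollOnce(Consumer<String> sink) {
        registry.forEach((name, gauge) ->
            sink.accept(name + "=" + gauge.getAsDouble()));
    }

    // Wake every periodSeconds and report, like the 10-second thread described.
    public void start(Consumer<String> sink, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> pollOnce(sink),
            periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```

The key point for this issue: whatever name `register` is called with becomes the time-series key downstream, so any generational component in the name fragments the series on every restart.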

          Swapnil Ghike added a comment -

          Hi Jason, at LinkedIn, we use wildcards/regexes to create graphs from such mbeans. Would you be able to do something similar?
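A sketch of the wildcard/regex approach Swapnil mentions, applied to the generational names from this issue: capture the stable suffix so readings from different consumer generations collapse into one series. The pattern is an assumption based on the example names quoted earlier, not a known LinkedIn recipe:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetricRegex {
    // Matches "...-<13-digit startup ts>-<8-hex hash>-...-<MetricName>" and
    // captures the trailing metric name, which is stable across restarts.
    private static final Pattern GENERATIONAL = Pattern.compile(
        ".*-\\d{13}-[0-9a-f]{8}-.*?([A-Za-z]+)$");

    // Returns the stable suffix, or the input unchanged if it doesn't match.
    public static String stableSuffix(String name) {
        Matcher m = GENERATIONAL.matcher(name);
        return m.matches() ? m.group(1) : name;
    }

    public static void main(String[] args) {
        System.out.println(stableSuffix(
            "kafka.consumer.FetchRequestAndResponseMetrics"
            + ".square-1371718712833-e9bb4d10-0-508818741"
            + "-AllBrokersFetchRequestRateAndTimeMs"));
    }
}
```

This works for graphing but, as the thread notes, breaks down once hosts or topics contain dashes themselves, which is why fixing the names at the source is the better answer.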


  People

    • Assignee: Unassigned
    • Reporter: Jason Rosenberg
    • Votes: 1
    • Watchers: 9