Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-8656

Kafka Consumer Record Latency Metric



    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • metrics
    • None


      Consumer lag is a useful metric to monitor how many records are queued to be processed.  We can look at individual lag per partition or we may aggregate metrics. For example, we may want to monitor what the maximum lag of any particular partition in our consumer subscription so we can identify hot partitions, caused by an insufficient producing partitioning strategy.  We may want to monitor a sum of lag across all partitions so we have a sense as to our total backlog of messages to consume. Lag in offsets is useful when you have a good understanding of your messages and processing characteristics, but it doesn’t tell us how far behind in time we are.  This is known as wait time in queueing theory, or more informally it’s referred to as latency.

      The latency of a message can be defined as the difference between when that message was first produced to when the message is received by a consumer.  The latency of records in a partition correlates with lag, but a larger lag doesn’t necessarily mean a larger latency. For example, a topic consumed by two separate application consumer groups A and B may have similar lag, but different latency per partition.  Application A is a consumer which performs CPU intensive business logic on each message it receives. It’s distributed across many consumer group members to handle the load quickly enough, but since its processing time is slower, it takes longer to process each message per partition.  Meanwhile, Application B is a consumer which performs a simple ETL operation to land streaming data in another system, such as HDFS. It may have similar lag to Application A, but because it has a faster processing time its latency per partition is significantly less.

      If the Kafka Consumer reported a latency metric it would be easier to build Service Level Agreements (SLAs) based on non-functional requirements of the streaming system.  For example, the system must never have a latency of greater than 10 minutes. This SLA could be used in monitoring alerts or as input to automatic scaling solutions.






            seglo Sean Glover
            seglo Sean Glover
            7 Vote for this issue
            9 Start watching this issue