KAFKA-7865

Kafka Constant Consumer Errors for ~30 min after Network Blip


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.2.1
    • Fix Version/s: None
    • Component/s: consumer
    • Labels: None

    Description

      We are running Kafka v0.10.2.1 on AWS, backed by EBS, with 10 brokers (and 5 ZooKeeper nodes). A few days ago we had a network blip lasting ~30-45 seconds. The interesting part is that the consumers coordinated by one of the brokers all kept getting error code 16 (NOT_COORDINATOR) for ~30-35 minutes before eventually receiving messages successfully.

      The broker itself was up and running, and its resource utilization (CPU, memory, disk, etc.) was fine. In addition, the under-replicated partitions and other indicators recovered within a minute, and all the consumer groups coordinated by other brokers were fine as well. The broker logged errors like the following, but only during the blip; other brokers saw the same errors and recovered in about a minute:

      org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
      

      Eventually, after ~30 minutes, it recovered, but for a real-time messaging bus, 30 minutes is not so real-time.

      Some of the questions we have are:
      1. Why was this the only broker affected? Note: it was not the controller, and it didn't see any more network issues than the others.
      2. What made it recover? We didn't change or restart anything.
      3. Why did the client retries never work? The client was constantly retrying and kept getting the same error.
      4. Why didn't we notice any error logs either?
      5. Is this a known issue that has been fixed in later releases?
      6. What can we do to mitigate this?
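
      For question 1, one way to narrow things down may be to check which __consumer_offsets partition each affected consumer group maps to, since a group's coordinator is the broker that leads that partition. The following is a minimal sketch, not Kafka code: it assumes the broker maps a group to a partition as abs(groupId.hashCode) % offsets.topic.num.partitions (50 by default), and the function names and the group ids in the usage example are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() (32-bit signed, over UTF-16 code units)."""
    h = 0
    data = s.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]   # one UTF-16 code unit
        h = (31 * h + unit) & 0xFFFFFFFF      # wrap to 32 bits like Java int
    # Reinterpret the unsigned 32-bit value as a signed int
    return h - (1 << 32) if h >= (1 << 31) else h


def kafka_abs(n: int) -> int:
    """Absolute value that maps Integer.MIN_VALUE to 0 (assumed broker behaviour)."""
    return 0 if n == -(1 << 31) else abs(n)


def coordinator_partition(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping from a group id to its __consumer_offsets partition."""
    return kafka_abs(java_string_hash(group_id)) % num_partitions


if __name__ == "__main__":
    # Hypothetical group ids, for illustration only
    for group in ["billing-consumers", "audit-consumers"]:
        print(group, "->", coordinator_partition(group))
```

      If the affected groups all map to partitions led by the one broker, comparing that broker's leadership of those __consumer_offsets partitions before and after the blip could help explain why only its groups were affected.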

      Are we running into something like this: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

      Note: Some of the other settings we have:
      zookeeper.connection.timeout.ms=10000 // server.properties
      zookeeper.connection.timeout.ms=6000 // consumer.properties

People

    • Assignee: Unassigned
    • Reporter: Aravind Velamur Srinivasan (araviinus)