[KAFKA-7865] Kafka Constant Consumer Errors for ~30 min after Network Blip - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.10.2.1
Fix Version/s: None
Component/s: consumer
Labels:
None

Description

We run Kafka v0.10.2.1 on AWS, backed by EBS, with 10 brokers and 5 ZooKeeper nodes. A few days ago we had a network blip lasting ~30-45 seconds. The interesting part is that all consumers coordinated by one of the brokers kept receiving error code 16 (NOT_COORDINATOR) for ~30-35 minutes before they eventually started receiving messages successfully.
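
To check which consumer groups map to the affected broker, it can help to compute the `__consumer_offsets` partition for each group id; the group's coordinator is the broker leading that partition. The following is a minimal sketch, assuming the broker derives the partition as the absolute value of `groupId.hashCode()` modulo `offsets.topic.num.partitions` (default 50). The class name and the group ids are hypothetical and only for illustration.

```java
public class CoordinatorPartition {
    // Assumed to mirror the broker's absolute-value helper:
    // Math.abs, except Integer.MIN_VALUE maps to 0.
    static int abs(int n) {
        return (n == Integer.MIN_VALUE) ? 0 : Math.abs(n);
    }

    // Partition of __consumer_offsets that holds this group's metadata.
    // The leader of that partition acts as the group coordinator.
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        return abs(groupId.hashCode()) % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        int partitions = 50; // assumed: default offsets.topic.num.partitions
        // Hypothetical group ids; substitute the groups that saw error code 16.
        String[] groups = {"billing-consumers", "audit-consumers"};
        for (String g : groups) {
            System.out.println(g + " -> __consumer_offsets-" + partitionFor(g, partitions));
        }
    }
}
```

If every affected group lands on partitions led by the same broker, that points at the coordinator state on that broker rather than at the clients.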

The broker itself was up and running, and its resource utilization (CPU, memory, disk, etc.) was fine. Under-replicated partitions and other metrics recovered within a minute, and all consumer groups coordinated by other brokers were fine. The broker logged the following error, but only during the blip; other brokers logged it as well and recovered within about a minute:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

It eventually recovered after ~30 minutes, but for a real-time messaging bus, 30 minutes is not so real-time.

Some of the questions we have are:
1. Why was this the only broker affected? Note: it was not the controller, and it did not see any more network issues than the others.
2. What made it recover? We did not change or restart anything.
3. Why did the client retries never work? The client was constantly retrying and kept getting the same error.
4. Why didn't we see any error logs?
5. Is this a known issue that has been fixed in later releases?
6. What can we do to mitigate this?

Are we running into something like this: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Note: Some of the other settings we have:
zookeeper.connection.timeout.ms=10000 // server.properties
zookeeper.connection.timeout.ms=6000 // consumer.properties

People

Assignee: Unassigned

Reporter: Aravind Velamur Srinivasan

Votes: 0

Watchers: 1

Dates

Created: 24/Jan/19 00:40

Updated: 24/Jan/19 00:40