[KAFKA-2082] Kafka Replication ends up in a bad state - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: 0.8.2.1
Fix Version/s: None
Component/s: replication
Labels:
- reliability
- zkclient-problems

Description

While running integration tests for Sarama (the Go client), we came across a pattern of connection losses that reliably puts Kafka into a bad state: several of the brokers start spinning, chewing ~30% CPU and spamming the logs with hundreds of thousands of lines like:

[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
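With hundreds of thousands of such lines, it can help to reduce the log to a per-partition count before inspecting a broker. The following is a minimal sketch in Go, not part of Sarama or Kafka: the helper name partitionKey and the regular expression are assumptions derived only from the sample lines above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

// leaderNotLocal matches the ReplicaManager warning shown above and captures
// the topic name and the partition number. The pattern is an assumption based
// on the sample log lines, not on Kafka's logging code.
var leaderNotLocal = regexp.MustCompile(
	`on partition \[([^,\]]+),(\d+)\] failed due to Leader not local`)

// partitionKey (hypothetical helper) returns "topic,partition" for a matching
// warning line, and false for any other line.
func partitionKey(line string) (string, bool) {
	m := leaderNotLocal.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1] + "," + m[2], true
}

func main() {
	// Stream the log from stdin so that very large files are not held in memory.
	counts := make(map[string]int)
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if key, ok := partitionKey(sc.Text()); ok {
			counts[key]++
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}

	// Print the counts in a stable order.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("[%s] %d\n", k, counts[k])
	}
}
```

Feed it a copy of the affected broker's log on standard input, for example `go run main.go < server.log`, where server.log is a placeholder file name.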

This can be easily and reliably reproduced using the test-jira-kafka-2082 branch of https://github.com/Shopify/sarama, which includes a Vagrant script for provisioning the appropriate cluster:

git clone https://github.com/Shopify/sarama.git
cd sarama
git checkout test-jira-kafka-2082
vagrant up
TEST_SEED=1427917826425719059 DEBUG=true go test -v

After the test finishes (it fails because the cluster ends up in a bad state), you can log into the cluster machine with vagrant ssh and inspect the bad nodes. The Vagrant script provisions five ZooKeeper nodes and five brokers in /opt/kafka-9091/ through /opt/kafka-9095/.

Additional context: the test produces continually to the cluster while randomly cutting and restoring ZooKeeper connections (all connections to ZooKeeper are run through a simple proxy on the same VM to make this easy). The majority of the time this works very well and does a good job of exercising our producer's retry and failover code. However, under certain patterns of connection loss (the TEST_SEED in the instructions is important), Kafka gets confused. The test never cuts more than two connections at a time, so ZooKeeper should always have quorum (at least three of the five nodes remain reachable), and the topic (with three replicas) should always be writable.
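The quorum expectation can be checked with a few lines of arithmetic. This is only a sketch, assuming that each cut connection isolates at most one ZooKeeper node and that quorum means a strict majority of the ensemble. The function names quorumSize and hasQuorum are made up for illustration and are not part of Sarama or ZooKeeper.

```go
package main

import "fmt"

// quorumSize returns the smallest strict majority of an ensemble of n nodes.
func quorumSize(n int) int {
	return n/2 + 1
}

// hasQuorum reports whether an ensemble of n nodes still has a majority
// after `cut` of them become unreachable (assumption: one node per cut).
func hasQuorum(n, cut int) bool {
	return n-cut >= quorumSize(n)
}

func main() {
	const zookeepers = 5 // ensemble size provisioned by the Vagrant script
	for cut := 0; cut <= 3; cut++ {
		fmt.Printf("cut=%d reachable=%d quorum=%v\n",
			cut, zookeepers-cut, hasQuorum(zookeepers, cut))
	}
}
```

Under these assumptions, a five-node ensemble keeps its majority with up to two nodes cut off and loses it at three, which matches the limit the test imposes.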

Completely restarting the cluster via vagrant reload seems to put it back into a sane state.
