[KAFKA-8151] Broker hangs and lockups after Zookeeper outages - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: controller, core, zkclient
Labels:
None

Description

We're running several clusters (mostly with 3 brokers) with 2.1.1, where we see at least 3 different symptoms, all resulting on broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms are temporary (for 3-5 minutes normally) of the Zookeeper cluster. The Linux VMs where the ZK nodes run on regularly get stalled for a couple of minutes. The ZK nodes always very quickly reunite and build a Quorum after the situation clears, but the Kafka brokers (which run on then same Linux VMs) quite often show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in quotes, I can never predict what will happen)

the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some reason (that's the problem I originally described)
the brokers all re-register and re-elect a new controller. But that new controller does not fully work. For example it doesn't process partition reassignment requests and or does not transfer partition leadership after I kill a broker
the previous controller gets "dead-locked" (it has 3-4 of the important controller threads in a lock) and hence does not perform any of it's controller duties. But it regards itsself still as the valid controller and is accepted by the other brokers

I'll try to describe each one of the problems in more detail below, and hope to be able to cleary separate them.

I'm able to provoke these problems in our DEV environment quite regularly using the following procedure

make sure all ZK nodes and Kafka brokers are stable and reacting normally
freeze 2 out of 3 ZK nodes with kill -STOP for some minutes
let the Kafka broker running, of course they will start complaining to be unable to reach ZK
thaw the processes with kill -CONT
now all Kafka brokers get notified that their ZK session has expired, and they start to reorganize the cluster

In about 20% of the tests, I'm able to produce one of the symptoms above. I can not predict which one though. I'm varying this procedure sometimes by also freezing one Kafka broker (most often the controller), but until now I haven't been able to create a clear pattern or really force one specific symptom

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

symptom3_lxgurten_kafka_dump1.txt
23/Mar/19 09:52
88 kB
Joe Ammann
symptom3_lxgurten_kafka_dump2.txt
23/Mar/19 09:52
89 kB
Joe Ammann
symptom3_lxgurten_kafka_dump3.txt
23/Mar/19 09:52
88 kB
Joe Ammann

Activity

People

Assignee:: Unassigned

Reporter:: Joe Ammann

Votes:: 3 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 23/Mar/19 09:30

Updated:: 07/Apr/19 14:12