We're running several clusters (mostly with 3 brokers) on Kafka 2.1.1, where we see at least 3 different symptoms, all resulting in broker/controller lockups.
We are pretty sure that the triggering cause for all these symptoms is a temporary outage (normally 3-5 minutes) of the Zookeeper cluster. The Linux VMs that the ZK nodes run on regularly get stalled for a couple of minutes. The ZK nodes always reunite very quickly and form a quorum after the situation clears, but the Kafka brokers (which run on the same Linux VMs) quite often show problems afterwards.
I've seen 3 different kinds of problems (this is why I put "reproduce" in quotes; I can never predict what will happen):
- the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some reason (that's the problem I originally described)
- the brokers all re-register and elect a new controller. But that new controller does not fully work. For example, it doesn't process partition reassignment requests and/or does not transfer partition leadership after I kill a broker
- the previous controller gets "dead-locked" (it has 3-4 of the important controller threads in a lock) and hence does not perform any of its controller duties. But it still regards itself as the valid controller and is accepted by the other brokers
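The first symptom can be spotted after a test run by comparing the IDs under /brokers/ids with the brokers that should be there. A minimal sketch, assuming a kazoo-style ZooKeeper client (any object with `get_children` and `get`), and assuming the /controller znode holds JSON with a `brokerid` field. The function name and the expected-ID list are hypothetical. It only reports who is registered and who claims to be controller; it cannot tell whether that controller is actually working.

```python
import json


def registration_report(zk, expected_ids):
    """Compare registered broker IDs with the expected ones.

    `zk` is assumed to behave like kazoo's KazooClient:
    get_children(path) returns a list of child names and
    get(path) returns a (bytes, stat) tuple.
    """
    registered = {int(child) for child in zk.get_children("/brokers/ids")}

    # Assumption: /controller exists and contains JSON with a "brokerid" key.
    data, _stat = zk.get("/controller")
    controller = json.loads(data.decode("utf-8")).get("brokerid")

    return {
        "missing": sorted(set(expected_ids) - registered),
        "controller": controller,
    }
```

With a real client this would be called as `registration_report(client, [1, 2, 3])`, where the broker IDs are whatever the cluster is configured with.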
I'll try to describe each of the problems in more detail below, and hope to be able to clearly separate them.
I'm able to provoke these problems in our DEV environment quite regularly using the following procedure:
- make sure all ZK nodes and Kafka brokers are stable and reacting normally
- freeze 2 out of 3 ZK nodes with kill -STOP for some minutes
- leave the Kafka brokers running; of course they will start complaining that they cannot reach ZK
- thaw the processes with kill -CONT
- now all Kafka brokers get notified that their ZK session has expired, and they start to reorganize the cluster
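The freeze/thaw part of this procedure can be scripted. A rough sketch, assuming the processes to freeze are on the local host and their PIDs are already known; the function name and parameters are made up for illustration. It does the same thing as kill -STOP followed by kill -CONT.

```python
import os
import signal
import time


def freeze_then_thaw(pids, freeze_seconds, send_signal=os.kill, sleep=time.sleep):
    """Send SIGSTOP to every PID, wait, then send SIGCONT.

    `send_signal` and `sleep` are injectable so the sequence can be
    checked without touching real processes.
    """
    frozen = []
    try:
        for pid in pids:
            send_signal(pid, signal.SIGSTOP)  # same as: kill -STOP <pid>
            frozen.append(pid)
        sleep(freeze_seconds)
    finally:
        # Always thaw whatever was frozen, even if the wait is interrupted.
        for pid in frozen:
            send_signal(pid, signal.SIGCONT)  # same as: kill -CONT <pid>
    return frozen
```

For example, `freeze_then_thaw([zk_pid], 240)` would stall one local ZK process for four minutes. Since the ZK nodes run on separate VMs, it would have to be started on each VM whose node should be frozen.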
In about 20% of the tests, I'm able to produce one of the symptoms above. I cannot predict which one, though. I sometimes vary this procedure by also freezing one Kafka broker (most often the controller), but so far I haven't been able to find a clear pattern or force one specific symptom.