We're running Kafka 2.4 and facing a pretty strange situation.
Let's say there were three brokers in the cluster: 0, 1, and 2. Then:
1. Broker 3 was added.
2. Partitions were reassigned from broker 0 to broker 3.
3. Broker 0 was shut down (not gracefully) and removed from the cluster.
4. We see the following state in ZooKeeper:
This means the dead broker 0 remains in the partitions' ISR. A large share of the partitions in the cluster have this issue.
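To gauge how many partitions are affected, the check can be scripted. Below is a minimal sketch, assuming the partition state JSON has already been read from `/brokers/topics/<topic>/partitions/<n>/state` and the live broker ids from `/brokers/ids`. The function name `stale_isr_entries` and the sample data are hypothetical and are not taken from the affected cluster.

```python
import json


def stale_isr_entries(partition_states, live_brokers):
    """Return {partition: [dead broker ids]} for ISRs that list non-live brokers.

    partition_states maps a partition name to the raw JSON string stored in
    its ZooKeeper state znode; live_brokers is an iterable of broker ids that
    are currently registered.
    """
    live = set(live_brokers)
    stale = {}
    for partition, raw_state in partition_states.items():
        isr = json.loads(raw_state)["isr"]
        dead = [broker for broker in isr if broker not in live]
        if dead:
            stale[partition] = dead
    return stale


# Hypothetical sample data: broker 0 is gone but still listed in one ISR.
sample_states = {
    "my-topic-0": '{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":3,"isr":[0,1,3]}',
    "my-topic-1": '{"controller_epoch":1,"leader":2,"version":1,"leader_epoch":2,"isr":[2,3]}',
}

print(stale_isr_entries(sample_states, live_brokers=[1, 2, 3]))
```

The sketch deliberately avoids a ZooKeeper client so it stays self-contained; how the znodes are fetched is left open.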
This is actually causing errors:
This means the isr-expiration task is effectively no longer working.
I suspect that this was introduced by this commit (line selected).
Unfortunately, I haven't been able to reproduce this in isolation.
Any hints about how to reproduce (so I can write a patch) or mitigate the issue on a running cluster are welcome.
Generally, I assume the problem would be fixed by not throwing ReplicaNotAvailableException for a dead (i.e. non-existent) broker, and instead considering it out of sync and removing it from the ISR.
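To make the idea concrete, here is a rough model of the behaviour I have in mind. This is not Kafka's actual code: the function names, the `last_caught_up_ms` map, and the sample values are all hypothetical, and `max_lag_ms` merely stands in for `replica.lag.time.max.ms`. The only point is that a broker missing from the live set is treated as out of sync instead of raising an error.

```python
def out_of_sync_replicas(isr, leader, live_brokers, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the set of ISR members that should be dropped.

    A follower is out of sync if its broker is not alive (the proposed
    change: no exception, just treat it as lagging) or if it has not
    caught up with the leader within max_lag_ms.
    """
    live = set(live_brokers)
    lagging = set()
    for broker in isr:
        if broker == leader:
            continue  # the leader is always in sync with itself
        if broker not in live:
            lagging.add(broker)  # dead broker: out of sync by definition
            continue
        caught_up = last_caught_up_ms.get(broker)
        if caught_up is None or now_ms - caught_up > max_lag_ms:
            lagging.add(broker)
    return lagging


def shrink_isr(isr, leader, live_brokers, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the ISR with out-of-sync replicas removed, preserving order."""
    lagging = out_of_sync_replicas(
        isr, leader, live_brokers, last_caught_up_ms, now_ms, max_lag_ms
    )
    return [broker for broker in isr if broker not in lagging]


# Hypothetical example: broker 0 is dead, broker 3 caught up 100 ms ago.
new_isr = shrink_isr(
    isr=[0, 1, 3],
    leader=1,
    live_brokers=[1, 2, 3],
    last_caught_up_ms={3: 9900},
    now_ms=10000,
    max_lag_ms=500,
)
print(new_isr)
```

With this behaviour the expiration pass would complete for every partition, rather than aborting on the first ISR that references a removed broker.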