Affects Version/s: 2.2.0
Fix Version/s: None
We are running Kafka in production with 5 brokers and 3 ZooKeeper nodes. Both Kafka and ZooKeeper run in Kubernetes, with storage managed by PVCs backed by NFS. We are using a topic with 60 partitions.
The cluster had been running successfully for almost 50 days since the last restart. Last week (11/28) two brokers went down. The team is still investigating the root cause of the broker failures.
Since we are using K8s, the brokers came back up immediately (in less than 5 minutes). However, the producer and consumer applications then had issues downloading metadata. Please check the attached images.
We enabled debug logs for one of the applications, and it appears the Kafka brokers are returning metadata with a leader_epoch value of 0, whereas the client's metadata cache had it at 6 for most of the partitions.
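The behavior above matches the client-side leader-epoch check introduced by KIP-320: a client remembers the highest leader_epoch it has seen per partition and ignores metadata carrying an older epoch. A minimal sketch of that comparison (function name and structure are illustrative, not the actual Kafka client internals):

```python
# Hypothetical sketch of the KIP-320 client-side metadata validation:
# metadata whose leader_epoch is older than the cached one is rejected,
# so the client keeps the stale entry and retries the fetch.

def should_accept_metadata(cached_epoch, response_epoch):
    """Accept partition metadata only if its leader_epoch is at least
    as new as the one already cached."""
    if cached_epoch is None:        # first fetch: nothing cached yet
        return True
    return response_epoch >= cached_epoch

# Scenario from this report: the cache holds epoch 6, but after the broker
# restart the brokers return epoch 0, so the update is rejected.
assert should_accept_metadata(6, 0) is False   # running app: metadata ignored
assert should_accept_metadata(None, 0) is True  # freshly restarted app: accepted
```

This would also explain why restarted applications recover immediately: with an empty cache there is no newer epoch to conflict with.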
Eventually we were forced to restart all the producer apps (around 35-40 microservices). After restarting, they were all able to download the metadata without issue (since it was their first fetch, there was no cached epoch to conflict with) and to produce messages.
As part of troubleshooting, we checked the ZooKeeper znodes registered by Kafka and saw that leader_epoch had been reset to 0 for almost all partitions. We also checked another topic used by other apps: its leader_epoch was in good shape, and ctime and mtime were updated correctly. Please check the attached screenshots.
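For reference, Kafka stores the per-partition leader state in ZooKeeper under /brokers/topics/&lt;topic&gt;/partitions/&lt;n&gt;/state as a JSON blob, which is where we read the leader_epoch values. A small sketch of decoding such a payload (the sample values are illustrative, not taken from our cluster):

```python
import json

# Illustrative payload in the shape Kafka writes to the partition
# state znode; the leader_epoch field is the value that had been
# reset to 0 for almost all partitions in our cluster.
sample_state = (
    '{"controller_epoch":12,"leader":3,"version":1,'
    '"leader_epoch":0,"isr":[3,1]}'
)

state = json.loads(sample_state)
print(state["leader_epoch"])  # -> 0
print(state["isr"])           # -> [3, 1]
```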
Please refer to the Stack Overflow issue we have reported: