Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 2.4.0
Fix Version/s: None
Component/s: None
Labels: Important
Description
I deployed three broker instances and found that the client was suddenly unable to consume data from certain topic partitions. I first logged in to the broker corresponding to the consumer group and ran the following command to describe the group:
./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9093 --describe --group mygroup
It failed with the following error:
Error: Executing consumer group command failed due to org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The coordinator is loading and hence can't process requests.
I then found that the broker appeared to be stuck in a loop and never left the loading state. Using the top command, I also saw that the "group-metadata-manager-0" thread was constantly consuming 100% CPU. The loop never terminated, so topic partition data on that node could not be consumed. At this point, I suspected that the issue was related to the __consumer_offsets partition data file being loaded by this thread.
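To narrow down which __consumer_offsets partition the thread was loading for this group, the partition can be estimated from the group id. The sketch below assumes the coordinator maps a group to a partition as the non-negative hash of the group id modulo offsets.topic.num.partitions (default 50), which is my understanding of the broker's behaviour and should be checked against the version in use. The class name OffsetsPartitionFor and its methods are hypothetical helpers, not part of Kafka.

```java
public class OffsetsPartitionFor {

    // Absolute value that maps Integer.MIN_VALUE to 0, because
    // Math.abs(Integer.MIN_VALUE) is still negative in Java.
    static int nonNegative(int n) {
        return (n == Integer.MIN_VALUE) ? 0 : Math.abs(n);
    }

    // Assumed mapping: abs(groupId.hashCode()) % number of __consumer_offsets partitions.
    static int partitionFor(String groupId, int numPartitions) {
        return nonNegative(groupId.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        // Defaults: the group from this report, and the default value of
        // offsets.topic.num.partitions (an assumption about this cluster's config).
        String groupId = args.length > 0 ? args[0] : "mygroup";
        int numPartitions = args.length > 1 ? Integer.parseInt(args[1]) : 50;
        System.out.println("__consumer_offsets-" + partitionFor(groupId, numPartitions));
    }
}
```

Running it as `java OffsetsPartitionFor.java mygroup 50` prints the name of the partition directory to look for under the broker's log directory. If the cluster overrides offsets.topic.num.partitions, pass that value as the second argument.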
Finally, after I restarted the broker instance, everything returned to normal. This is strange: if there were a problem with the __consumer_offsets partition data file, the broker should also have failed to start. Why was it able to recover automatically after a restart? And what caused the endless loop while loading the __consumer_offsets partition data?
We encountered this issue in our production environment using Kafka versions 2.2.1 and 2.4.0, and I believe it may also affect other versions.