Hi there, we are evaluating Kafka 0.9 to determine whether it is stable enough for production. We wrote a tool that basically verifies that each produced message is also properly consumed. We found the issue described below while stress-testing Kafka with this tool.
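The core bookkeeping of such a tool can be sketched independently of Kafka itself: record every produced message ID, check IDs off as they are consumed, and treat anything left over as not delivered. This is only a minimal illustration of the verification idea; the class and method names are hypothetical and not taken from our actual tool.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track produced message IDs and check them off on
// consumption. IDs still outstanding at the end of a run were produced
// but never consumed (e.g. their partition was orphaned by a rebalance).
public class DeliveryVerifier {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    // Called on the producer side after a successful send.
    public void recordProduced(String messageId) {
        outstanding.add(messageId);
    }

    // Called on the consumer side for each received message.
    // Returns false for duplicate or unknown IDs.
    public boolean recordConsumed(String messageId) {
        return outstanding.remove(messageId);
    }

    // IDs that were produced but never consumed.
    public Set<String> unconsumed() {
        return outstanding;
    }
}
```

In the scenario reported below, `unconsumed()` stays non-empty for the partitions that stop being consumed after an unsuccessful rebalance.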
Adding more and more consumers to a consumer group may result in unsuccessful rebalancing. Messages from one or more partitions are then not consumed and are effectively unavailable to the client application (e.g. for 15 minutes). The situation can be resolved externally by touching the consumer group again (adding or removing a consumer), which forces another rebalance that may or may not succeed.
Significantly higher CPU utilization was observed in these cases (from about 3% to 17%), in both the affected consumer and the Kafka broker, according to htop and profiling with jvisualvm.
The jvisualvm profiles suggest the issue may be related to KAFKA-2936 (see the screenshots in the GitHub repo below), but I am not sure. I also cannot tell whether the problem lies in the consumer or the broker, because both are affected and I do not know Kafka internals.
The issue is not deterministic, but it can be reproduced easily within a few minutes simply by starting more and more consumers. More parallelism across multiple CPUs probably gives the issue more chances to appear.
The tool itself, together with very detailed instructions for fairly reliable reproduction of the issue and an initial analysis, is available here:
- Prefer the fixed tag issue1 to the master branch, which may change.
- Note that the repository also contains various jvisualvm screenshots together with full logs from all components of the tool.
My colleague was able to reproduce the issue independently by following the instructions above. If you have any questions or need any help with the tool, just let us know. This issue is a blocker for us.