Hi there, we are evaluating Kafka 0.9 to determine whether it is stable enough for production. We wrote a tool that basically verifies that each produced message is also properly consumed. We found the issue described below while stress-testing Kafka with this tool.
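The core bookkeeping of such a tool can be sketched independently of Kafka itself: record every produced message ID, check IDs off as they are consumed, and treat anything left over as not delivered. This is only a minimal illustration of the verification idea; the class and method names are hypothetical and not taken from our actual tool.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track produced message IDs and check them off on
// consumption. IDs still outstanding at the end of a run were produced
// but never consumed (e.g. their partition was orphaned by a rebalance).
public class DeliveryVerifier {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    // Called on the producer side after a successful send.
    public void recordProduced(String messageId) {
        outstanding.add(messageId);
    }

    // Called on the consumer side for each received message.
    // Returns false for duplicate or unknown IDs.
    public boolean recordConsumed(String messageId) {
        return outstanding.remove(messageId);
    }

    // IDs that were produced but never consumed.
    public Set<String> unconsumed() {
        return outstanding;
    }
}
```

In the scenario reported below, `unconsumed()` stays non-empty for the partitions that stop being consumed after an unsuccessful rebalance.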
Adding more and more consumers to a consumer group may result in unsuccessful rebalancing. Messages from one or more partitions are then not consumed and are effectively unavailable to the client application (e.g. for 15 minutes). The situation can be resolved externally by touching the consumer group again (adding or removing a consumer), which forces another rebalance that may or may not succeed.
Significantly higher CPU utilization was observed in these cases (from about 3% to 17%), in both the affected consumer and the Kafka broker, according to htop and profiling with jvisualvm.
The jvisualvm profiles suggest the issue may be related to KAFKA-2936 (see the screenshots in the GitHub repo below), but I am not sure. I also cannot tell whether the problem lies in the consumer or the broker, because both are affected and I do not know Kafka internals.
The issue is not deterministic, but it can be reproduced easily within a few minutes simply by starting more and more consumers. More parallelism across multiple CPUs probably gives the issue more chances to appear.
The tool itself, together with very detailed instructions for fairly reliable reproduction of the issue and an initial analysis, is available here:
- Prefer the fixed tag issue1 to the master branch, which may change.
- Note that the repository also contains various jvisualvm screenshots together with full logs from all components of the tool.
My colleague was able to reproduce the issue independently by following the instructions above. If you have any questions or need any help with the tool, just let us know. This issue is a blocker for us.