We recently introduced integration tests in Connect. This test spins up one or more Connect workers along with a Kafka broker and Zk in a single process and attempts to move records using a Connector. In the Example Integration Test, we spin up three workers each hosting a Connector task that consumes records from a Kafka topic. When the connector starts up, it may go through multiple rounds of rebalancing. We notice the following two problems in the last few days:
- After members join a group, there are no pendingMembers remaining, but the join group method does not complete, and send these members a signal that they are not ready to start consuming from their respective partitions.
- Because of quick rebalances, a consumer might have started a group, but Connect starts a rebalance, after we which we create three new instances of the consumer (one from each worker/task). But the group coordinator seems to have 4 members in the group. This causes the JoinGroup to indefinitely stall.
Even though this ticket is described in the connect of Connect, it may be applicable to general consumers.