Description
There is a bug in the consumer rebalancing logic that causes a consumer to stop pulling data from some partitions of a topic. The consumer recovers only after the consumer group is restarted, provided the restart does not hit this bug again.
Here is the observed behavior of the consumer when it hits the bug:
1. The consumer is consuming 2 topics, each with 1 partition on each of 2 brokers
2. Broker 2 is bounced
3. A rebalancing operation is triggered for topic_2, and the consumer decides to consume topic_2 data only from Broker 1
4. During the rebalancing operation, ZK has not yet deleted /brokers/topics/topic_1/broker_2, so the consumer still decides to consume from both brokers for topic_1
5. While restarting the fetchers, it tries to restart the fetcher for broker 2 and throws a RuntimeException. Before this, it had successfully started the fetcher for broker 1 and was consuming data from broker_1
6. This exception propagates all the way up to the syncedRebalance API, so oldPartitionsPerTopicMap is not updated to reflect that, for topic_2, the consumer has now seen only broker_1. It still points to topic_2 -> broker_1, broker_2
7. Next rebalancing attempt gets triggered
8. By now, broker 2 is restarted and registered in zookeeper
9. For topic_2, the consumer tries to see if rebalancing needs to be done. Since it doesn't see a change in the cached topic partition map, it decides there is no need to rebalance.
10. It continues fetching only from broker_1
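
The sequence above can be reproduced with a small toy model. This is only a sketch of the failure pattern, not the actual consumer code: the class ToyConsumer, the method synced_rebalance, and the fields cached_view and active_fetchers are hypothetical stand-ins for syncedRebalance and oldPartitionsPerTopicMap, and the assumption that a fetcher start fails with a RuntimeError when its broker is down is taken from step 5.

```python
def copy_view(view):
    """Copy a topic -> set-of-brokers mapping so later changes do not alias."""
    return {topic: set(brokers) for topic, brokers in view.items()}


class ToyConsumer:
    """Hypothetical model of the consumer; not the real Kafka classes."""

    def __init__(self, initial_view):
        # Stand-in for oldPartitionsPerTopicMap: the view seen at the
        # last rebalance that completed successfully.
        self.cached_view = copy_view(initial_view)
        # Brokers this consumer currently has a running fetcher for.
        self.active_fetchers = set().union(*initial_view.values())

    def synced_rebalance(self, zk_view, live_brokers):
        # Skip the rebalance when the registry view matches the cache.
        if zk_view == self.cached_view:
            return False
        # Stop all fetchers, then restart one per broker in the new view.
        self.active_fetchers = set()
        for broker in sorted(set().union(*zk_view.values())):
            if broker not in live_brokers:
                # Assumed failure mode: the exception escapes before the
                # cache update below is reached.
                raise RuntimeError("cannot start fetcher for " + broker)
            self.active_fetchers.add(broker)
        self.cached_view = copy_view(zk_view)
        return True


both = {"broker_1", "broker_2"}
consumer = ToyConsumer({"topic_1": both, "topic_2": both})

# Steps 2-6: broker_2 is down, but its topic_1 entry is still in ZK.
stale_view = {"topic_1": both, "topic_2": {"broker_1"}}
first_attempt_failed = False
try:
    consumer.synced_rebalance(stale_view, live_brokers={"broker_1"})
except RuntimeError:
    first_attempt_failed = True

# Steps 7-10: broker_2 is back, so ZK matches the stale cache again.
recovered_view = {"topic_1": both, "topic_2": both}
rebalanced = consumer.synced_rebalance(recovered_view, live_brokers=both)

print(first_attempt_failed, rebalanced, sorted(consumer.active_fetchers))
```

In this model the second call returns without rebalancing because the cache was never updated after the failed attempt, so the only running fetcher is the one for broker_1. One possible direction for a fix, under the same assumptions, would be to invalidate the cached view whenever a rebalance attempt fails, so that the next attempt cannot be skipped.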