The consumer maintains a cache of the topics and partitions it owns, along with the corresponding fetcher queues. However, this cache is not cleared when the consumer releases partition ownership. As a result, the consumer can release a partition that it no longer owns. It can also commit offsets for partitions that it no longer consumes from.
The rebalance operation goes through the following steps:
1. close fetchers
2. commit offsets
3. release partition ownership
4. rebalance: add the topics, partitions and fetcher queues to the topic registry, for all topics that the consumer process currently wants to own
5. if the consumer runs into a conflict for any topic or partition, the rebalancing attempt fails, and it goes back to step 1
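The steps above can be sketched as a retry loop. This is only an illustrative model, not the actual consumer code: SketchConsumer, its methods, the owners dict (standing in for the ownership registry in Zookeeper) and max_attempts are all hypothetical names. Note that steps 2 and 3 act on whatever the cache currently lists, and that the cache is only rewritten when step 4 succeeds.

```python
class SketchConsumer:
    """Hypothetical stand-in for the consumer; records the steps it runs."""

    def __init__(self, owners, name):
        self.owners = owners         # partition -> owner (stand-in for Zookeeper)
        self.name = name
        self.topic_registry = set()  # cache of partitions this consumer believes it owns
        self.log = []

    def close_fetchers(self):
        self.log.append("close fetchers")

    def commit_offsets(self):
        # Offsets are committed for whatever the cache lists.
        self.log.append("commit offsets for %s" % sorted(self.topic_registry))

    def release_partition_ownership(self):
        # Ownership is released for whatever the cache lists;
        # the cache itself is left untouched, as described in this report.
        self.log.append("release %s" % sorted(self.topic_registry))
        for partition in self.topic_registry:
            self.owners.pop(partition, None)

    def claim(self, wanted):
        # Step 4: fail on any conflict, otherwise register ownership and cache it.
        if any(p in self.owners for p in wanted):
            return False
        for p in wanted:
            self.owners[p] = self.name
        self.topic_registry = set(wanted)
        return True


def rebalance(consumer, wanted, max_attempts=4):
    """Run steps 1-4; on a conflict (step 5) start over from step 1."""
    for attempt in range(1, max_attempts + 1):
        consumer.close_fetchers()
        consumer.commit_offsets()
        consumer.release_partition_ownership()
        if consumer.claim(wanted):
            return attempt
    raise RuntimeError("rebalance failed after %d attempts" % max_attempts)
```

Because a failed attempt leaves topic_registry unchanged, the next attempt's release step works from the old ownership list.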
Say, there are 2 consumers in a group, c1 and c2. Both are consuming topic1 with partitions 0-0, 0-1 and 1-0. Say c1 owns 0-0 and 0-1 and c2 owns 1-0.
1. Broker 1 goes down. This triggers a rebalancing attempt in both c1 and c2.
2. c1 releases partition ownership and, during step 4 (above), fails to rebalance.
3. Meanwhile, c2 completes rebalancing successfully, takes ownership of partition 0-1 and starts consuming data.
4. c1 starts its next rebalancing attempt and, during step 3 (above), releases partition 0-1 because it is still in c1's cache, even though c2 now owns it. During step 4, c1 takes ownership of partition 0-0 again and starts consuming data.
5. Effectively, rebalancing has completed successfully on both consumers, but there is no owner for partition 0-1 registered in Zookeeper.
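The sequence above can be replayed with a small model. This is a sketch under stated assumptions, not the real consumer code: the owners dict stands in for the ownership registry in Zookeeper, the per-consumer sets stand in for the cache, release() is assumed to delete ownership entries without checking who holds them, and clear_cache=True models one possible fix implied by this report (clearing the cache on release). All names are hypothetical.

```python
def release(owners, cache, clear_cache):
    """Step 3: delete the ownership entry for every partition in the cache."""
    for partition in cache:
        owners.pop(partition, None)
    if clear_cache:
        cache.clear()  # assumed fix; the behaviour described here skips this


def replay(clear_cache):
    # Initial state: c1 owns 0-0 and 0-1, c2 owns 1-0.
    owners = {"0-0": "c1", "0-1": "c1", "1-0": "c2"}
    c1_cache = {"0-0", "0-1"}
    c2_cache = {"1-0"}

    # Broker 1 goes down. c1's first attempt releases its partitions,
    # then fails at step 4, so its cache is not rewritten.
    release(owners, c1_cache, clear_cache)

    # c2 rebalances successfully and takes ownership of 0-1.
    release(owners, c2_cache, clear_cache)
    owners["0-1"] = "c2"
    c2_cache.clear()
    c2_cache.add("0-1")

    # c1's second attempt: release whatever the cache lists, then take 0-0.
    release(owners, c1_cache, clear_cache)
    owners["0-0"] = "c1"
    c1_cache.clear()
    c1_cache.add("0-0")
    return owners


print(replay(clear_cache=False))  # 0-1 has no registered owner
print(replay(clear_cache=True))   # 0-1 stays registered to c2
```

With the stale cache, c1's second release removes the entry that c2 created for 0-1; with the cache cleared on release, that entry survives.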