Noticed that a kafka cluster will lose data when a leader for a partition has their zookeeper connection timeout.
Sequence of events:
- Say broker A leads a partition followed by brokers B and C
- A ZK node has a network issue, happens to be the node used by broker A. Lets say this happens at offset X
- Kafka Controller immediately selects broker C as the new partition leader
- Broker A does not timeout from zookeeper for another 4 seconds. Broker A still thinks it is the leader, presumably accepting producer writes.
- Broker A detects the ZK timeout and leaves the ISR.
- Broker A reconnects to ZK, rejoins cluster as follower for partition
- Broker A truncates log to some offset Y such that Y > X. Broker A proceeds to catch up normally and becomes an ISR
- ISRs for partition are now in an inconsistent state:
- Broker C has all offsets X through Y plus everything after
- Broker B has all offsets X through Y plus everything after
- Broker A has offsets up to X and after Y. Everything between X and Y IS MISSING
- Within 5 minutes, controller trigger preferred replica election making Broker A the new leader for partition (this is default behavior)
All consumers after step 9 will not receive any messages for offsets between X and Y.
The root problem here seems to be broker A truncates to offset Y when rejoining the cluster. It should be truncating further back to offset X to prevent data loss