Affects Version/s: 0.9.0.1
Fix Version/s: None
Our Kafka installation runs with unclean leader election disabled, so brokers halt when they find that their message offset is ahead of the leader's offset for a topic. We had two brokers halt today with this issue. After much time spent digging through the logs, I believe the following timeline is a plausible explanation of what occurred.
- B1, B2, and B3 are replicas of a topic, all in the ISR. B2 is currently the leader, but B1 is the preferred leader. The controller runs on B3.
- B1 fails, but the controller does not detect the failure immediately.
- B2 receives a message from a producer and B3 fetches it to stay up to date. B2 has not committed the message (the high water mark has not advanced past it), because B1 is down and so has not acknowledged it.
- The controller triggers a preferred leader election, making B1 the leader, and notifies all replicas.
- Very shortly afterwards (~200ms), B1's broker registration in ZooKeeper expires, so the controller reassigns B2 to be leader again and notifies all replicas.
- Because B3 is the controller, while B2 is on another box, B3 hears about both of these events before B2 hears about either. B3 truncates its log to the high water mark (before the pending message) and resumes fetching from B2.
- B3 fetches the pending message from B2 again.
- B2 learns that it has been displaced and then reelected, and truncates its log to the high water mark, before the pending message.
- The next time B3 tries to fetch from B2, it sees that B2 is missing the pending message, so its own offset is ahead of the leader's, and it halts.
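The timeline above can be replayed with a toy model to check that the ordering alone is enough to produce the halt. This is only a sketch under stated assumptions: `Replica`, `HaltError`, `truncate_to_hwm`, `fetch_from`, and `simulate` are hypothetical names invented for illustration, not Kafka classes or APIs, and the model ignores B1, ZooKeeper, and the controller entirely. It only captures "truncate to the high water mark on a leader change" and "halt if the follower is ahead of the leader".

```python
class HaltError(Exception):
    """Hypothetical stand-in for a broker halting itself."""


class Replica:
    """Toy replica: a log (list of messages) plus a high water mark."""

    def __init__(self, name, log, hwm):
        self.name = name
        self.log = list(log)
        self.hwm = hwm  # number of committed messages

    @property
    def log_end_offset(self):
        return len(self.log)

    def truncate_to_hwm(self):
        # Assumed behavior on processing a leader change:
        # drop everything past the high water mark.
        del self.log[self.hwm:]

    def fetch_from(self, leader):
        # Assumed behavior with unclean leader election disabled:
        # a follower that is ahead of its leader halts.
        if self.log_end_offset > leader.log_end_offset:
            raise HaltError(
                f"{self.name} offset {self.log_end_offset} is ahead of "
                f"{leader.name} offset {leader.log_end_offset}"
            )
        self.log.extend(leader.log[self.log_end_offset:])


def simulate():
    committed = ["m0", "m1"]
    b2 = Replica("B2", committed, hwm=2)  # current leader
    b3 = Replica("B3", committed, hwm=2)  # follower, co-located with controller

    b2.log.append("pending")  # producer write, not yet committed
    b3.fetch_from(b2)         # B3 replicates the pending message

    # B3 processes both leader changes (B2 -> B1 -> B2) before B2 sees either.
    b3.truncate_to_hwm()
    b3.fetch_from(b2)         # B3 fetches the pending message again

    # B2 now processes the same leader changes and truncates.
    b2.truncate_to_hwm()

    halted = False
    try:
        b3.fetch_from(b2)     # B3 is now ahead of its leader
    except HaltError:
        halted = True
    return halted, b2.log_end_offset, b3.log_end_offset


if __name__ == "__main__":
    print(simulate())
```

In this model the halt depends only on B3 processing both leader changes and refetching before B2 truncates; if B2 truncates first, B3 never ends up ahead of it.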
In this case, there was no data loss or inconsistency. I haven't fully thought through whether either would be possible, but it seems likely, especially if there had been multiple producers to this topic.
I'm not completely certain about this timeline, but this sequence of events appears to be at least possible. Looking a bit through the controller code, there doesn't seem to be anything that forces LeaderAndIsrRequest to be sent in a particular order. If someone with more knowledge of the code base believes this is incorrect, I'd be happy to post the logs and/or do some more digging.