Let's say we have three replicas for a partition: 1, 2 ,and 3.
In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 replicates up to offset 50, but broker 3 is a little behind at offset 40. The high watermark is 40.
Suppose that broker 2 has a zk session expiration event, but fails to detect it or fails to reestablish a session (e.g. due to a bug like
KAFKA-6879), and it continues fetching from broker 1.
For whatever reason, broker 3 is elected the leader for epoch 1 beginning at offset 40. Broker 1 detects the leader change and truncates its log to offset 40. Some new data is appended up to offset 60, which is fully replicated to broker 1. Broker 2 continues fetching from broker 1 at offset 50, but gets NOT_LEADER_FOR_PARTITION errors, which is retriable and hence broker 2 will retry.
After some time, broker 1 becomes the leader again for epoch 2. Broker 1 begins at offset 60. Broker 2 has not exhausted retries and is now able to fetch at offset 50 and append the last 10 records in order to catch up. However, because it did not observed the leader changes, it never saw the need to truncate its log. Hence offsets 40-49 still reflect the uncommitted changes from epoch 0. Neither KIP-101 nor KIP-279 can fix this because the tail of the log is correct.
The basic problem is that zombie replicas are not fenced properly by the leader epoch. We actually observed a sequence roughly like this after a broker had partially deadlocked from
KAFKA-6879. We should add the leader epoch to fetch requests so that the leader can fence the zombie replicas.
A related problem is that we currently allow such zombie replicas to be added to the ISR even if they are in an offline state. The problem is that the controller will never elect them, so being part of the ISR does not give the availability guarantee that is intended. This would also be fixed by verifying replica leader epoch in fetch requests.