We have observed an edge case in the replication failover logic which can cause a replica to permanently fall out of sync with the leader or, in the worst case, actually have localized divergence between logs. This occurs in spite of the improved truncation logic from KIP-101.
Suppose we have brokers A and B. Initially A is the leader in epoch 1. It appends two batches: one in the range (0, 10) and the other in the range (11, 20). The first one successfully replicates to B, but the second one does not. In other words, the logs on the brokers look like this:
Broker A then has a zk session expiration and broker B is elected with epoch 2. It appends a new batch with offsets (11, n) to its local log. So we now have this:
Normally we expect broker A to truncate to offset 11 on becoming the follower, but before it is able to do so, broker B has its own zk session expiration and broker A again becomes leader, now with epoch 3. It then appends a new entry in the range (21, 30). The updated logs look like this:
Now what happens next depends on the last offset of the batch appended in epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch request to broker A with epoch 2. Broker A will respond that epoch 2 ends at offset 21. There are three cases:
1) n < 20: In this case, broker B will not do any truncation. It will begin fetching from offset n, which will ultimately cause an out of order offset error because broker A will return the full batch beginning from offset 11 which broker B will be unable to append.
2) n == 20: Again broker B does not truncate. It will fetch from offset 21 and everything will appear fine though the logs have actually diverged.
3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in the middle of the batch, it will truncate all the way to offset 10. It can begin fetching from offset 11 and everything is fine.
The case we have actually seen is the first one. The second one would likely go unnoticed in practice and everything is fine in the third case. To workaround the issue, we deleted the active segment on the replica which allowed it to re-replicate consistently from the leader.
I'm not sure the best solution for this scenario. Maybe if the leader isn't aware of an epoch, it should always respond with UNDEFINED_EPOCH_OFFSET instead of using the offset of the next highest epoch. That would cause the follower to truncate using its high watermark. Or perhaps instead of doing so, it could send another OffsetForLeaderEpoch request at the next previous cached epoch and then truncate using that.