If the follower's last appended epoch is ahead of the leader's last appended epoch, the leader's OffsetsForLeaderEpoch response incorrectly contains (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET), and the follower truncates to the high watermark (HW). This can lead to data loss in rare cases where two leader elections happen back to back: one leader fails, the next leader is quickly replaced through preferred leader election while all replicas are still in the ISR, and then the third leader fails.
The bug is in LeaderEpochFileCache.endOffsetFor(), which returns (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET) if the requested leader epoch is ahead of the last leader epoch in the cache. The method should return (last leader epoch in the cache, LEO) in this scenario.
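The difference between the current and the proposed behaviour can be illustrated with a minimal sketch. This is a simplified Python model, not the actual Scala implementation: the function name `end_offset_for`, the `fixed` flag, the list-of-tuples cache representation, and the handling of epochs older than the first cache entry are all assumptions made for illustration.

```python
from typing import List, Tuple

UNDEFINED_EPOCH = -1
UNDEFINED_EPOCH_OFFSET = -1


def end_offset_for(entries: List[Tuple[int, int]], requested_epoch: int,
                   log_end_offset: int, fixed: bool) -> Tuple[int, int]:
    """Simplified model of the epoch lookup (hypothetical, for illustration).

    entries: (epoch, start_offset) pairs sorted by epoch.
    fixed:   False models the behaviour described as the bug,
             True models the proposed behaviour.
    """
    if requested_epoch == UNDEFINED_EPOCH or not entries:
        return (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)

    latest_epoch = entries[-1][0]
    if requested_epoch == latest_epoch:
        # The latest epoch ends at the log end offset (LEO).
        return (latest_epoch, log_end_offset)
    if requested_epoch > latest_epoch:
        if fixed:
            # Proposed: answer with the last known epoch and the LEO.
            return (latest_epoch, log_end_offset)
        # Bug: the requester is told nothing and falls back to its HW.
        return (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)

    # Older epoch: it ends where the next higher epoch starts.
    previous = [e for e, _ in entries if e <= requested_epoch]
    if not previous:
        # Assumption of this sketch: nothing is known about such old epochs.
        return (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
    next_start = next(o for e, o in entries if e > requested_epoch)
    return (previous[-1], next_start)
```

With a cache containing only epoch 0 and a LEO of 10, a request for epoch 1 returns (-1, -1) when `fixed=False` and (0, 10) when `fixed=True`.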
Currently, no entry is created in the leader epoch cache until a message is appended with the new leader epoch; every append to the log calls LeaderEpochFileCache.assign(). It would be much cleaner if `makeLeader` created an entry in the cache as soon as the replica becomes leader, which would also fix the bug. If the leader never appends any messages and the next leader epoch starts at the same offset, we already have clearAndFlushLatest(), which clears entries with start offsets greater than or equal to the passed offset. LeaderEpochFileCache.assign() could be merged with clearAndFlushLatest() so that assigning a new epoch also clears cache entries whose start offsets are greater than or equal to the new epoch's start offset, and the two methods no longer need to be called separately.
Here is an example scenario in which the issue leads to data loss.
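A rough sketch of what a merged assign could look like, again as a simplified Python model rather than the real code. The function name `assign`, its return-a-new-list style, and the rule that ignores epochs that are not newer than the latest entry are assumptions for illustration; flushing to the checkpoint file is omitted.

```python
from typing import List, Tuple


def assign(entries: List[Tuple[int, int]], epoch: int,
           start_offset: int) -> List[Tuple[int, int]]:
    """Hypothetical merged assign: record a new epoch and, in the same step,
    drop stale entries whose start offset is >= the new epoch's start offset.

    entries: (epoch, start_offset) pairs sorted by epoch.
    """
    if entries and epoch <= entries[-1][0]:
        # Not a new epoch (assumption of this sketch): nothing to record.
        return list(entries)
    # Entries starting at or after the new start offset belong to leaders
    # that never appended anything that survived, so they are removed.
    kept = [(e, o) for e, o in entries if o < start_offset]
    kept.append((epoch, start_offset))
    return kept


# Epoch 1 was recorded at offset 10 when the replica became leader,
# but that leader never appended. Epoch 2 then starts at the same offset.
cache = assign([(0, 0)], 1, 10)
cache = assign(cache, 2, 10)
print(cache)  # [(0, 0), (2, 10)]
```

Folding the cleanup into the assignment means a caller cannot record a new epoch while forgetting to remove the empty epoch that preceded it.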
Suppose we have three replicas: r1, r2, and r3. Initially, the ISR consists of (r1, r2, r3) and the leader is r1. The data up to offset 10 has been committed to the ISR. Here is the initial state:
Replica 1 fails and leaves the ISR, which makes Replica 2 the new leader with leader epoch = 1. The new leader appends a batch, but the batch has not yet been replicated to the followers.
Replica 3 is elected leader with leader epoch = 2 (due to preferred leader election) before it has a chance to truncate.
Replica 2 sends OffsetsForLeaderEpoch(leader epoch = 1) to Replica 3. Replica 3 incorrectly replies with UNDEFINED_EPOCH_OFFSET, and Replica 2 truncates to HW. If Replica 3 fails before Replica 2 re-fetches the data, this may lead to data loss.
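The follower-side decision in this last step can be sketched as follows. This is a simplified Python model; the function name `truncation_offset` and the concrete numbers in the usage lines are hypothetical. In particular, the follower HW of 8 is an assumed value chosen to show a high watermark that lags behind the committed offset 10, and the LEO of 11 assumes the unreplicated batch is a single record at offset 10.

```python
UNDEFINED_EPOCH_OFFSET = -1


def truncation_offset(leader_end_offset: int, follower_hw: int,
                      follower_leo: int) -> int:
    """Offset the follower truncates its log to after an
    OffsetsForLeaderEpoch response (simplified model)."""
    if leader_end_offset == UNDEFINED_EPOCH_OFFSET:
        # No usable answer from the leader: fall back to the high watermark.
        return follower_hw
    # Otherwise never truncate below what the leader reports,
    # and never "truncate" beyond the follower's own log end offset.
    return min(leader_end_offset, follower_leo)


# Replica 2 (assumed state): HW = 8, LEO = 11.
# Buggy reply from Replica 3: UNDEFINED_EPOCH_OFFSET.
print(truncation_offset(UNDEFINED_EPOCH_OFFSET, 8, 11))  # 8
# Proposed reply from Replica 3: (epoch 0, LEO 10).
print(truncation_offset(10, 8, 11))  # 10
```

Under these assumed values, the buggy reply makes Replica 2 drop offsets 8 and 9 even though they were committed, while the proposed reply removes only the unreplicated batch at offset 10.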