Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Consider the following scenario:
There are three replicas: A, B, and C. In epoch=1, replica A is the leader and writes up to offset 10. The leader then fails with the high watermark at offset 8. At that point, replica B had caught up to offset 10 while replica C was at offset 8. Suppose that C is elected with epoch=2 and immediately writes records up to offset 10. However, it also fails before these records become committed, and replica B gets elected with epoch=3 and writes records up to offset 12. The epoch cache on each replica will look like the following:
Replica A:
(epoch=1, start_offset=0)
Replica B:
(epoch=1, start_offset=0)
(epoch=3, start_offset=10)
Replica C:
(epoch=1, start_offset=0)
(epoch=2, start_offset=8)
Suppose C comes back online. It will attempt to fetch at offset=10 with last_fetched_epoch=2. The leader B will detect log divergence and will return truncation_offset=10. Replica C will truncate to offset 10 (a no-op), retry the same fetch, and remain stuck.
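The loop can be reproduced with a small model of the leader-side check. This is only a sketch under assumptions: the function names, the list-of-tuples cache layout, and the divergence rule (largest leader epoch less than or equal to the fetched epoch, compared against the fetch offset) are illustrative and are not taken from the actual implementation.

```python
from bisect import bisect_right

# Leader B's epoch cache from the scenario: (epoch, start_offset), sorted by epoch.
CACHE_B = [(1, 0), (3, 10)]
LOG_END_B = 12


def end_offset_for_epoch(cache, log_end, epoch):
    """Return (e, end_offset) for the largest cached epoch e <= epoch, or None.

    An epoch ends where the next cached epoch starts, or at the log end
    if it is the latest epoch.
    """
    epochs = [e for e, _ in cache]
    idx = bisect_right(epochs, epoch) - 1
    if idx < 0:
        return None
    end = cache[idx + 1][1] if idx + 1 < len(cache) else log_end
    return cache[idx][0], end


def leader_truncation_offset(cache, log_end, fetch_offset, last_fetched_epoch):
    """Hypothetical leader check: a truncation offset on divergence, else None."""
    found = end_offset_for_epoch(cache, log_end, last_fetched_epoch)
    if found is None:
        return 0
    epoch, end = found
    if epoch != last_fetched_epoch or end < fetch_offset:
        return min(end, fetch_offset)
    return None


# Replica C retries the fetch at offset=10 with last_fetched_epoch=2.
fetch_offset, last_fetched_epoch = 10, 2
history = []
for _ in range(3):
    truncation = leader_truncation_offset(
        CACHE_B, LOG_END_B, fetch_offset, last_fetched_epoch
    )
    if truncation is None:
        break
    # Truncating to an offset equal to the log end changes nothing.
    fetch_offset = min(fetch_offset, truncation)
    history.append(fetch_offset)

print(history)  # [10, 10, 10]
```

Every attempt gets truncation_offset=10 back, so the follower's state never changes and the fetch never succeeds.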
To fix this, I see two options:
Option 1: In the case that the truncation offset equals the fetch offset, we can instead return the previous epoch end offset. In this example, we would return truncation_offset=0. The downside is that this causes unnecessary truncation.
Option 2: Rather than returning only the truncation offset, we can have the leader return both the previous "diverging" epoch and its end offset. In this example, B would return diverging_epoch=1, end_offset=10. Replica C would then know to truncate to offset 8.
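The two options can be compared on the example with a short sketch. All names here are hypothetical, and both functions encode one possible reading of the options: Option 1 steps back to the end offset of the epoch before the matched one (0 if there is none), and Option 2 has the follower truncate to the smaller of the leader's end offset and its own end offset for the diverging epoch. Both assume divergence has already been detected.

```python
# Epoch caches from the scenario: (epoch, start_offset) pairs sorted by epoch.
CACHE_B, LOG_END_B = [(1, 0), (3, 10)], 12   # leader
CACHE_C, LOG_END_C = [(1, 0), (2, 8)], 10    # returning follower


def epoch_end(cache, log_end, epoch):
    """Return (e, end_offset) for the largest cached epoch e <= epoch, or None."""
    idx = max((i for i, (e, _) in enumerate(cache) if e <= epoch), default=None)
    if idx is None:
        return None
    end = cache[idx + 1][1] if idx + 1 < len(cache) else log_end
    return cache[idx][0], end


def option1_truncation_offset(cache, log_end, fetch_offset, last_fetched_epoch):
    """Option 1 (assumed reading): avoid a no-op by stepping back one epoch."""
    found = epoch_end(cache, log_end, last_fetched_epoch)
    if found is None:
        return 0
    epoch, end = found
    offset = min(end, fetch_offset)
    if offset == fetch_offset:
        # Use the end offset of the epoch before the matched one, or 0 if none.
        previous = epoch_end(cache, log_end, epoch - 1)
        offset = previous[1] if previous else 0
    return offset


def option2_follower_truncation(follower_cache, follower_log_end,
                                diverging_epoch, end_offset):
    """Option 2 (simplified): the follower picks the truncation point itself."""
    local = epoch_end(follower_cache, follower_log_end, diverging_epoch)
    local_end = local[1] if local else 0
    return min(end_offset, local_end)


# Option 1: the leader answers C's fetch (offset=10, last_fetched_epoch=2).
print(option1_truncation_offset(CACHE_B, LOG_END_B, 10, 2))  # 0

# Option 2: the leader returns the diverging epoch and its end offset.
diverging_epoch, end_offset = epoch_end(CACHE_B, LOG_END_B, 2)
print(diverging_epoch, end_offset)  # 1 10
print(option2_follower_truncation(CACHE_C, LOG_END_C, diverging_epoch, end_offset))  # 8
```

Under these assumptions, Option 1 discards the whole log on C, while Option 2 removes only the two uncommitted records from epoch 2.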
The second option is what was initially specified in the Raft proposal, but we changed it during the discussion because we were not thinking of this case and we thought the response could be simplified. My inclination is to restore the originally specified truncation logic.