Affects Version/s: 3.5.8
Fix Version/s: None
A leader may have an incomplete commitlog if it previously resynced with the old leader via SNAPSHOT replication.
Later, a follower may try to resync with this leader, and some of the transactions the follower missed earlier may be absent from the leader's commitlog.
If the two nodes nevertheless resync using txnlog + commitlog, this leads to convergence failure, because the leader does not send the missing transactions that are not in its commitlog.
Here are the abstract steps to reproduce the bug. I attached a patch with a test case that reproduces it.
Initially, nodes A, B, and C are all in sync.
1. Node A crashes; setData 0x11 on B and C
2. Node B and C crash
3. Node A and B restart
4. Node A crashes; setData 0x21 on B
5. Node B crashes
6. Node B and C restart
7. Node C crashes; setData 0x32 on B
8. Node A and C restart
9. Node B restarts
At step 6, C is a follower getting a snapshot from B, so C does not have the transaction 0x21 in its commitlog (only in the snapshot).
At step 8, C is the leader but does not have 0x21 in its commitlog, so A never receives 0x21.
In the end, 0x21 only exists on B and C, but not on A.
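The divergence can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not ZooKeeper code: the `Node` class, `snapshot_sync`, `log_sync`, and `reproduce` are hypothetical names, each node is reduced to a replayable log plus a data-tree state, the resync at step 3 is assumed to be log-based, and steps 7 and 9 (the 0x32 write and B's final restart) are left out because this model does not need them to show that A misses 0x21.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy replica (hypothetical model, not a ZooKeeper class)."""
    name: str
    log: list = field(default_factory=list)   # zxids replayable from txnlog + commitlog
    state: set = field(default_factory=set)   # zxids reflected in the data tree / snapshot

    def last_zxid(self):
        return max(self.state, default=0)

    def apply(self, zxid):
        # A normally committed write lands in both the log and the data tree.
        self.log.append(zxid)
        self.state.add(zxid)


def snapshot_sync(leader, follower):
    # SNAPSHOT replication: the follower copies the data tree but gains no log entries.
    follower.state = set(leader.state)


def log_sync(leader, follower):
    # txnlog + commitlog replication: only entries present in the leader's log are sent.
    start = follower.last_zxid()
    for zxid in sorted(leader.log):
        if zxid > start:
            follower.apply(zxid)


def reproduce():
    a, b, c = Node("A"), Node("B"), Node("C")
    for node in (b, c):      # step 1: setData 0x11 on B and C while A is down
        node.apply(0x11)
    log_sync(b, a)           # step 3: A resyncs from B (sync mode assumed here)
    b.apply(0x21)            # step 4: setData 0x21 on B while A is down
    snapshot_sync(b, c)      # step 6: C gets a snapshot from B
    log_sync(c, a)           # step 8: leader C resyncs A from its log only
    return a, b, c
```

In this model, after `reproduce()` the zxid 0x21 is in the state of B and C but not in C's log, so the final log-based sync has nothing to send and A stays at 0x11.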
I think the fix belongs in LearnerHandler's syncFollower method, as follows:
1. Check the last transaction it has in its txnlog + commitlog
2. If it is more recent than what it has in its txnlog + commitlog, then it should use Snapshot
3. Otherwise, continue with txnlog + commitlog replication
I attached a patch containing the proposed fix.
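The three steps above can be sketched as a single decision function. The steps leave some room for interpretation, so this sketch assumes one possible reading: the leader compares the last zxid reflected in its data tree with the newest zxid it can replay from txnlog + commitlog. The function name, parameters, and return strings are hypothetical and are not taken from LearnerHandler or from the attached patch.

```python
def choose_sync_mode(leader_last_zxid, newest_replayable_zxid):
    """Pick a sync mode under the assumed reading of the proposed fix.

    leader_last_zxid: last zxid reflected in the leader's data tree (assumed input).
    newest_replayable_zxid: newest zxid in the leader's txnlog + commitlog (assumed input).
    """
    # If the data tree is ahead of everything the logs can replay, the logs
    # have a gap (for example after a SNAPSHOT resync), so fall back to a snapshot.
    if leader_last_zxid > newest_replayable_zxid:
        return "SNAPSHOT"
    # Otherwise the logs cover the leader's state and can be used as before.
    return "TXNLOG_AND_COMMITLOG"
```

Applied to step 8 of the reproduction, leader C has 0x21 in its data tree but only 0x11 in its logs, so `choose_sync_mode(0x21, 0x11)` selects the snapshot path and A would receive 0x21 as part of the snapshot.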