Synchronization code may fail to truncate an uncommitted transaction in the follower’s transaction log. Here is a scenario:
Start the ensemble with three nodes A, B and C with C being the leader
The current epoch is 1
For simplicity of the example, let’s say zxid is a two digit number, with epoch being the first digit
Create two znodes ‘key0’ and ‘key1’ whose value is ‘0’ and ‘1’, respectively
The zxid is 12 – 11 for creating key0 and 12 for creating key1. (For simplicity of the example, the zxid gets increased only by transactions directly changing the data of znodes.)
All the nodes have seen the change 12 and have persistently logged it
Shut down all
Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), with zxid 21 is issued. The leader B writes it to the log but Node A is shutdown before writing it to the log. Then, the leader B is also shut down. The change 21 is applied only to B but not to A or C.
Start Node A and C. Epoch becomes 3. Node A has the higher zxid than Node C (i.e. 20 > 12). So, Node A becomes the leader. Yet, the last processed zxid is 12 for both Node A and C. So, they are in sync already. Node A sends an empty DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then, A and C are shut down. Now, C has the higher zxid than Node B.
Start Node B and C. Epoch becomes 4. Node C has the higher zxid than Node B (i.e. 30 > 21). So, Node C becomes the leader. Node B and C has the different last processed zxid (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and creates snapshot.12 as the zxid 12 is the last processed zxid of the leader C. (Note the newly created snapshot on B is assigned the lower zxid then the change 21 in the log). Then, the request, setData(key1, 1001), with zxid 41 is issued. Both B and C apply the change 41 into their logs. (Note that now B and C have the same last processed zxid) Then, B and C are shut down.
Start Node B and C. Epoch becomes 5. Node B and C use their local log and snapshot files to restore their in-memory data tree. Node B has 1000 as the value of key0, because it’s latest valid snapshot is snapshot.12 and there was a later transaction with zxid 21 in its log. Yet, Node C has 0 as the value of key0, because the change 21 was never written on C. Node C is the leader. Node B and C have the same last processed zxid, i.e. 41. So, they are considered to be in sync already, and Node C sends an empty DIFF to Node B. So, the synchronization completes with the initially restored in-memory data tree on B and C.
The value of key0 on B is 1000, while the value of the key0 on Node C is 0. The LearnerHandler.run on C at Step 3, never sends TRUNC but just SNAP. So, the change 21 was never truncated on B. Also, at step 4, since B uses the snapshot of the lower zxid to restore its in-memory data tree, the change 21 could get into the data tree. Then, the leader C at the step 4 did not send SNAP, because the change 41 made to both B and C makes the leader C think the B and C are already in sync. Thus, data inconsistency occurs.
The attached test case can deterministically reproduce the bug.