Affects Version/s: 3.5.8
Fix Version/s: None
Using a tool that I wrote for testing ZooKeeper, I discovered the following scenario, which causes ZooKeeper to violate sequential consistency.
Initially, start an ensemble with 3 servers called A, B, and C, and initialize 2 znodes called /key0 and /key1 to 0. Stop all servers.
- Start A and B. Stop A and at the same time initiate setting /key1 to 101 on B. Stop B.
- Start A and B and stop them. In this step it seems that /key1 == 101 is successfully propagated to A.
- Start A and C. Initiate a conditional write on A: if /key1 == 101, set /key0 to 200. The write seems to be successful. Stop the servers.
- Start A, B, and C. Initiate a conditional write on B: if /key1 == 0, set /key1 to 301. Surprisingly, the write succeeds. Stop the servers.
Finally, start all servers and read the values of /key0 and /key1 on all servers. They will be 200 and 301.
Even if we assume that any write can fail, the set of possible values for (/key0, /key1) under sequential consistency consists of (0, 0), (0, 101), (200, 101), and (0, 301). The values (200, 301) should not be possible: if /key0 == 200, then setting /key1 to 101 must have succeeded. On the other hand, if /key1 == 301, then setting /key1 to 101 must have failed, because that write was issued before the conditional write that read /key1 == 0.
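As a sanity check on the enumeration above, here is a minimal Python sketch (it is not part of the attached patch and does not talk to ZooKeeper). It assumes that the three writes take effect in the order they were issued and that each one may independently fail to apply. The function name `apply_in_order` and the boolean flags are purely illustrative.

```python
from itertools import product


def apply_in_order(applied):
    """Replay the three writes in issue order.

    applied[i] is True if write i took effect and False if it failed.
    Returns the final (key0, key1) pair.
    """
    key0, key1 = 0, 0
    w1, w2, w3 = applied
    # Write 1: unconditionally set /key1 to 101.
    if w1:
        key1 = 101
    # Write 2: if /key1 == 101, set /key0 to 200.
    if w2 and key1 == 101:
        key0 = 200
    # Write 3: if /key1 == 0, set /key1 to 301.
    if w3 and key1 == 0:
        key1 = 301
    return key0, key1


# Try every combination of "write applied" / "write failed".
allowed = {apply_in_order(flags) for flags in product([False, True], repeat=3)}

print(sorted(allowed))        # [(0, 0), (0, 101), (0, 301), (200, 101)]
print((200, 301) in allowed)  # False
```

Under these assumptions the observed outcome (200, 301) is not reachable by any combination of failed writes, which matches the argument above.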
The cause of this bug is probably related to the cause of ZOOKEEPER-2832, which was reported 3 years ago and is still open. You will notice that the above scenario is similar to the one reported there. Indeed, my tool randomly explores similar scenarios with conditional and unconditional writes under random server crashes, in search of sequential consistency violations.
I have attached a patch with a test that reproduces this bug. The affected version is 3.5.8. I suspect that 3.6.1 is also affected, but unfortunately, I'm having trouble compiling that version.