Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
Correctness - Unrecoverable Corruption / Loss
-
Critical
-
Normal
-
Workload Replay
-
All
-
None
-
Description
While verifying CASSANDRA-18364 I saw the following history violation in simulator
[junit-timeout] Testcase: simulationTest(org.apache.cassandra.simulator.test.ShortAccordSimulationTest)-.jdk1.8: Caused an ERROR [junit-timeout] Failed on seed 0xadaca81151490353 [junit-timeout] org.apache.cassandra.simulator.SimulationException: Failed on seed 0xadaca81151490353 [junit-timeout] Caused by: java.lang.AssertionError: History violations detected [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation.logAndThrow(PaxosSimulation.java:315) [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation.isDone(PaxosSimulation.java:278) [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation$2.hasNext(PaxosSimulation.java:249) [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:224) [junit-timeout] at org.apache.cassandra.simulator.paxos.AbstractPairOfSequencesPaxosSimulation.run(AbstractPairOfSequencesPaxosSimulation.java:297) [junit-timeout] at org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation.run(PairOfSequencesAccordSimulation.java:62) [junit-timeout] at org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:374) [junit-timeout] at org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:39) [junit-timeout] at org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:30) [junit-timeout] at org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:355) [junit-timeout] at org.apache.cassandra.simulator.paxos.AccordSimulationRunner.main(AccordSimulationRunner.java:76) [junit-timeout] at org.apache.cassandra.simulator.test.ShortAccordSimulationTest.simulationTest(ShortAccordSimulationTest.java:32) [junit-timeout] Suppressed: org.apache.cassandra.simulator.paxos.HistoryViolation: Inconsistent sequences on 1: [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, 68, 70, 71, 74, 76, 79, 80, 83, 85, 87, 87] vs [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, 68, 70, 71, 74, 76, 79, 80, 83, 85, 87]+90 [junit-timeout] at accord.verify.StrictSerializabilityVerifier$Register.updateSequence(StrictSerializabilityVerifier.java:607) [junit-timeout] at accord.verify.StrictSerializabilityVerifier$Register.access$100(StrictSerializabilityVerifier.java:576) [junit-timeout] at accord.verify.StrictSerializabilityVerifier.apply(StrictSerializabilityVerifier.java:825) [junit-timeout] at org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.lambda$close$0(StrictSerializabilityValidator.java:66) [junit-timeout] at org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.convertHistoryViolation(StrictSerializabilityValidator.java:89) [junit-timeout] at org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.access$200(StrictSerializabilityValidator.java:27) [junit-timeout] at org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.close(StrictSerializabilityValidator.java:66) [junit-timeout] at org.apache.cassandra.simulator.paxos.LoggingHistoryValidator$1.close(LoggingHistoryValidator.java:63) [junit-timeout] at org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.verify(PairOfSequencesAccordSimulation.java:218) [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:135) [junit-timeout] at org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.accept(PairOfSequencesAccordSimulation.java:171) [junit-timeout] at org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:83) [junit-timeout] at org.apache.cassandra.simulator.systems.SimulatedActionCallable$1.run(SimulatedActionCallable.java:47) [junit-timeout] at org.apache.cassandra.simulator.systems.InterceptingExecutor$InterceptingPooledExecutor$WaitingThread.lambda$new$1(InterceptingExecutor.java:317) [junit-timeout] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [junit-timeout] at java.lang.Thread.run(Thread.java:750)
Adding logging to track message passing, reads, and writes, I have the following ordering
[isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.1:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) [isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.2:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) [isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) [CommandStore[2]:1] node1 2023-04-03 12:54:30,208 CS:[2] OP:0xea64a268 reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE verb:ACCORD_PREACCEPT_RSP)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,209 CS:[2] OP:0x9761fb36 reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_PREACCEPT_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,210 CS:[2] OP:0xe62230f2 reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE verb:ACCORD_PREACCEPT_RSP)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,211 CS:[2] OP:0xc5563e5d send(/127.0.0.1:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d send(/127.0.0.2:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d send(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,334 CS:[2] OP:0xf8562cfb reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_RECOVER_RSP)) [CommandStore[2]:1] node1 2023-04-03 12:54:30,338 CS:[2] OP:0xcfd2540f reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE verb:ACCORD_RECOVER_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,338 CS:[2] OP:0xc4cf5af8 reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_RECOVER_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE verb:ACCORD_RECOVER_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,343 CS:[2] OP:0xb60153ab reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_GET_DEPS_RSP)) [CommandStore[2]:1] node1 2023-04-03 12:54:30,344 CS:[2] OP:0xac20a1d6 reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE verb:ACCORD_GET_DEPS_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,345 CS:[2] OP:0xa73f7484 reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE verb:ACCORD_GET_DEPS_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 Performing read [CommandStore[2]:1] node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 Performing read [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; post [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; post [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; pre [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; pre [ReadStage:1] node3 2023-04-03 12:54:30,351 reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 Performing coordinated write [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node2 2023-04-03 12:54:30,360 send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 Performing read [CommandStore[2]:1] node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 Performing read [CommandStore[2]:1] node3 2023-04-03 12:54:30,364 CS:[2] OP:0x92e94460 Performing write [MutationStage:4] node3 2023-04-03 12:54:30,364 Performing write: pre [MutationStage:4] node3 2023-04-03 12:54:30,365 Performing write: post [ReadStage:1] node3 2023-04-03 12:54:30,365 Performing read; post [ReadStage:1] node3 2023-04-03 12:54:30,365 Performing read; pre [ReadStage:1] node3 2023-04-03 12:54:30,369 Performing read; post [ReadStage:1] node3 2023-04-03 12:54:30,369 Performing read; pre [ReadStage:1] node3 2023-04-03 12:54:30,369 reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP)) [CommandStore[2]:1] node1 2023-04-03 12:54:30,370 CS:[2] OP:0xa59dc286 Performing write [CommandStore[2]:1] node2 2023-04-03 12:54:30,374 CS:[2] OP:0xab0f3ca4 Performing write [MutationStage:1] node2 2023-04-03 12:54:30,374 Performing write: pre [MutationStage:1] node2 2023-04-03 12:54:30,375 Performing write: post [CommandStore[2]:1] node3 2023-04-03 12:54:30,376 Performing coordinated write [MutationStage:3] node1 2023-04-03 12:54:30,376 Performing write: pre [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.1:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.2:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) [CommandStore[2]:1] node3 2023-04-03 12:54:30,382 reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_APPLY_RSP)) [MutationStage:3] node1 2023-04-03 12:54:30,382 Performing write: post
(The transaction has a returning select and an auto-read, which is why there are double logs for reads)
Here we see the following timing
T00 node3 starts txn T01 node3 sends COMMIT T02 node2 starts recover T03 all nodes ack to the pending recover T04 node2 sends COMMIT T05 node3 performs reads needed for txn T06 node3 sends read results to node2 T07 node2 performs write locally and send APPLY T08 node3 performs write T09 node3 performs reads needed for txn T10 node3 send reads to node3 T11 node3 performs write and sends APPLY T12 node3 ACKs APPLY to node2
Given the fact the simulator got a response back and didn’t get a preempt, this implies that the original coordinator was able to complete the full transaction without issues and reply back, but the reads/writes were interleaved between node3 and node2 causing the second write to observe the first write