This issue essentially revives CASSANDRA-8287, which was resolved "Later" in 2015. While it was possible in principle at that time for read repair to break row isolation, it couldn't happen in practice: Cassandra always pulled all of a row's columns in response to regular reads, so a read repair could never partially resolve a row. CASSANDRA-10657 changed Cassandra to pull only the requested columns for reads, which made it possible for read repair to break row isolation in practice.
Note also that this is distinct from CASSANDRA-14593 (read repair breaking partition-level isolation): that issue (as we understand it) covers isolation being broken across multiple rows within an update to a partition, while this issue covers isolation being broken across multiple columns within an update to a single row.
This behavior is easy to reproduce under affected versions using ccm:
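A repro can be sketched as follows (this is our own reconstruction; the cluster, keyspace, table, and column names are hypothetical, and VERSION is assumed to be set in the environment, e.g. VERSION=3.11.10):

```shell
set -e

# Three-node cluster with hinted handoff disabled, since hints can mask
# the problem when the steps run in quick succession.
ccm create rr-isolation -v "$VERSION" -n 3
ccm updateconf 'hinted_handoff_enabled: false'
ccm start --wait-for-binary-proto

# RF=3 keyspace; table with a partition key and two value columns.
echo "CREATE KEYSPACE ks WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 3};
      CREATE TABLE ks.t (k int PRIMARY KEY, letter text, number int);" | ccm node1 cqlsh

# Write the initial row ('a', 1) to all three nodes.
echo "CONSISTENCY ALL;
      INSERT INTO ks.t (k, letter, number) VALUES (0, 'a', 1);" | ccm node1 cqlsh

# Replace the row with ('b', 2) in a single update while node3 is down;
# QUORUM with one replica down means both live nodes must ack.
ccm node3 stop
echo "CONSISTENCY QUORUM;
      UPDATE ks.t SET letter = 'b', number = 2 WHERE k = 0;" | ccm node1 cqlsh

# Bring node3 back, take node2 down, and perform a QUORUM read of just
# the letter column; this read-repairs only that column onto node3.
ccm node3 start --wait-for-binary-proto
ccm node2 stop
echo "CONSISTENCY QUORUM;
      SELECT letter FROM ks.t WHERE k = 0;" | ccm node1 cqlsh

# Stop the other node that observed the update, then read the whole row
# at CL=ONE from the node that missed it. A row-isolation-preserving
# result is ('b', 2); affected versions return ('b', 1).
ccm node1 stop
echo "CONSISTENCY ONE;
      SELECT k, letter, number FROM ks.t WHERE k = 0;" | ccm node3 cqlsh
```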
This snippet creates a three-node cluster with an RF=3 keyspace containing a table with three columns: a partition key and two value columns. (Hinted handoff can mask the problem if the repro steps are executed in quick succession, so the snippet disables it for this exercise.) Then:
- It adds a full row to the table with values ('a', 1), ensuring it's replicated to all three nodes.
- It stops a node, then replaces the initial row with new values ('b', 2) in a single update, ensuring that it's replicated to both available nodes.
- It starts the node that was down, then stops one of the other nodes and performs a quorum read just for the letter column. The read observes 'b'.
- Finally, it stops the other node that observed the second update, then performs a CL=ONE read of the entire row on the node that was down for that update.
If read repair respects row isolation, then the final read should observe ('b', 2). (('a', 1) is also acceptable if we're willing to sacrifice monotonicity.)
- With VERSION=3.0.24, the final read observes ('b', 2), as expected.
- With VERSION=3.11.10 and VERSION=4.0-rc1, the final read instead observes ('b', 1). The same is true for 3.0.24 if CASSANDRA-10657 is backported to it.
The scenario above is somewhat contrived in that it supposes multiple read workflows consulting different sets of columns with different consistency levels. Under 3.11, asynchronous read repair makes this scenario possible even using just CL=ONE – and with speculative retry, even if read_repair_chance/dclocal_read_repair_chance are both zeroed. We haven't looked closely at 4.0, but even though (as we understand it) it lacks async read repair, scenarios like CL=ONE writes or failed, partially-committed CL>ONE writes create some surface area for this behavior, even without mixed consistency/column reads.
Given the importance of paging to reads from wide partitions, it makes some intuitive sense that applications shouldn't rely on isolation at the partition level. Being unable to rely on row isolation is much more surprising, especially given that (modulo the possibility of other atomicity bugs) Cassandra did preserve it before 3.11. Cassandra should either find a solution for this in code (e.g., when performing a read repair, always operate over all of the columns for the table, regardless of what was originally requested for a read) or at least update its documentation to include appropriate caveats about update isolation.
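Pending a fix in code, applications can approximate the column-widening mitigation from the client side by always selecting every column on reads that may trigger repair, so that any resulting repair mutation covers the whole row. A sketch, assuming a hypothetical table ks.t(k, letter, number) like the one in the repro described above:

```shell
# Workaround sketch (hypothetical table ks.t): select every column rather
# than a subset, so a triggered read repair resolves the full row at once.
echo "CONSISTENCY QUORUM;
      SELECT k, letter, number FROM ks.t WHERE k = 0;" | ccm node1 cqlsh
```

This only protects reads that follow this convention; any single-column read elsewhere in the application can still leave a replica with a partially repaired row.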