This issue essentially revives CASSANDRA-8287, which was resolved "Later" in 2015. While it was possible in principle at that time for read repair to break row isolation, it couldn't happen in practice: Cassandra always pulled all of a row's columns in response to regular reads, so a read repair could never partially resolve a row. CASSANDRA-10657 changed Cassandra to pull only the requested columns for reads, which made it possible for read repair to break row isolation in practice.
Note also that this is distinct from CASSANDRA-14593 (read repair breaking partition-level isolation): that issue (as we understand it) covers isolation being broken across multiple rows within an update to a partition, while this issue covers isolation being broken across multiple columns within an update to a single row.
This behavior is easy to reproduce under affected versions using ccm:
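A repro can be sketched as follows (this is our own reconstruction; the cluster, keyspace, table, and column names are hypothetical, and VERSION is assumed to be set in the environment, e.g. VERSION=3.11.10):

```shell
set -e

# Three-node cluster with hinted handoff disabled, since hints can mask
# the problem when the steps run in quick succession.
ccm create rr-isolation -v "$VERSION" -n 3
ccm updateconf 'hinted_handoff_enabled: false'
ccm start --wait-for-binary-proto

# RF=3 keyspace; table with a partition key and two value columns.
echo "CREATE KEYSPACE ks WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 3};
      CREATE TABLE ks.t (k int PRIMARY KEY, letter text, number int);" | ccm node1 cqlsh

# Write the initial row ('a', 1) to all three nodes.
echo "CONSISTENCY ALL;
      INSERT INTO ks.t (k, letter, number) VALUES (0, 'a', 1);" | ccm node1 cqlsh

# Replace the row with ('b', 2) in a single update while node3 is down;
# QUORUM with one replica down means both live nodes must ack.
ccm node3 stop
echo "CONSISTENCY QUORUM;
      UPDATE ks.t SET letter = 'b', number = 2 WHERE k = 0;" | ccm node1 cqlsh

# Bring node3 back, take node2 down, and perform a QUORUM read of just
# the letter column; this read-repairs only that column onto node3.
ccm node3 start --wait-for-binary-proto
ccm node2 stop
echo "CONSISTENCY QUORUM;
      SELECT letter FROM ks.t WHERE k = 0;" | ccm node1 cqlsh

# Stop the other node that observed the update, then read the whole row
# at CL=ONE from the node that missed it. A row-isolation-preserving
# result is ('b', 2); affected versions return ('b', 1).
ccm node1 stop
echo "CONSISTENCY ONE;
      SELECT k, letter, number FROM ks.t WHERE k = 0;" | ccm node3 cqlsh
```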
This snippet creates a three-node cluster with an RF=3 keyspace containing a table with three columns: a partition key and two value columns. (Hinted handoff can mask the problem if the repro steps are executed in quick succession, so the snippet disables it for this exercise.) Then:
- It adds a full row to the table with values ('a', 1), ensuring it's replicated to all three nodes.
- It stops a node, then replaces the initial row with new values ('b', 2) in a single update, ensuring that it's replicated to both available nodes.
- It starts the node that was down, then stops one of the other nodes and performs a quorum read just for the letter column. The read observes 'b'.
- Finally, it stops the other node that observed the second update, then performs a CL=ONE read of the entire row on the node that was down for that update.
If read repair respects row isolation, then the final read should observe ('b', 2). (('a', 1) is also acceptable if we're willing to sacrifice monotonicity.)
- With VERSION=3.0.24, the final read observes ('b', 2), as expected.
- With VERSION=3.11.10 and VERSION=4.0-rc1, the final read instead observes ('b', 1). The same is true for 3.0.24 if CASSANDRA-10657 is backported to it.
The scenario above is somewhat contrived in that it supposes multiple read workflows consulting different sets of columns with different consistency levels. Under 3.11, asynchronous read repair makes this scenario possible even using just CL=ONE – and with speculative retry, even if read_repair_chance/dclocal_read_repair_chance are both zeroed. We haven't looked closely at 4.0, but even though (as we understand it) it lacks async read repair, scenarios like CL=ONE writes or failed, partially-committed CL>ONE writes create some surface area for this behavior, even without mixed consistency/column reads.
Given the importance of paging to reads from wide partitions, it makes some intuitive sense that applications shouldn't rely on isolation at the partition level. Being unable to rely on row isolation is much more surprising, especially given that (modulo the possibility of other atomicity bugs) Cassandra did preserve it before 3.11. Cassandra should either find a solution for this in code (e.g., when performing a read repair, always operate over all of the columns for the table, regardless of what was originally requested for a read) or at least update its documentation to include appropriate caveats about update isolation.
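Pending a fix in code, applications can approximate the column-widening mitigation from the client side by always selecting every column on reads that may trigger repair, so that any resulting repair mutation covers the whole row. A sketch, assuming a hypothetical table ks.t(k, letter, number) like the one in the repro described above:

```shell
# Workaround sketch (hypothetical table ks.t): select every column rather
# than a subset, so a triggered read repair resolves the full row at once.
echo "CONSISTENCY QUORUM;
      SELECT k, letter, number FROM ks.t WHERE k = 0;" | ccm node1 cqlsh
```

This only protects reads that follow this convention; any single-column read elsewhere in the application can still leave a replica with a partially repaired row.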