CASSANDRA-16710: Read repairs can break row isolation


Details

    • Bug Category: Correctness
    • Severity: Critical
    • Priority: Normal
    • Discovered By: User Report
    • Platform: All
    • Impacts: None

    Description

      This issue essentially revives CASSANDRA-8287, which was resolved "Later" in 2015. While it was possible in principle at that time for read repair to break row isolation, that couldn't happen in practice because Cassandra always pulled all of the columns for each row in response to regular reads, so read repairs would never partially resolve a row. CASSANDRA-10657 modified Cassandra to only pull the requested columns for reads, which enabled read repair to break row isolation in practice.

      Note also that this is distinct from CASSANDRA-14593 (read repair breaking partition-level isolation): that issue (as we understand it) captures isolation being broken across multiple rows within an update to a partition, while this issue covers broken isolation across multiple columns within an update to a single row.
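
      To make the distinction concrete, here is a sketch using a hypothetical table with a clustering column (rrtest.events is illustrative only and is not used by the repro below):

      -- Hypothetical table: one partition holds multiple rows keyed by ck.
      CREATE TABLE IF NOT EXISTS rrtest.events (
          pk TEXT, ck INT, col1 TEXT, col2 INT,
          PRIMARY KEY (pk, ck));

      -- CASSANDRA-14593 (partition-level isolation): a single-partition batch
      -- writes two *rows*; a reader may observe one row updated but not the other.
      BEGIN BATCH
          INSERT INTO rrtest.events (pk, ck, col1, col2) VALUES ('p', 1, 'b', 2);
          INSERT INTO rrtest.events (pk, ck, col1, col2) VALUES ('p', 2, 'b', 2);
      APPLY BATCH;

      -- This issue (row-level isolation): a single INSERT writes two *columns* of
      -- one row; after a partial read repair, a reader may observe the new col1
      -- alongside the old col2.
      INSERT INTO rrtest.events (pk, ck, col1, col2) VALUES ('p', 1, 'c', 3);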

      This behavior is easy to reproduce under affected versions using ccm:

      # Create a three-node cluster with hinted handoff disabled (hints can
      # mask the problem if the steps below are run in quick succession).
      ccm create -n 3 -v $VERSION rrtest
      ccm updateconf -y 'hinted_handoff_enabled: false
      max_hint_window_in_ms: 0'
      ccm start
      # Step 1: write a full row at CONSISTENCY ALL so every replica has ('a', 1).
      (cat <<EOF
      CREATE KEYSPACE IF NOT EXISTS rrtest WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '3'};
      CREATE TABLE IF NOT EXISTS rrtest.kv (key TEXT PRIMARY KEY, col1 TEXT, col2 INT);
      CONSISTENCY ALL;
      INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'a', 1);
      EOF
      ) | ccm node1 cqlsh
      # Step 2: with node3 down, overwrite the row with ('b', 2) at QUORUM.
      ccm node3 stop
      (cat <<EOF
      CONSISTENCY QUORUM;
      INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'b', 2);
      EOF
      ) | ccm node1 cqlsh
      # Step 3: bring node3 back, stop node2, and read only col1 at QUORUM; on
      # affected versions the resulting read repair sends node3 just col1.
      ccm node3 start
      ccm node2 stop
      (cat <<EOF
      CONSISTENCY QUORUM;
      SELECT key, col1 FROM rrtest.kv WHERE key = 'key';
      EOF
      ) | ccm node1 cqlsh
      # Step 4: stop node1 and read the whole row at CL=ONE from node3 alone.
      ccm node1 stop
      (cat <<EOF
      CONSISTENCY ONE;
      SELECT * FROM rrtest.kv WHERE key = 'key';
      EOF
      ) | ccm node3 cqlsh
      

      This snippet creates a three-node cluster with an RF=3 keyspace containing a table with three columns: a partition key and two value columns. (Hinted handoff can mask the problem if the repro steps are executed in quick succession, so the snippet disables it for this exercise.) Then:

      1. It adds a full row to the table with values ('a', 1), ensuring it's replicated to all three nodes.
      2. It stops a node, then replaces the initial row with new values ('b', 2) in a single update, ensuring that it's replicated to both available nodes.
      3. It starts the node that was down, then stops one of the other nodes and performs a quorum read of only the col1 column. The read observes 'b'.
      4. Finally, it stops the other node that observed the second update, then performs a CL=ONE read of the entire row on the node that was down for that update.

      If read repair respects row isolation, then the final read should observe ('b', 2). (('a', 1) is also acceptable if we're willing to sacrifice monotonicity.)

      • With VERSION=3.0.24, the final read observes ('b', 2), as expected.
      • With VERSION=3.11.10 and VERSION=4.0-rc1, the final read instead observes the torn row ('b', 1), shown below. The same is true for 3.0.24 if CASSANDRA-10657 is backported to it.
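
      For concreteness, the torn row looks roughly like this in the cqlsh output of the final read (formatting approximate):

       key | col1 | col2
      -----+------+------
       key |    b |    1

      (1 rows)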

      The scenario above is somewhat contrived in that it supposes multiple read workflows consulting different sets of columns at different consistency levels. Under 3.11, asynchronous read repair makes this scenario possible even using just CL=ONE, and, with speculative retry, even if read_repair_chance and dclocal_read_repair_chance are both zeroed. We haven't looked closely at 4.0, but even though (as we understand it) it lacks asynchronous read repair, scenarios like CL=ONE writes or failed, partially committed CL>ONE writes still create some surface area for this behavior, even without mixed consistency/column reads.
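
      For reference, a hedged sketch of the 3.11 table options mentioned above (the values are illustrative, not a recommendation; the *_read_repair_chance options were removed in 4.0):

      -- 3.11-era options on the repro table: even with both read-repair
      -- probabilities zeroed, a digest mismatch surfaced by a speculative-retry
      -- read can still trigger a (column-subset) read repair.
      ALTER TABLE rrtest.kv
          WITH read_repair_chance = 0
          AND dclocal_read_repair_chance = 0
          AND speculative_retry = '99PERCENTILE';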

      Given the importance of paging to reads from wide partitions, it makes some intuitive sense that applications shouldn't rely on isolation at the partition level. Being unable to rely on row isolation is much more surprising, especially given that (modulo the possibility of other atomicity bugs) Cassandra did preserve it before 3.11. Cassandra should either find a solution for this in code (e.g., when performing a read repair, always operate over all of the columns for the table, regardless of what was originally requested for a read) or at least update its documentation to include appropriate caveats about update isolation.
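
      As a rough query-level illustration of the suggested code-level fix (not a general application workaround): if step 3 of the repro requests every column, the repair written back to the recovering node covers the whole row, and the final read can no longer observe the torn ('b', 1) state.

      # Variant of step 3 above: read all columns at QUORUM, so any read repair
      # sent to node3 carries both col1 and col2.
      (cat <<EOF
      CONSISTENCY QUORUM;
      SELECT key, col1, col2 FROM rrtest.kv WHERE key = 'key';
      EOF
      ) | ccm node1 cqlsh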

          People

            Assignee: Benjamin Lerer
            Reporter: Samuel Klock
