CASSANDRA-15789

Rows can get duplicated in mixed major-version clusters and after full upgrade

      Description

      In a mixed 2.X/3.X major-version cluster, a sequence of row deletes, collection overwrites, paging, and read repair can cause 3.X nodes to split individual rows into several rows with identical clustering. This happens due to 2.X paging and range tombstone (RT) semantics, combined with a deficiency in 3.X LegacyLayout.

      To reproduce, set up a 2-node mixed major version cluster with the following table:

      CREATE TABLE distributed_test_keyspace.tbl (
          pk int,
          ck int,
          v map<text, text>,
          PRIMARY KEY (pk, ck)
      );
      

      1. Using either node as the coordinator, delete the row with ck=2 using timestamp 1

      DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
      

      2. Using either node as the coordinator, insert the following 3 rows:

      INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
      INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
      INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
      

      3. Flush the table on both nodes

      4. Using the 2.2 node as the coordinator, force read repair by querying the table with page size = 2:

      SELECT * FROM tbl;
      

      5. Overwrite the row with ck=2 using timestamp 5:

      INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;
      

      6. Query the 3.0 node and observe the split row:

      cqlsh> select * from distributed_test_keyspace.tbl ;
      
       pk | ck | v
      ----+----+------------
        1 |  1 | {'e': 'f'}
        1 |  2 | {'g': 'h'}
        1 |  2 | {'k': 'l'}
        1 |  3 | {'i': 'j'}
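
A split row shows up in a result set as the same (pk, ck) pair appearing more than once. The sketch below is one way to flag that condition in a test harness. The helper name `duplicated_clusterings` and the tuple layout are illustrative assumptions, not part of Cassandra or any driver; the sample data is copied from the cqlsh output above.

```python
from collections import Counter

def duplicated_clusterings(rows):
    """Return the (pk, ck) pairs that occur more than once, in first-seen order.

    `rows` is an iterable of (pk, ck, v) tuples (hypothetical layout).
    """
    counts = Counter((pk, ck) for pk, ck, _ in rows)
    return [key for key, n in counts.items() if n > 1]

# Rows as returned by the 3.0 node in the cqlsh output above.
observed = [
    (1, 1, {"e": "f"}),
    (1, 2, {"g": "h"}),
    (1, 2, {"k": "l"}),
    (1, 3, {"i": "j"}),
]

print(duplicated_clusterings(observed))  # [(1, 2)]
```

In a healthy partition the function returns an empty list, since a primary key identifies exactly one row.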
      

      This happens because the read to query the second page ends up generating the following mutation for the 3.0 node:

      ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
                   ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
                          [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
                          [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
                   [2:v:63:false:1@3,])
      

      On the 3.0 side, this mutation is incorrectly deserialized as:

      Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
        [distributed_test_keyspace.tbl] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
          Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
          Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
          Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
      ])
      

      LegacyLayout correctly interprets a range tombstone whose start and finish collectionName values don't match as a wrapping fragment of a legacy row deletion that is being interrupted by a collection deletion (see code). Quoting the comment inline:

      // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
      // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
      // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
      // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
      // then B, then a third one from the end of B to the end of A. To make this simpler, if
      // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
      // of a row (which is still correct).
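
The split described in the comment can be illustrated with a toy model. This is a minimal sketch, not the RangeTombstoneList implementation: the class and function names are hypothetical, bounds are plain strings in the `ck:name:eoc` style of the mutation dump above, and the row-start bound `2:_` is an assumption made by analogy with the `2:!` row-end bound that appears in that dump.

```python
from typing import List, NamedTuple, Optional

class RangeTombstone(NamedTuple):
    start: str       # e.g. "2:v:_" (collection start) or "2:_" (assumed row start)
    end: str         # e.g. "2:v:!" (collection end) or "2:!" (row end)
    deleted_at: int

def split_around(outer: RangeTombstone, inner: RangeTombstone) -> List[RangeTombstone]:
    """Split `outer` (tombstone A) around a nested `inner` (tombstone B) into 3 RTs."""
    return [
        RangeTombstone(outer.start, inner.start, outer.deleted_at),
        inner,
        RangeTombstone(inner.end, outer.end, outer.deleted_at),
    ]

def collection_name(bound: str) -> Optional[str]:
    """Return the collection name of a 'ck:name:eoc' bound, or None for 'ck:eoc'."""
    parts = bound.split(":")
    return parts[1] if len(parts) == 3 else None

def is_row_fragment(rt: RangeTombstone) -> bool:
    """True when exactly one bound carries a collection name."""
    return (collection_name(rt.start) is None) != (collection_name(rt.end) is None)

row_deletion = RangeTombstone("2:_", "2:!", deleted_at=1)             # tombstone A
collection_deletion = RangeTombstone("2:v:_", "2:v:!", deleted_at=2)  # tombstone B

fragments = split_around(row_deletion, collection_deletion)
for rt in fragments:
    print(rt.start, rt.end, rt.deleted_at, is_row_fragment(rt))
```

The second and third fragments correspond to the `[2:v:_-2:v:!, deletedAt=2]` and `[2:v:!-2:!, deletedAt=1]` ranges in the mutation above; the first and third are the ones with mismatched collection names.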
      

      The LegacyLayout#addRowTombstone() method then chokes when it encounters such a tombstone in the middle of an existing row: having already seen v[c]=d, it mistakenly starts a new row while still in the middle of the existing one (see code).
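
The failure mode can be reproduced in a toy row builder. The sketch below is not LegacyLayout code and not the actual fix; `build_rows`, the atom tuples, and the `merge_same_clustering` flag are hypothetical names used only to show why starting a new row on every row tombstone yields two rows with the same clustering, and how attaching the deletion to the in-progress row avoids it. The atom order mirrors the deserialized mutation above.

```python
def build_rows(atoms, merge_same_clustering):
    """Group a clustering-ordered stream of (kind, ck, payload) atoms into rows.

    With merge_same_clustering=False, a row tombstone always opens a new row
    (the behaviour described above). With True, a row tombstone whose
    clustering matches the row in progress is attached to that row instead.
    """
    rows = []
    for kind, ck, payload in atoms:
        in_same_row = bool(rows) and rows[-1]["ck"] == ck
        if kind == "row_tombstone":
            if merge_same_clustering and in_same_row:
                rows[-1]["row_deletion"] = payload
            else:
                rows.append({"ck": ck, "row_deletion": payload, "data": []})
        else:
            if not in_same_row:
                rows.append({"ck": ck, "row_deletion": None, "data": []})
            rows[-1]["data"].append((kind, payload))
    return rows

# Atoms in the order implied by the deserialized mutation above.
atoms = [
    ("collection_tombstone", 2, 2),   # del(v), deletedAt=2
    ("cell", 2, "v[c]=d ts=3"),
    ("row_tombstone", 2, 1),          # row deletion, deletedAt=1
    ("collection_tombstone", 3, 2),   # del(v), deletedAt=2
]

split = build_rows(atoms, merge_same_clustering=False)
merged = build_rows(atoms, merge_same_clustering=True)

print([row["ck"] for row in split])   # [2, 2, 3]
print([row["ck"] for row in merged])  # [2, 3]
```

In the merged variant the single ck=2 row carries both the row deletion (deletedAt=1) and the newer collection data, which is what a correct deserialization would be expected to produce.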

            People

            • Assignee:
              marcuse Marcus Eriksson
              Reporter:
              aleksey Aleksey Yeschenko
              Authors:
              Aleksey Yeschenko, Marcus Eriksson, Sam Tunnicliffe
              Reviewers:
              Alex Petrov, Marcus Eriksson, Sylvain Lebresne