  Cassandra / CASSANDRA-11349

MerkleTree mismatch when multiple range tombstones exists for the same partition and interval


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.1.16, 2.2.8
    • Severity: Normal

    Description

      We observed that, on some of our clusters, repair streamed a lot of data and reported many partitions as "out of sync".
      Moreover, the read repair mismatch ratio on those clusters is around 3%, which is really high.

      After investigation, it appears that if two range tombstones exist in a partition for the same range/interval, both are included in the merkle tree computation.
      But if, for some reason, those two range tombstones were already compacted into a single range tombstone on another node, the result is a merkle tree difference.
      This is clearly bad, because merkle tree differences then depend on compaction state: if a partition is deleted and recreated multiple times, the only way to ensure that repair works correctly and does not overstream data is to major compact before each repair, which is not really feasible.
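      To illustrate the effect, here is a minimal, self-contained sketch (plain Java, not Cassandra code; the RangeTombstone record and the digest layout are simplified assumptions): hashing two un-compacted markers for the same interval produces a different digest than hashing the single marker left after compaction, even though both nodes hold the same logical data.

      import java.nio.charset.StandardCharsets;
      import java.security.MessageDigest;
      import java.util.Arrays;
      import java.util.List;

      // Simplified sketch: each marker is fed to the per-partition digest individually,
      // so the digest depends on how many markers survived compaction.
      public class TombstoneDigestSketch {

          // Hypothetical stand-in for a range tombstone: interval bounds + deletion timestamp.
          record RangeTombstone(String start, String end, long markedForDeleteAt) {}

          static byte[] digest(List<RangeTombstone> tombstones) throws Exception {
              MessageDigest md = MessageDigest.getInstance("MD5");
              for (RangeTombstone rt : tombstones) {
                  md.update(rt.start().getBytes(StandardCharsets.UTF_8));
                  md.update(rt.end().getBytes(StandardCharsets.UTF_8));
                  md.update(Long.toString(rt.markedForDeleteAt()).getBytes(StandardCharsets.UTF_8));
              }
              return md.digest();
          }

          public static void main(String[] args) throws Exception {
              // Node 1: two markers for the same interval (two DELETEs, not yet compacted).
              List<RangeTombstone> node1 = List.of(new RangeTombstone("b", "b", 10L),
                                                   new RangeTombstone("b", "b", 20L));
              // Node 2: compaction already collapsed them into the single, newest marker.
              List<RangeTombstone> node2 = List.of(new RangeTombstone("b", "b", 20L));

              // Same logical content, different validation digests -> MerkleTree mismatch.
              System.out.println(Arrays.equals(digest(node1), digest(node2))); // prints: false
          }
      }

      In other words, the validation digest currently depends on compaction state rather than on logical content.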

      Below are the steps to easily reproduce this case:

      ccm create test -v 2.1.13 -n 2 -s
      ccm node1 cqlsh
      CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
      USE test_rt;
      CREATE TABLE IF NOT EXISTS table1 (
          c1 text,
          c2 text,
          c3 float,
          c4 float,
          PRIMARY KEY ((c1), c2)
      );
      INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
      DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
      ctrl+d
      # now flush only one of the two nodes
      ccm node1 flush 
      ccm node1 cqlsh
      USE test_rt;
      INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
      DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
      ctrl+d
      ccm node1 repair
      # now grep the log and observe that some inconsistencies were detected between the nodes (while none should have been)
      ccm node1 showlog | grep "out of sync"
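
      For completeness, the sketch below (again plain Java with hypothetical names; this is only an illustration of the general idea, not the attached patch or a Cassandra API) shows a digest that is insensitive to compaction state: merge markers covering the same interval before they are hashed, so both replicas in the scenario above produce identical digests.

      import java.nio.charset.StandardCharsets;
      import java.security.MessageDigest;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Comparator;
      import java.util.List;

      // Illustrative sketch only: normalize markers before hashing them into the digest.
      public class NormalizedDigestSketch {

          record RangeTombstone(String start, String end, long markedForDeleteAt) {}

          // Keep only the newest marker per identical interval (real merging would also
          // have to handle partially overlapping ranges).
          static List<RangeTombstone> normalize(List<RangeTombstone> in) {
              List<RangeTombstone> sorted = new ArrayList<>(in);
              sorted.sort(Comparator.comparing(RangeTombstone::start)
                                    .thenComparing(RangeTombstone::end)
                                    .thenComparingLong(RangeTombstone::markedForDeleteAt));
              List<RangeTombstone> out = new ArrayList<>();
              for (RangeTombstone rt : sorted) {
                  int last = out.size() - 1;
                  if (last >= 0 && out.get(last).start().equals(rt.start()) && out.get(last).end().equals(rt.end()))
                      out.set(last, rt); // same interval: the newest timestamp wins
                  else
                      out.add(rt);
              }
              return out;
          }

          static byte[] digest(List<RangeTombstone> tombstones) throws Exception {
              MessageDigest md = MessageDigest.getInstance("MD5");
              for (RangeTombstone rt : normalize(tombstones)) {
                  md.update(rt.start().getBytes(StandardCharsets.UTF_8));
                  md.update(rt.end().getBytes(StandardCharsets.UTF_8));
                  md.update(Long.toString(rt.markedForDeleteAt()).getBytes(StandardCharsets.UTF_8));
              }
              return md.digest();
          }

          public static void main(String[] args) throws Exception {
              List<RangeTombstone> unCompacted = List.of(new RangeTombstone("b", "b", 10L),
                                                         new RangeTombstone("b", "b", 20L));
              List<RangeTombstone> compacted = List.of(new RangeTombstone("b", "b", 20L));
              // After normalization both replicas digest the same logical content.
              System.out.println(Arrays.equals(digest(unCompacted), digest(compacted))); // prints: true
          }
      }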
      

      The consequences of this behaviour are a costly repair, the accumulation of many small SSTables (up to thousands within a rather short period of time when using vnodes, until compaction absorbs those small files), and an increased size on disk.

      Attachments

        1. 11349-2.2-v4.patch
          11 kB
          Stefan Podkowinski
        2. 11349-2.1-v4.patch
          11 kB
          Stefan Podkowinski
        3. 11349-2.1-v3.patch
          14 kB
          Fabien Rousseau
        4. 11349-2.1-v2.patch
          14 kB
          Fabien Rousseau
        5. 11349-2.1.patch
          2 kB
          Stefan Podkowinski


            People

              Assignee: Branimir Lambov (blambov)
              Reporter: Fabien Rousseau (frousseau)
              Votes: 3
              Watchers: 17
