Description
On some of our clusters, we observed that repair streamed a lot of data and reported many partitions as "out of sync".
Moreover, the read repair mismatch ratio on those clusters is around 3%, which is very high.
After investigation, it appears that if two range tombstones exist for a partition over the same range/interval, both are included in the Merkle tree computation.
But if, on another node, the two range tombstones have already been compacted into a single range tombstone, the Merkle trees differ.
This is clearly bad because Merkle tree differences then depend on compaction state: if a partition is deleted and re-created multiple times, the only way to ensure that repair works correctly and does not overstream data is to run a major compaction before each repair, which is not really feasible.
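The mismatch described above can be illustrated with a toy model. This is only a sketch and not Cassandra's validation code: the `(start, end, timestamp)` tuples, the `digest` and `compact` helpers, and the use of SHA-256 are all assumptions made for illustration. It shows that hashing every tombstone as stored gives a different result on a node that has not compacted yet, while hashing only the newest tombstone per range gives the same result on both nodes.

```python
import hashlib

def digest(tombstones):
    """Hash every range tombstone as stored, in sorted order (toy model only)."""
    h = hashlib.sha256()
    for start, end, ts in sorted(tombstones):
        h.update(f"{start}|{end}|{ts}".encode())
    return h.hexdigest()

def compact(tombstones):
    """Keep only the newest tombstone for each identical (start, end) range."""
    newest = {}
    for start, end, ts in tombstones:
        key = (start, end)
        if key not in newest or ts > newest[key]:
            newest[key] = ts
    return [(s, e, ts) for (s, e), ts in newest.items()]

# Node A: two tombstones over the same range, not yet compacted together.
node_a = [("b", "b", 100), ("b", "b", 200)]
# Node B: the same two deletes, already compacted into a single tombstone.
node_b = compact(node_a)

print(digest(node_a) == digest(node_b))           # False: digest depends on compaction state
print(digest(compact(node_a)) == digest(node_b))  # True: only the newest tombstone is hashed
```

Both nodes hold the same logical data in this model; only the on-disk representation differs, which is why a digest that depends on that representation reports a spurious difference.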
The following steps reproduce the issue:
ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 float,
    c4 float,
    PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
# exit cqlsh with Ctrl-D
# now flush only one of the two nodes
ccm node1 flush
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
# exit cqlsh with Ctrl-D
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected between nodes (while none should have been)
ccm node1 showlog | grep "out of sync"
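To compare runs, the final grep can be replaced by a count. The helper below is a minimal sketch: `count_out_of_sync` is a hypothetical name, and it only assumes that each reported inconsistency appears on a log line containing the text "out of sync", as the grep above does.

```python
def count_out_of_sync(lines):
    """Count log lines that mention "out of sync" (same match as the grep above)."""
    return sum(1 for line in lines if "out of sync" in line)
```

It accepts any iterable of lines, for example an open file object holding the saved output of `ccm node1 showlog`. In this reproduction the expected count is 0, since both nodes hold the same logical data.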
The consequences are a costly repair and an accumulation of many small SSTables (up to thousands within a short period when using vnodes, until compaction absorbs those small files), as well as increased disk usage.
Issue Links
- contains: CASSANDRA-11477 MerkleTree mismatch when a cell is shadowed by a range tombstone in different SSTables (Resolved)
- relates to: CASSANDRA-11477 MerkleTree mismatch when a cell is shadowed by a range tombstone in different SSTables (Resolved)