  Cassandra / CASSANDRA-11349

MerkleTree mismatch when multiple range tombstones exists for the same partition and interval


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.1.16, 2.2.8
    • Severity: Normal

    Description

      We observed that, on some of our clusters, repair streamed a lot of data and reported many partitions as "out of sync".
      Moreover, the read repair mismatch ratio on those clusters is around 3%, which is really high.

      After investigation, it appears that if two range tombstones exist in a partition for the same range/interval, both are included in the merkle tree computation.
      But if, for some reason, those two range tombstones were already compacted into a single range tombstone on another node, the result is a merkle tree difference.
      This is clearly bad, because merkle tree differences then depend on compaction state: if a partition is deleted and recreated multiple times, the only way to ensure that repair works correctly and does not overstream data is to major compact before each repair, which is not really feasible.
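      To illustrate the effect, here is a minimal, self-contained sketch (plain Java, not Cassandra code; the RangeTombstone record and the digest layout are simplified assumptions): hashing two un-compacted markers for the same interval produces a different digest than hashing the single marker left after compaction, even though both nodes hold the same logical data.

      import java.nio.charset.StandardCharsets;
      import java.security.MessageDigest;
      import java.util.Arrays;
      import java.util.List;

      // Simplified sketch: each marker is fed to the per-partition digest individually,
      // so the digest depends on how many markers survived compaction.
      public class TombstoneDigestSketch {

          // Hypothetical stand-in for a range tombstone: interval bounds + deletion timestamp.
          record RangeTombstone(String start, String end, long markedForDeleteAt) {}

          static byte[] digest(List<RangeTombstone> tombstones) throws Exception {
              MessageDigest md = MessageDigest.getInstance("MD5");
              for (RangeTombstone rt : tombstones) {
                  md.update(rt.start().getBytes(StandardCharsets.UTF_8));
                  md.update(rt.end().getBytes(StandardCharsets.UTF_8));
                  md.update(Long.toString(rt.markedForDeleteAt()).getBytes(StandardCharsets.UTF_8));
              }
              return md.digest();
          }

          public static void main(String[] args) throws Exception {
              // Node 1: two markers for the same interval (two DELETEs, not yet compacted).
              List<RangeTombstone> node1 = List.of(new RangeTombstone("b", "b", 10L),
                                                   new RangeTombstone("b", "b", 20L));
              // Node 2: compaction already collapsed them into the single, newest marker.
              List<RangeTombstone> node2 = List.of(new RangeTombstone("b", "b", 20L));

              // Same logical content, different validation digests -> MerkleTree mismatch.
              System.out.println(Arrays.equals(digest(node1), digest(node2))); // prints: false
          }
      }

      In other words, the validation digest currently depends on compaction state rather than on logical content.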

      Below are the steps to easily reproduce this case:

      ccm create test -v 2.1.13 -n 2 -s
      ccm node1 cqlsh
      CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
      USE test_rt;
      CREATE TABLE IF NOT EXISTS table1 (
          c1 text,
          c2 text,
          c3 float,
          c4 float,
          PRIMARY KEY ((c1), c2)
      );
      INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
      DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
      ctrl+d
      # now flush only one of the two nodes
      ccm node1 flush 
      ccm node1 cqlsh
      USE test_rt;
      INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
      DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
      ctrl+d
      ccm node1 repair
      # now grep the log and observe that some inconsistencies were detected between the nodes (while none should have been)
      ccm node1 showlog | grep "out of sync"
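
      For completeness, the sketch below (again plain Java with hypothetical names; this is only an illustration of the general idea, not the attached patch or a Cassandra API) shows a digest that is insensitive to compaction state: merge markers covering the same interval before they are hashed, so both replicas in the scenario above produce identical digests.

      import java.nio.charset.StandardCharsets;
      import java.security.MessageDigest;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Comparator;
      import java.util.List;

      // Illustrative sketch only: normalize markers before hashing them into the digest.
      public class NormalizedDigestSketch {

          record RangeTombstone(String start, String end, long markedForDeleteAt) {}

          // Keep only the newest marker per identical interval (real merging would also
          // have to handle partially overlapping ranges).
          static List<RangeTombstone> normalize(List<RangeTombstone> in) {
              List<RangeTombstone> sorted = new ArrayList<>(in);
              sorted.sort(Comparator.comparing(RangeTombstone::start)
                                    .thenComparing(RangeTombstone::end)
                                    .thenComparingLong(RangeTombstone::markedForDeleteAt));
              List<RangeTombstone> out = new ArrayList<>();
              for (RangeTombstone rt : sorted) {
                  int last = out.size() - 1;
                  if (last >= 0 && out.get(last).start().equals(rt.start()) && out.get(last).end().equals(rt.end()))
                      out.set(last, rt); // same interval: the newest timestamp wins
                  else
                      out.add(rt);
              }
              return out;
          }

          static byte[] digest(List<RangeTombstone> tombstones) throws Exception {
              MessageDigest md = MessageDigest.getInstance("MD5");
              for (RangeTombstone rt : normalize(tombstones)) {
                  md.update(rt.start().getBytes(StandardCharsets.UTF_8));
                  md.update(rt.end().getBytes(StandardCharsets.UTF_8));
                  md.update(Long.toString(rt.markedForDeleteAt()).getBytes(StandardCharsets.UTF_8));
              }
              return md.digest();
          }

          public static void main(String[] args) throws Exception {
              List<RangeTombstone> unCompacted = List.of(new RangeTombstone("b", "b", 10L),
                                                         new RangeTombstone("b", "b", 20L));
              List<RangeTombstone> compacted = List.of(new RangeTombstone("b", "b", 20L));
              // After normalization both replicas digest the same logical content.
              System.out.println(Arrays.equals(digest(unCompacted), digest(compacted))); // prints: true
          }
      }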
      

      The consequences of this behaviour are a costly repair, the accumulation of many small SSTables (up to thousands within a rather short period of time when using vnodes, until compaction absorbs those small files), and an increased size on disk.

      Attachments

        1. 11349-2.2-v4.patch
          11 kB
          Stefan Podkowinski
        2. 11349-2.1-v4.patch
          11 kB
          Stefan Podkowinski
        3. 11349-2.1-v3.patch
          14 kB
          Fabien Rousseau
        4. 11349-2.1-v2.patch
          14 kB
          Fabien Rousseau
        5. 11349-2.1.patch
          2 kB
          Stefan Podkowinski


            People

              Assignee: Branimir Lambov (blambov)
              Reporter: Fabien Rousseau (frousseau)
              Votes: 3
              Watchers: 17
