Cassandra / CASSANDRA-14685

Incremental repair 4.0: SSTables remain locked forever if the coordinator dies during streaming


Details

    • Type: Bug
    • Status: Open
    • Priority: Urgent
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Consistency/Repair
    • Labels: None
    • Severity: Critical

    Description

      The changes in CASSANDRA-9143 modified the way incremental repair works by applying the following sequence of events:

      • Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges
      • Anticompacted SSTables are then marked as "pending repair" and can no longer be compacted, nor be part of another repair session
      • Merkle trees are generated and compared
      • Streaming takes place if needed
      • Anticompaction is committed: "pending repair" SSTables are marked as repaired if the session succeeded, or released if it failed.

      If the repair coordinator dies during the streaming phase, the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction, even after all the nodes in the cluster are restarted. 
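
      As a side note (not part of the original report), both the anticompaction and the resulting "pending repair" marker can be observed from the command line while a session runs; a rough sketch, assuming the ccm cluster set up in the steps below:

      # Anticompaction shows up as a running compaction on the replicas
      ccm node2 nodetool compactionstats

      # Once anticompaction finishes, the affected SSTables carry a repair session id
      # (the SSTable path is illustrative; adjust it to the actual data directory)
      ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata <sstable>-Data.db | grep -i "pending repair"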

      Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors):

      ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
      
      # Allow jmx access and remove all rpc_ settings in yaml
      for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
      do
        sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' $f
      done
      
      for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
      do
        grep -v "rpc_" $f > ${f}.tmp
        cat ${f}.tmp > $f
      done
      
      ccm start
      

      I used tlp-stress to generate a few tens of MBs of data (killed it after some time); cassandra-stress obviously works as well (a rough equivalent is sketched after this command):

      bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 \
          --replication "{'class':'SimpleStrategy', 'replication_factor':2}" \
          --compaction "{'class': 'SizeTieredCompactionStrategy'}" \
          --host 127.0.0.1
      
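      For reference, a rough cassandra-stress equivalent might look like the following (an untested sketch; cassandra-stress writes to its own keyspace1.standard1 table, so the keyspace used in the later commands would have to change accordingly):

      ~/.ccm/repository/gitCOLONtrunk/tools/bin/cassandra-stress write n=1000000 -rate threads=2 throttle=5000/s \
          -schema "replication(strategy=SimpleStrategy,factor=2)" "compaction(strategy=SizeTieredCompactionStrategy)" \
          -node 127.0.0.1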

      Flush and delete all SSTables in node1:

      ccm node1 nodetool flush
      ccm node1 stop
      rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
      ccm node1 start
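
      Optionally (not part of the original steps), confirm that node1 came back empty for the keyspace:

      ccm node1 nodetool tablestats tlp_stress | grep "Space used (live)"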

      Then throttle streaming throughput to 1MB/s so we have time to take node1 down during the streaming phase and run repair:

      ccm node1 nodetool setstreamthroughput 1
      ccm node2 nodetool setstreamthroughput 1
      ccm node3 nodetool setstreamthroughput 1
      ccm node1 nodetool repair tlp_stress
      
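      From a second terminal (an extra check, not in the original steps), netstats can be polled to tell when the streaming phase has actually begun:

      # Hypothetical helper: wait until node1 reports incoming repair streams
      while ! ccm node1 nodetool netstats | grep -q "Receiving"; do sleep 1; done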

      Once streaming starts, shut down node1 and start it again:

      ccm node1 stop
      ccm node1 start
      

      Run repair again:

      ccm node1 nodetool repair tlp_stress
      

      The command returns very quickly, showing that it skipped all SSTables:

      [2018-08-31 19:05:16,292] Repair completed successfully
      [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
      
      $ ccm node1 nodetool status
      
      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address    Load       Tokens       Owns    Host ID                               Rack
      UN  127.0.0.1  228,64 KiB  256          ?       437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
      UN  127.0.0.2  60,09 MiB  256          ?       fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
      UN  127.0.0.3  57,59 MiB  256          ?       a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
      

      sstablemetadata then shows that nodes 2 and 3 still have SSTables in the "pending repair" state:

      ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
      SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
      Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
      
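      To check every SSTable of that table on nodes 2 and 3 in one pass, a small loop over the data directories works (a sketch, assuming the ccm paths used above):

      for f in ~/.ccm/inc-repair-issue/node{2,3}/data0/tlp_stress/sensor*/*-Data.db;
      do
        echo $f
        ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata $f | grep -i "pending repair"
      done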

      Restarting these nodes doesn't help either.

          People

            Assignee: Jason Brown
            Reporter: Alexander Dejanovski
            Authors: Jason Brown
            Reviewers: Blake Eggleston
