[CASSANDRA-12860] Nodetool repair fragile: cannot properly recover from single node failure. Has to restart all nodes in order to repair again - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Urgent
Resolution: Duplicate
Fix Version/s: None
Component/s: None
Labels: None
Environment: CentOS 6.7, Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode), Cassandra 3.5.0, fresh install
Severity: Critical

Description

Summary of symptom:

The setup is a multi-region cluster in AWS (5 regions). Each region has at least 4 hosts, with a replication factor (RF) of half the number of nodes in that region, using vnodes (256).

How to reproduce:

On node A, start this repair job (again we are running fresh 3.5.0):

nohup sudo nodetool repair -j 2 -pr -full myks > /tmp/repair.log 2>&1 &

The job starts fine, reporting progress like:

[2016-10-28 22:37:52,692] Starting repair command #1, repairing keyspace myks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 2, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256)
[2016-10-28 22:38:35,099] Repair session 36f13450-9d5f-11e6-8bf7-a9f47ff986a9 for range [(4029874034937227774,4033949979656106020]] finished (progress: 1%)
[2016-10-28 22:38:38,769] Repair session 36f30910-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-2395606719402271267,-2394525508513518837]] finished (progress: 1%)
[2016-10-28 22:38:48,521] Repair session 36f3f370-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-5223108861718702793,-5221117649630514419]] finished (progress: 2%)

Then manually shut down another node (node B) in the same region (I haven't tried with a node in another region yet, but expect the same behavior from past experience).

Shortly after that, this message appears in the job log (as well as in system.log) on node A:

[2016-10-28 22:41:46,268] Repair session 37088ce1-9d5f-11e6-8bf7-a9f47ff986a9 for range [(-928974038666914990,-927967994563261540]] failed with error Endpoint /node_B_ip died (progress: 51%)

From this point on, the repair job seems to hang:
- no further messages in the job log
- no related messages in system.log
- CPU usage stays low (low single-digit percent of 1 CPU)
After one hour, manually kill the repair jobs (found using "ps -eaf | grep repair").
Restart C* on node A:
- Verified the system is up with no error messages in system.log
- Also verified that there are no error messages from node B
After node A settles down (e.g. no new messages in system.log), restart the same repair job:
```
nohup sudo nodetool repair -j 2 -pr -full myks > /tmp/repair.log 2>&1 &
```

The job fails pretty quickly, this time reporting errors from more nodes, B and K:

 <production>[ywu@cass-tm-1b-012.apse1.mashery.com ~]$ tail -f /tmp/repair.log 
[2016-10-28 22:49:52,965] Starting repair command #1, repairing keyspace myks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 2, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256)
[2016-10-28 22:50:15,839] Repair session e4180720-9d60-11e6-b2f9-cb9524b3c536 for range [(4029874034937227774,4033949979656106020]] failed with error [repair #e4180720-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(4029874034937227774,4033949979656106020]]] Validation failed in /node_K_ip (progress: 1%)
[2016-10-28 22:50:17,158] Repair session e419dbe0-9d60-11e6-b2f9-cb9524b3c536 for range [(-2395606719402271267,-2394525508513518837]] failed with error [repair #e419dbe0-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(-2395606719402271267,-2394525508513518837]]] Validation failed in /node_B_ip (progress: 1%)
[2016-10-28 22:50:18,256] Repair session e41b1460-9d60-11e6-b2f9-cb9524b3c536 for range [(-5223108861718702793,-5221117649630514419]] failed with error [repair #e41b1460-9d60-11e6-b2f9-cb9524b3c536 on myks/rtable, [(-5223108861718702793,-5221117649630514419]]] Validation failed in /node_B_ip (progress: 2%)
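
To see at a glance which endpoints a failed run is blaming, the repair log can be tallied. The following is only a minimal sketch: the function name `failed_endpoints` is hypothetical, and it assumes the log lines look exactly like the two failure phrasings quoted in this report ("Endpoint /... died" and "Validation failed in /...").

```shell
#!/bin/sh
# Hypothetical helper: count how often each endpoint is blamed in a
# nodetool repair log. Assumes the two failure phrasings seen in this report:
#   "... failed with error Endpoint /<addr> died ..."
#   "... Validation failed in /<addr> ..."
failed_endpoints() {
    # $1: path to the repair log (e.g. /tmp/repair.log)
    grep 'failed with error' "$1" \
        | sed -n -e 's/.*Endpoint \/\([^ ]*\) died.*/\1/p' \
                 -e 's/.*Validation failed in \/\([^ ]*\).*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Calling `failed_endpoints /tmp/repair.log` prints one line per endpoint, most frequently blamed first, which makes it easier to tell whether the failures are confined to the node that was shut down or have spread to others.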

On those nodes (B and K), the logs show similar errors:

ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,307 CompactionManager.java:1320 - Cannot start multiple repair sessions over the same sstables
ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,307 Validator.java:261 - Failed creating a merkle tree for [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02 on myks/atable, [(4029874034937227774,4033949979656106020]]], /52.220.127.190 (see log for details)
INFO  [AntiEntropyStage:1] 2016-10-28 22:58:45,307 Validator.java:274 - [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02] Sending completed merkle tree to /52.220.127.190 for myks.xtable
ERROR [ValidationExecutor:5] 2016-10-28 22:58:45,308 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:5,1,main]
java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1321) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1211) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:81) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at org.apache.cassandra.db.compaction.CompactionManager$11.call(CompactionManager.java:841) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
INFO  [AntiEntropyStage:1] 2016-10-28 22:58:45,318 Validator.java:274 - [repair #14378ec0-9d62-11e6-ab75-cd4d64a01b02] Sending completed merkle tree to /52.220.127.190 for myks.ytable

At this point, we are back where we started: I killed the repair job on node A, then restarted C* on BOTH nodes A and K, but still see the same exceptions, except that sometimes they appear on other servers all over the ring.

Business impact: I am in the process of launching a Cassandra-based production system, but I have to hold back now because of how fragile repair is. I am told by many sources that I have to rely on periodic repair jobs to fix data inconsistencies.
The only workaround was a rolling restart of the Cassandra server on ALL nodes in the entire cluster.
- After that, the repair job can proceed without any error
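
For illustration only, the rolling-restart workaround could be planned along these lines. This is a sketch under assumptions that are not part of the report: the nodes are reachable over ssh, Cassandra runs as a service named `cassandra`, and the function name `rolling_restart_plan` is hypothetical. It only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical sketch: print a rolling-restart plan, one node at a time.
# Assumptions (not from the report): ssh access to every node and a
# service named "cassandra" managed with "service".
rolling_restart_plan() {
    for host in "$@"; do
        # Drain first so the node flushes memtables and stops accepting writes.
        echo "ssh $host nodetool drain"
        echo "ssh $host sudo service cassandra restart"
        # Move on only once the node is back (UN = Up/Normal in nodetool status).
        echo "# wait until 'nodetool status' shows $host as UN before continuing"
    done
}
```

For example, `rolling_restart_plan node_A node_B node_K` prints the drain and restart commands for each node in order, with a reminder to wait for each node to come back before touching the next one.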

Attachments

Issue Links

duplicates: CASSANDRA-11824 If repair fails no way to run repair again (Resolved)

is a clone of: CASSANDRA-10519 RepairException: [repair #... on .../..., (...,...]] Validation failed in /w.x.y.z (Resolved)

is related to: CASSANDRA-10302 Track repair state for more reliable repair (Open)
