Fix Version/s: None
CentOS 6.7, Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode), Cassandra 3.5.0, fresh install
Summary of symptom:
- Setup is a multi-region cluster in AWS (5 regions). Each region has at least 4 hosts, with RF equal to half the number of nodes in the region, using vnodes (256 tokens)
- How to reproduce:
- On node A, start this repair job (again, we are running a fresh 3.5.0 install):
- The job starts fine, reporting progress such as:
- Then manually shut down another node (node B) in the same region (haven't tried with other regions yet, but from past experience I expect the same behavior)
- Shortly after that, this message appears in the job log (as well as in system.log) on node A:
- From this point on, the repair job seems to hang:
- no further messages in the job log
- nor any related messages in system.log
- CPU stayed low (low single-digit percentage of one CPU)
- After an hour, manually kill the repair job (located via "ps -eaf | grep repair")
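A minimal sketch of this cleanup step, using a dummy `sleep` process as a stand-in for the hung repair job so the commands are safe to run anywhere (on the real node you would grep for `repair` instead of `sleep`):

```shell
# Stand-in for the hung repair process; on node A this would be the
# long-running repair job itself.
sleep 300 &

# Locate the PID the same way as in the report (ps -eaf | grep ...);
# the [s] bracket trick keeps grep from matching its own command line.
pid=$(ps -eaf | grep '[s]leep 300' | awk '{print $2}' | head -n 1)

kill "$pid"                     # a plain SIGTERM first; -9 only as a last resort
wait "$pid" 2>/dev/null || true # reap it so the PID is really gone
echo "killed $pid"
```

Killing the client-side repair process does not cancel repair sessions already running inside the Cassandra daemon, which is presumably why the restart of C* on node A (next step) was needed as well.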
- Restart C* on node A
- Verified the system is up and there are no error messages in system.log
- Also verified that there are no error messages on node B
- After node A settles down (i.e. no new messages in system.log), restart the same repair job:
- The job fails pretty quickly, reporting errors from two more nodes, B and K:
- On those nodes (B and K), similar errors appear:
- At this point we are back where we started: kill the repair job on node A, then restart C* on BOTH nodes A and K, but we still see the same exceptions, except that sometimes they come from other servers all over the ring.
- Business impact: I am in the process of launching a Cassandra-based production system, but I have to hold back now because of how fragile repair is, and I have been told by many sources that I must rely on periodic repair jobs to fix data inconsistencies.
- The only workaround was a rolling restart of the Cassandra server on ALL nodes in the entire cluster
- After that, the repair job proceeds without any error
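The rolling-restart workaround can be sketched as a loop over the cluster's hosts. This is a dry run over hypothetical host names; the real version would ssh to each node, drain it, restart Cassandra, and wait for the node to come back up before moving to the next one, so only one node is down at a time:

```shell
# Hypothetical host list; substitute the cluster's actual nodes.
hosts="node-a node-b node-k"

restarted=""
for host in $hosts; do
    # On a real cluster this loop body would be something like:
    #   ssh "$host" 'nodetool drain && sudo service cassandra restart'
    # then poll 'nodetool status' until the node reports UN (Up/Normal)
    # before restarting the next node.
    echo "would restart cassandra on $host"
    restarted="$restarted$host "
done
```

Draining before the restart (`nodetool drain`) flushes memtables and stops the node from accepting traffic, which makes each restart cleaner than a hard stop.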