Status: Triage Needed
I hit a weird issue when recovering a cluster after two of its nodes are stopped.
It is easily reproducible and looks like a bug that should be fixed.
The steps to reproduce it are below.
=== STEPS TO REPRODUCE ===
- Create a 3-node cluster with RF=3
- node1(seed), node2, node3
- Start requests against the cluster with cassandra-stress (it keeps running
for the whole duration)
- what we did: `cassandra-stress mixed cl=QUORUM duration=10m
-errors ignore -node node1,node2,node3 -rate threads>=16`
- (it doesn't have to be this many threads; even 1 reproduces it)
- Stop node3 normally (with `systemctl stop` or `kill` without `-9`)
- the cluster is still available, as expected, because a quorum of replicas is still up
- Stop node2 normally (with `systemctl stop` or `kill` without `-9`)
- the cluster is NOT available after this, as expected
- the client gets `UnavailableException: Not enough replicas
available for query at consistency QUORUM`
- the client gets the errors right away (within a few ms)
- so far everything is as expected
- Wait for 1 minute
- Bring node2 back up
- The issue happens here:
- the client gets `ReadTimeoutException` or `WriteTimeoutException`
depending on whether the request is a read or a write, even after node2 is back up
- the client gets the errors after about 5000 ms (read) or 2000 ms (write),
which are the default read and write request timeouts
- what node1 reports with `nodetool status` and what node2 reports
are inconsistent (node2 thinks node1 is down)
- It takes a very long time to recover from this state
=== STEPS TO REPRODUCE ===
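For convenience, the sequence above can be sketched as a dry-run shell script. The node names, the `cassandra` service name, and SSH access are assumptions taken from this report's setup; `run` only echoes each command, so the sketch is safe to execute as-is and is meant to be read as a recipe, not run against production:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps.
# 'run' only prints each command; swap it for real execution on a test cluster.
run() { echo "+ $*"; }

# Background mixed workload at QUORUM against all three nodes.
run cassandra-stress mixed cl=QUORUM duration=10m \
    -errors ignore -node node1,node2,node3 -rate "threads>=16" &

run ssh node3 systemctl stop cassandra   # quorum still holds: cluster stays available
run ssh node2 systemctl stop cassandra   # quorum lost: UnavailableException, fast

sleep 1                                  # stands in for the 1-minute wait

run ssh node2 systemctl start cassandra  # issue starts here: timeouts instead of recovery

# Compare the two views of the ring; per the report they disagree
# (node2 thinks node1 is down).
run ssh node1 nodetool status
run ssh node2 nodetool status
wait
```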
Some additional important information to note:
- If we don't run cassandra-stress, the issue does not occur.
- Restarting node1 makes the cluster recover right after the restart.
- Setting a lower value for dynamic_snitch_reset_interval_in_ms (e.g. 60000)
fixes the issue.
- If we `kill -9` the nodes instead, the issue does not occur.
- Hints seem unrelated: I tested with hints disabled and it made no difference.
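For reference, the workaround mentioned above is a `cassandra.yaml` setting; a minimal fragment, using the value tried in this report (the shipped default is 600000 ms, i.e. 10 minutes):

```yaml
# cassandra.yaml: reset the dynamic snitch's latency scores more often,
# presumably so a freshly restarted node stops being penalized sooner.
# Default is 600000 (10 minutes); 60000 is the value tried in this report.
dynamic_snitch_reset_interval_in_ms: 60000
```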