Status: Triage Needed
Fix Version/s: None
Looks like nobody found this yet so this may be a ticking time bomb for some...
This happened to me earlier today. On a Cassandra 3.11.4 cluster with three DCs, one DC had three servers fail due to unexpected external circumstances. Replication used NetworkTopologyStrategy (NTS) configured as 2:2:2.
Cassandra dealt with the failures just fine - great! However, the nodes failed in a way that made bringing them back impossible, so I tried to remove them using 'removenode'.
Suddenly, the application started experiencing a large number of QUORUM write timeouts. My first reflex was to lower the streaming throughput and compaction throughput, since timeouts indicated some overload was happening. No luck, though.
I tried a bunch of other things to reroute queries away from the affected datacenter, like changing the Severity field on the dynamic snitch. Still, no luck.
After a while I noticed one strange thing: the WriteTimeoutException reported that five replicas were required, instead of the four you would expect for a QUORUM write with 2:2:2 replication. I shrugged it off as some weird inconsistency, probably caused by the use of batches.
Skipping ahead a bit: since nothing I did was working, I decided to let the streams run again and just wait the issue out, hoping that letting the streams finish would resolve the apparent overload. Magically, as soon as the streams finished, the errors stopped.
There are two issues here, both in AbstractWriteResponseHandler.java.
In totalBlockFor, Cassandra always adds pending nodes to `blockFor`. For a QUORUM write on a 2:2:2 keyspace, a removenode in progress adds a pending replica, which raises blockFor from 4 to 5. If that pending replica is also down (as can happen when removenode is used and not all destination hosts are up) while two replicas in one DC are already down, only 4 of the 5 required replicas are available, and QUORUM writes will never succeed.
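To make the arithmetic concrete, here is a small self-contained sketch (illustrative only, not the actual Cassandra source; the names merely mirror totalBlockFor) of how one pending endpoint pushes the number of required acks past what the surviving replicas can ever deliver:

```java
// Illustrative sketch only: mirrors the blockFor arithmetic described above,
// it is not copied from AbstractWriteResponseHandler.java.
public class BlockForExample
{
    // QUORUM over the full replication factor: floor(RF / 2) + 1
    static int quorumFor(int replicationFactor)
    {
        return replicationFactor / 2 + 1;
    }

    // Pending endpoints are added on top of the consistency-level requirement.
    static int totalBlockFor(int replicationFactor, int pendingEndpoints)
    {
        return quorumFor(replicationFactor) + pendingEndpoints;
    }

    public static void main(String[] args)
    {
        int rf = 6;        // 2:2:2 across three DCs
        int pending = 1;   // one pending replica created by the removenode
        int live = 4;      // two natural replicas down in one DC, pending replica also down

        int blockFor = totalBlockFor(rf, pending); // 4 + 1 = 5
        System.out.printf("blockFor=%d, live=%d -> %s%n", blockFor, live,
                live >= blockFor ? "write can succeed" : "write can never succeed (times out)");
    }
}
```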
While debugging this, I spent all my time treating the issue as a timeout. In reality, Cassandra was executing queries that could never succeed, because insufficient hosts were available; throwing an UnavailableException up front would have been far more helpful. The cause is assureSufficientLiveNodes, which merely concatenates the natural and pending endpoint lists and counts the live ones, without considering the special-case behavior of a pending node that is down.
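To illustrate that mismatch, here is a simplified sketch (hypothetical names, not the real methods): the availability check only requires blockFor live endpoints out of the concatenated natural + pending lists, while the write itself waits for blockFor plus the pending count, so a down pending node slips through the check and the write can never gather enough acks:

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of the mismatch described above (names are hypothetical,
// not the exact methods in AbstractWriteResponseHandler.java).
public class AvailabilityCheckExample
{
    // The availability check: counts live endpoints across natural + pending
    // and compares against the consistency level's blockFor (4 for QUORUM at RF 6).
    static boolean sufficientLiveNodes(List<String> natural, List<String> pending,
                                       Predicate<String> isAlive, int blockFor)
    {
        long live = natural.stream().filter(isAlive).count()
                  + pending.stream().filter(isAlive).count();
        return live >= blockFor;
    }

    public static void main(String[] args)
    {
        List<String> natural = List.of("a", "b", "c", "d", "down1", "down2");
        List<String> pending = List.of("down3");        // pending replica, also down
        Predicate<String> isAlive = host -> !host.startsWith("down");

        int blockFor = 4;                               // QUORUM over RF 6
        int totalBlockFor = blockFor + pending.size();  // 5 acks actually awaited

        // Passes: 4 live >= 4, so no UnavailableException is thrown...
        System.out.println("check passes: " + sufficientLiveNodes(natural, pending, isAlive, blockFor));
        // ...but only 4 endpoints can ever respond, while 5 acks are required.
        System.out.println("acks required: " + totalBlockFor + ", acks possible: 4");
    }
}
```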