I'm -0 on the original bit of this ticket, but +1 on more generic changes that covers the original use case as good if not better anyway. I think that instead of trying to predict exactly the behavior of some particular event like compaction, we should just be better at actually responding to what is actually going on:
- We have
CASSANDRA-2540 which can help avoid blocking uselessly on a dropped or slow request even if we haven't had the opportunity to react to overall behavior yet (I have a partial patch that breaks read repair, I haven't had time to finish it).
- Taking into account the number of outstanding requests is IMO a necessity. There is plenty of precedent for anyone who wants that (least used connections policies in various LB:s), but more importantly it would so clearly help in several situations, including:
- Sudden GC pause of a node
- Sudden death of a node
- Sudden page cache eviction and slowness of a node, before snitching figures it out
- Constantly overloaded node; even with the dynsnitch it would improve the situation as the number of requests affected by a dynsnitch reset is lessened
- Packet loss/hiccup/whatever across DC:s
There is some potential for foot shooting in the sense that if a node is broken in a way that it responds with incorrect data, but responds faster than anyone else, it will tend to "swallow" all the traffic. But honestly, that feels like a minor concern to me based on what I've seen actually happen in production clusters. If we ever start sending non-successes back over inter-node RPC, this would change however.
My only major concern is potential performance impacts of keeping track of the number of outstanding requests, but if that does become a problem one can make it probabilistic - have N % of all requests be tracked. Less impact, but also less immediate response to what's happening.
This will also have the side-effect of mitigating sudden bursts of promotion into old-gen if we combine it with pro-actively dropping read-repair messages for nodes that are overloaded (effectively prioritizing data reads), hence helping for CASSANDRA-3853.
Should we T (send additional requests which are not part of the normal operations) the requests until the other node recovers?
In the absence of read repair, we'd have to do speculative reads as Stu has previously noted. With read repair turned on, this is not an issue because the node will still receive requests and eventually warm up. Only with read repair turned off do we not send requests to more than the first N of endpoints, with N being what is required by CL.
Semi-relatedly, I think it would be a good idea to make the proximity sorting probabilistic in nature so that we don't do a binary flip back and fourth between who gets data vs. digest reads or who doesn't get reads at all. That might mitigate this problem, but not help fundamentally since the rate of warm-up would decrease with a node being slow.
I do want to make this point though: Every single production cluster I have ever been involved with so far, has been such that you basically never want to turn read repair off. Not because of read repair itself, but because of the traffic it generates. Having nodes not receive traffic is extremely dangerous under most circumstances as it leaves nodes cold, only to suddenly explode and cause timeouts and other bad behavior as soon as e.g. some neighbor goes down and it suddenly starts taking traffic. This is an easy way to make production clusters fall over. If your workload is entirely in memory or otherwise not reliant on caching the problem is much less pronounced, but even then I would generally recommend that you keep it turned on if only because your nodes will have to be able to take the additional load anyway if you are to survive other nodes in the neighborhood going down. It just makes clusters much more easy to reason about.