Right. What I'm saying is that, for the purpose of picking nodes to send reads/writes to, it doesn't make sense to pick nodes that we know for a fact we cannot communicate with. As I indicated, I'm afraid this may not translate to gossip up/down state, which may have other implications. But in the most extreme case, if I'm connecting to a co-ordinator and submitting a read at CL.ONE, I don't want the co-ordinator to opt to send it to a node it knows for a fact it is currently unable to communicate with, causing RPC timeouts until (roughly) rpc_timeout seconds have passed and the dynamic snitch has figured things out.
Although... the more I think of it maybe it's better to just go for more actively weighting in outstanding RPC requests instead.
Basically, from what I've observed in high-throughput (lots of queries) clusters, most day-to-day "hiccups" that spill over to applications tend to be one of:
- Process got stopped without disablegossip+wait, or got killed, etc.
- Process had a temporary hiccup (e.g., GC pause, networking glitch, sudden burst of streaming from another node causing a short duration of disk bottlenecking)
- Process is legitimately overloaded for a while, e.g. due to disk I/O, and despite dynamic snitching there is an annoyingly significant impact on the cluster, likely resulting from the periodic dynamic snitch reset.
All of these, in addition to many other "soft" failure modes that add up to "node is slow", should be helped quite a lot by weighting outstanding request count heavily when picking nodes. I'm not necessarily suggesting flat-out least-used selection, and I'm aware that one could introduce foot shooting here, and there are performance vs. responsiveness trade-offs.
In the end, the goal is often not really to maximize throughput, but rather to keep a decent latency consistently. Not necessarily the lowest possible average latency, but avoiding extreme outliers. Particularly when application code is not good at dealing with concurrency spikes (e.g., thread limits, process limits, whatever).
If we get a read and know that we have 50 requests outstanding to the node that is closest according to the snitch, but 0 outstanding to the others, we shouldn't be adding another request onto that pile...
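To make the idea concrete, here is a minimal sketch of what I mean by weighting, not a proposal for Cassandra's actual replica-selection code. All names here (ReplicaPicker, pick, the node labels) are hypothetical; the assumption is simply that we have a snitch-style latency score per node (lower is better) and an in-flight request count per node, and that outstanding requests multiply a node's effective cost:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Cassandra's real API): combine a snitch-style
// latency score with the outstanding request count so that a "close" but
// saturated node loses to an idle, slightly more distant peer.
public class ReplicaPicker {
    // latencyScore: per-node snitch score, lower is better.
    // outstanding:  per-node count of in-flight RPC requests.
    static String pick(Map<String, Double> latencyScore,
                       Map<String, Integer> outstanding) {
        String best = null;
        double bestWeight = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : latencyScore.entrySet()) {
            int inFlight = outstanding.getOrDefault(e.getKey(), 0);
            // Each outstanding request scales up the node's effective cost,
            // so queue depth dominates once a node backs up.
            double weight = e.getValue() * (1 + inFlight);
            if (weight < bestWeight) {
                bestWeight = weight;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("nodeA", 1.0);      // closest per the snitch
        scores.put("nodeB", 2.0);
        Map<String, Integer> outstanding = new HashMap<>();
        outstanding.put("nodeA", 50);  // ...but with 50 requests queued
        outstanding.put("nodeB", 0);
        // nodeA: 1.0 * 51 = 51, nodeB: 2.0 * 1 = 2 -> nodeB wins.
        System.out.println(pick(scores, outstanding));
    }
}
```

With all counts at zero this degrades to plain snitch ordering, which is the responsiveness vs. throughput knob: how aggressively the multiplier kicks in is exactly where the foot-shooting risk lives.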
Thoughts? Am I trying to make the problem simpler than it is?