there is the potential to take us back to the Bad Old Days when hinted handoff (HH) could cause cascading failures.
To elaborate, the scenario here is: we did a write that succeeded on some nodes but not others, so we need to write a local hint to replay to the down-or-slow nodes later. But those nodes being down-or-slow means load has already increased on the rest of the cluster, and writing the extra hint will increase it further, possibly enough that other nodes will see this coordinator as down-or-slow too, and so on.
So I think what we want to do, with this option on, is attempt the hint write, but if we can't do it in a reasonable time, throw back a TimedOutException, which is already our signal that "your cluster may be overloaded; you need to back off."
Specifically, we could add a separate executor here with a blocking, capped queue. When we go to do a hint-after-failure, we enqueue the write; if it is rejected because the queue is full, we throw the TOE. Otherwise, we wait for the write to complete and then return success to the client.
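A minimal sketch of that executor, using standard `java.util.concurrent` pieces: a bounded `ArrayBlockingQueue` plus `AbortPolicy` so a full queue rejects immediately rather than blocking, and the rejection is translated into the overload signal. The class name, queue capacity, and the use of `java.util.concurrent.TimeoutException` as a stand-in for Cassandra's TimedOutException are all assumptions for illustration, not the actual implementation.

```java
import java.util.concurrent.*;

public class HintBackpressure {
    // Hypothetical capacity; in practice this would be tuned to absorb
    // load spikes without making post-enqueue waits significant.
    private static final int HINT_QUEUE_CAPACITY = 1024;

    // Bounded queue + AbortPolicy: when the queue is full, submit() throws
    // RejectedExecutionException instead of blocking the coordinator.
    private final ThreadPoolExecutor hintExecutor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(HINT_QUEUE_CAPACITY),
            new ThreadPoolExecutor.AbortPolicy());

    /** Enqueue a hint write; translate queue-full rejection into the
     *  overload signal ("your cluster may be overloaded, back off"). */
    public void writeHint(Runnable hintWrite) throws TimeoutException {
        Future<?> pending;
        try {
            pending = hintExecutor.submit(hintWrite);
        } catch (RejectedExecutionException full) {
            // Stand-in for throwing TimedOutException back to the client.
            throw new TimeoutException("hint queue full; cluster may be overloaded");
        }
        try {
            pending.get(); // wait for the hint write before acking the client
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The key design point is `AbortPolicy` over the default of blocking or the caller-runs policy: rejection must surface immediately so the client gets the back-off signal instead of the coordinator silently absorbing more work.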
The tricky part is that the queue needs to be large enough to handle load spikes, but small enough that the wait-for-success-after-enqueue is negligible compared to RpcTimeout. If we had different timeouts for writes than for reads (which we don't – CASSANDRA-959), it might be nice to use, say, 80% of the timeout for the normal write and reserve 20% for the hint phase.
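To make the budgeting concrete, here is a trivial sketch of the hypothetical 80/20 split of a single rpc timeout. The percentages and the 10-second example value are illustrative only; there is no such split today.

```java
public class TimeoutBudget {
    /** Hypothetical: 80% of the rpc timeout budgeted for the normal write phase. */
    public static long writePhaseMs(long rpcTimeoutMs) {
        return rpcTimeoutMs * 80 / 100;
    }

    /** Hypothetical: the remaining 20% reserved for the hint phase. */
    public static long hintPhaseMs(long rpcTimeoutMs) {
        return rpcTimeoutMs - writePhaseMs(rpcTimeoutMs);
    }
}
```

With a 10,000 ms rpc timeout, that leaves 8,000 ms for the write and 2,000 ms for the hint, so a hint write that can't finish inside its reserved slice would time out without having consumed the whole client-visible timeout.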