I ran a series of stress tests with eager retries enabled, and they show several undesirable behaviors. I'm grouping them into one ticket because they are most likely related.
1) Killing off a node in a 4-node cluster actually increases performance.
2) Compactions slow nodes down, and the nodes stay slow even after the compaction is done.
3) Eager retries tend to lessen the immediate performance impact of a node going down, but not consistently.
1 stress machine: node0
4 C* nodes: node4, node5, node6, node7
node0 writes some data: stress -d node4 -F 30000000 -n 30000000 -i 5 -l 2 -K 20
node0 reads some data: stress -d node4 -n 30000000 -o read -i 5 -K 20
At 450s, I kill -9 one of the nodes. There is a brief decrease in performance as the snitch adapts, but then it recovers... to even higher performance than before.
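To put numbers on "brief decrease" and "higher performance than before", the per-interval op-rates could be summarized around the kill time. The following is only a rough sketch: the function name, the `(elapsed_seconds, ops_per_second)` sample format, and the 60-second settle window are assumptions for illustration, not part of the stress tool or its output format.

```python
from statistics import mean


def summarize_kill_impact(samples, kill_time, settle=60):
    """Summarize op-rate before, during, and after a node kill.

    samples:   iterable of (elapsed_seconds, ops_per_second) pairs,
               one per reporting interval (hypothetical format).
    kill_time: elapsed seconds at which the node was killed (450 above).
    settle:    assumed number of seconds allowed for the cluster to adapt.
    """
    samples = list(samples)
    before = [rate for t, rate in samples if t < kill_time]
    dip_window = [rate for t, rate in samples
                  if kill_time <= t < kill_time + settle]
    after = [rate for t, rate in samples if t >= kill_time + settle]
    if not (before and dip_window and after):
        raise ValueError("need samples before, during, and after the kill")

    baseline = mean(before)
    return {
        # Mean op-rate before the kill.
        "baseline": baseline,
        # Worst interval in the settle window, as a fraction lost vs. baseline.
        "dip_fraction": 1 - min(dip_window) / baseline,
        # Values above 1.0 mean the cluster got faster after losing a node.
        "recovered_ratio": mean(after) / baseline,
    }
```

A `recovered_ratio` above 1.0 would correspond to behavior 1) in the list above, and comparing `dip_fraction` across trials would make the eager-retry comparison less dependent on eyeballing the graphs.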
The green and orange lines represent trials with eager retry enabled. Unlike the red and blue lines, they never recover the op-rate they had before the compaction.
This graph looked the most promising to me: the two trials with eager retry (the green and orange lines) showed the smallest dip in performance at 450s.
This is a rerun with the same settings as above, yet this time the 95percentile eager retry trial (red line) did poorly at 450s.
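For readers unfamiliar with the 95percentile setting, the general idea of a percentile-based eager retry can be sketched as follows. This is a simplified illustration under stated assumptions, not Cassandra's actual implementation: the function names are hypothetical, and the nearest-rank percentile is just one common way to compute the threshold.

```python
import math


def percentile_threshold(latencies_ms, pct=95.0):
    """Return the nearest-rank percentile of recent read latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Nearest-rank method: smallest value with at least pct% of samples <= it.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def should_retry(elapsed_ms, threshold_ms, got_response):
    """Decide whether to send a redundant read to another replica.

    The retry fires only if the first replica has not answered and the
    request has already waited longer than the percentile threshold.
    """
    return (not got_response) and elapsed_ms > threshold_ms
```

Because the threshold is derived from recent latencies, it shifts as the cluster's latency profile changes, which may be one reason the effect at 450s differs between otherwise identical trials.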