Resolution: Cannot Reproduce
Fix Version/s: None
Component/s: Local/Startup and Shutdown
When nodes restart in a large cluster, they mark all nodes as "alive", which first calls markDead and then creates an EchoMessage and in the callback to that marks the node as alive. This works great, except when that initial echo fails for w.e. reason and that node is marked as dead, in which case it will remain dead for a long while.
We mostly see this on 100+ node clusters, and almost always when nodes are in different datacenters that have unreliable network connections (e.g, cross region in AWS) and I think that it comes down to a combination of:
1. Only a node itself can mark another node as "UP"
2. Nodes only gossip with dead nodes with probability #dead / (#live +1)
In particular the algorithm in #2 leads to long convergence times because the number of dead nodes it typically very small compared to the cluster size. My back of the envelope model of this algorithm indicates that for a 100 node cluster this would take an average of ~50 seconds with a stdev of 50 seconds, which means we might be waiting minutes for the nodes to gossip with each other. I'm modeling this as the minimum of two geometric distributions with parameter p=1/#nodes, yielding a geometric distribution with parameter p=1-(1-(1/#nodes)^2). So for a 100 node cluster:
Since we gossip every second that means that on expectation in a 100 node cluster these nodes would see each other after about a minute and in a thousand node cluster, after ~8 minutes. For 100 node clusters the variance is astounding, and means that in particular edge cases we might be waiting hours before these nodes gossip with each other.
I'm thinking of writing a patch which either:
- Makes gossip order a shuffled list that includes dead nodes a la swim gossip. This would make it so that we waste some rounds on dead nodes but guarantee linear bounding of gossip.
- Adds an endpoint that re-triggers gossip with all nodes. Operators could call this after a restart a few times if they detect a gossip inconsistency.
- Bounding the probability we gossip with a dead node at some reasonable number like 1/10 or something. This might cause a lot of gossip load when a node is actually down for large clusters, but would also act to bound the variance.
- Something else?
I've got a WIP branch on 3.11 which implements options #1 and #2, but I can reduce/change/modify as needed if people think there is a better way. The patch doesn't pass tests yet but I'm not going to change/add the tests unless we think moving to time bounded gossip for down nodes is a good idea.