[CASSANDRA-14001] Gossip after node restart can take a long time to converge about "down" nodes in large clusters - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Low
Resolution: Cannot Reproduce
Fix Version/s: None
Component/s: Local/Startup and Shutdown
Labels:
None

Description

When nodes restart in a large cluster, they mark all nodes as "alive", which first calls markDead and then creates an EchoMessage and in the callback to that marks the node as alive. This works great, except when that initial echo fails for w.e. reason and that node is marked as dead, in which case it will remain dead for a long while.

We mostly see this on 100+ node clusters, and almost always when nodes are in different datacenters that have unreliable network connections (e.g, cross region in AWS) and I think that it comes down to a combination of:
1. Only a node itself can mark another node as "UP"
2. Nodes only gossip with dead nodes with probability #dead / (#live +1)

In particular the algorithm in #2 leads to long convergence times because the number of dead nodes it typically very small compared to the cluster size. My back of the envelope model of this algorithm indicates that for a 100 node cluster this would take an average of ~50 seconds with a stdev of 50 seconds, which means we might be waiting minutes for the nodes to gossip with each other. I'm modeling this as the minimum of two geometric distributions with parameter p=1/#nodes, yielding a geometric distribution with parameter p=1-(1-(1/#nodes)^2). So for a 100 node cluster:

100 node cluster =>
X = Pr(node1 gossips with node2) = geom(0.01)
Y = Pr(node 2 gossips with node1) = geom(0.01)
Z = min(X or Y) = geom(1 - (1 - 0.01)^2) = geom(0.02)
E[Z] = 1/0.02 = 50
V[Z] = (1-0.02)/(0.02)^2 = 2450

1000 node cluster ->
Z = geom(1 - (1 - 0.001)^2) = geom(0.002)
E[Z] = 500
V[Z] = 24500

Since we gossip every second that means that on expectation in a 100 node cluster these nodes would see each other after about a minute and in a thousand node cluster, after ~8 minutes. For 100 node clusters the variance is astounding, and means that in particular edge cases we might be waiting hours before these nodes gossip with each other.

I'm thinking of writing a patch which either:

Makes gossip order a shuffled list that includes dead nodes a la swim gossip. This would make it so that we waste some rounds on dead nodes but guarantee linear bounding of gossip.
Adds an endpoint that re-triggers gossip with all nodes. Operators could call this after a restart a few times if they detect a gossip inconsistency.
Bounding the probability we gossip with a dead node at some reasonable number like 1/10 or something. This might cause a lot of gossip load when a node is actually down for large clusters, but would also act to bound the variance.
Something else?

I've got a WIP branch on 3.11 which implements options #1 and #2, but I can reduce/change/modify as needed if people think there is a better way. The patch doesn't pass tests yet but I'm not going to change/add the tests unless we think moving to time bounded gossip for down nodes is a good idea.

Attachments

Issue Links

relates to

CASSANDRA-14358 Partitioned outbound internode TCP connections can occur when nodes restart

Resolved

CASSANDRA-14297 Startup checker should wait for count rather than percentage

Resolved

CASSANDRA-13993 Add optional startup delay to wait until peers are ready

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Joey Lynch

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 08/Nov/17 01:18

Updated:: 16/Apr/19 09:29

Resolved:: 30/Mar/18 23:58