My first experiments aimed to quantify the length of Gossip messages and determine which factors drive that length. I found that the size of certain Gossip messages increases proportionally with the number of vnodes (num_tokens in cassandra.yaml). I recorded message size over the num_tokens and node-count domains: (64, 128, 256, 512, ...) for tokens and (32, 64, 128, 256, 512) for nodes. I also made non-rigorous observations of user and kernel CPU usage (Ubuntu 10.04 LTS). My hunch is that both vnode count and node count have a mild effect on user CPU usage.
What is a rough estimate of the bytes sent for certain Gossip messages, and why does it matter? The Phi Accrual Failure Detector (Hayashibara et al.) assumes fixed-length heartbeat messages, while Cassandra uses variable-length messages. I observed a correlation between larger messages, higher vnode counts, and false-positive detections by the Gossip FailureDetector. These observations, IMHO, are not explained by the research paper. My hypothesis is that the false positives are due to jitter in the heartbeat inter-arrival intervals, and I wondered whether integrating over a longer baseline would reduce the jitter's effect.
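For reference, phi in the paper is derived from the tail probability of the inter-arrival distribution. Below is a minimal sketch of that calculation, assuming the paper's normal-distribution model; the class and parameter names are illustrative, not Cassandra's actual FailureDetector internals, and the window parameter is the "baseline" I am speculating about. A larger window means one jittered interval moves the mean and variance less, so phi spikes less.

```python
import math
from collections import deque

class PhiEstimator:
    """Sketch of the paper's phi accrual calculation over a sliding window."""

    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last = None

    def heartbeat(self, now):
        """Record a heartbeat arrival time (seconds)."""
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        """phi = -log10(P(next heartbeat arrives later than now))."""
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / (len(self.intervals) - 1)
        std = max(math.sqrt(var), 1e-3)     # floor to avoid division by zero
        t = now - self.last
        # normal tail probability: P(interval > t)
        p_later = 0.5 * math.erfc((t - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)      # avoid log10(0)
        return -math.log10(p_later)
```

With perfectly regular 1-second heartbeats, phi stays near zero while the next heartbeat is still "on time" and climbs steeply once it is overdue; the suspicion threshold is then a cutoff on phi.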
I have a second theory to follow up on. A newly added node will not have a long history of Gossip heartbeat inter-arrival times, and at least 40 samples are needed to compute the mean and variance with any statistical significance. It is possible the phi estimation algorithm is simply invalid for newly created nodes, and that is why we see them flap shortly after creation.
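To make this second theory concrete, here is a quick, purely illustrative simulation of how noisy the variance estimate is at small sample counts. The 1-second nominal interval and 100 ms of jitter are assumptions for the sketch, not measurements:

```python
import random
import statistics

def variance_spread(n_samples, trials=500, seed=42):
    """Spread (stdev) of the sample-variance estimate across many trials,
    each trial estimating variance from n_samples heartbeat intervals."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # heartbeat intervals: nominal 1 s with 100 ms of gaussian jitter (assumed)
        intervals = [rng.gauss(1.0, 0.1) for _ in range(n_samples)]
        estimates.append(statistics.variance(intervals))
    return statistics.stdev(estimates)
```

The variance estimated from a handful of intervals swings far more between trials than one estimated from hundreds, so phi computed for a freshly joined node is correspondingly unreliable, which is one plausible mechanism for the flapping.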
In any case, the message of interest is the GossipDigestAck2 (GDA2), because it is the largest of the Gossip messages. GDA2 contains the set of EndpointStateMaps (node metadata) for newly discovered nodes, i.e., those just added to an existing cluster. When a node becomes aware of a joining node, it gossips that state to three randomly chosen peers. The GDA2 message is tailored to contain only the delta of new node metadata the receiving node is unaware of.
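A hedged sketch of how a GDA2-style delta might be assembled: include only the endpoint state the peer's digest shows it has not seen. The types and names here are simplified stand-ins for Cassandra's EndpointState and GossipDigest, not the real API:

```python
def build_ack2_delta(local_state, peer_digest):
    """local_state: {endpoint: (version, metadata)} known to this node.
       peer_digest: {endpoint: max_version_seen} reported by the peer.
       Returns only entries the peer is missing entirely or has stale."""
    delta = {}
    for endpoint, (version, metadata) in local_state.items():
        if peer_digest.get(endpoint, -1) < version:
            delta[endpoint] = (version, metadata)
    return delta

# For example: the peer is current on 10.0.0.1 but stale on 10.0.0.2,
# so only 10.0.0.2's state ships in the delta.
local = {"10.0.0.1": (5, "tokens..."), "10.0.0.2": (3, "tokens...")}
digest = {"10.0.0.1": 5, "10.0.0.2": 1}
delta = build_ack2_delta(local, digest)   # contains only 10.0.0.2
```

This delta behavior is why GDA2 size depends on how much metadata is new to the receiver: when many nodes join at once, little can be omitted.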
For a single node, the upper limit on GDA2 message size is roughly 3 * N * k * V,
where N is the number of nodes in the cluster,
V is the number of tokens (vnodes) per node, and
k is a constant, approximately 64 bytes, representing a serialized token plus some other endpoint metadata.
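A back-of-envelope evaluation of that bound; the parameter defaults are the illustrative values above, not measurements:

```python
def gda2_upper_bound(nodes, vnodes, k=64, fanout=3):
    """Rough upper bound on GDA2 bytes: fanout * N * k * V."""
    return fanout * nodes * k * vnodes

# A single node's endpoint state alone is k * V bytes:
# at num_tokens=256, that is 64 * 256 = 16384 bytes (~16 KB),
# so the bound grows quickly as the cluster gets large.
```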
If one is running hundreds of nodes in a cluster, the Gossip traffic created when a node joins can be significant, and it grows with the number of nodes. I believe this to be the first-order effect, and it probably violates one of the assumptions of the Phi Accrual Failure Detector: that heartbeat messages are small enough not to consume a relevant amount of compute or communication resources. The variable transmission time (due to variable-length messages) is a clear violation of the assumptions, if I have read the source code correctly.
On a related topic, there is a hard-coded limit on the number of vnodes due to the serialization of the GDA messages.
No more than 1720 vnodes can be configured without the serialized vnode String exceeding 32K. A patch is provided below for future use should this become an issue.
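The rough arithmetic behind a ceiling of that order; the ~19 bytes per serialized token is my assumption for illustration, which lands near, though not exactly at, the 1720 figure:

```python
# Each Murmur3 token is a signed 64-bit integer, so its decimal form is up
# to 20 characters including the sign, plus a delimiter; ~19 bytes is
# assumed here as a working average, not a measured value.
TOKEN_CHARS = 19
STRING_LIMIT = 32 * 1024   # the 32K serialized-String ceiling noted above

max_vnodes = STRING_LIMIT // TOKEN_CHARS   # 1724, in the neighborhood of 1720
```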
In clusters with hundreds of nodes, GDA2 messages can reach 200 KB, or 2 MB if many nodes join simultaneously. This is not an issue if the machine experiences no contention from competing workloads. In the real world, however, nodes are added because the cluster load has grown, either in retained data or in transaction arrival rate, which means node resources may already be fully utilized at exactly the time new nodes are typically added.
It occurred to me that we have another use case to accommodate. It is common to experience transient failure modes, even in modern data centers with disciplined maintenance practices. Ethernet cables get moved; switches and routers are rebooted; BGP route errors and other temporary interruptions occur in the network fabric in real-world scenarios. People make mistakes, plans change, and preventive maintenance often causes short-lived interruptions in the network, CPU, and disk subsystems.