(most observations below are from 0.8, but I just now tested on
trunk and I can trigger this problem just by bootstrapping a ~180
nod cluster concurrently, presumably due to the number of nodes that
are simultaneously in bootstrap state)
It turns out that:
- (1) calculatePendingRanges is not just expensive, it's computationally complex - cubic or worse
- (2) it gets called NOT just once per node being bootstrapped/leaving etc, but is called repeatedly while nodes are in these states
As a result, clusters start exploding when you start reading 100-300
nodes. The GossipStage will get backed up because a single
calculdatePenginRanges takes seconds, and depending on what the
average heartbeat interval is in relation to this, this can lead to
massive cluster-wide flapping.
This all started because we hit this in production; several nodes
would start flapping several other nodes as down, with many nodes
seeing the entire cluster, or a large portion of it, as down. Logging
in to some of these nodes you would see that they would be constantly
flapping up/down for minutes at a time until one became lucky and it
In the end we had to perform an emergency full-cluster restart with
gossip patched to force-forget certain nodes in bootstrapping state.
I can't go into all details here from the post-mortem (just the
write-up would take a day), but in short:
- We graphed the number of hosts in the cluster that had more than 5
Down (in a cluster that should have 0 down) on a minutely timeline.
- We also graphed the number of hosts in the cluster that had GossipStage backed up.
- The two graphs correlated extremely well
- jstack sampling showed it being CPU bound doing mostly sorting under calculatePendingRanges
- We were never able to exactly reproduce it with normal RING_DELAY and gossip intervals, even on a 184 node cluster (the production cluster is around 180).
- Dropping RING_DELAY and in particular dropping gossip interval to 10 ms instead of 1000 ms, we were able to observe all of the behavior we saw in production.
So our steps to reproduce are:
- Launch 184 node cluster w/ gossip interval at 10ms and RING_DELAY at 1 second.
- Do something like: while [ 1 ] ; do date ; echo decom ; nodetool decommission ; date ; echo done leaving decommed for a while ; sleep 3 ; date ; echo done restarting; sudo rm -rf /data/disk1/commitlog/* ; sudo rm -rf /data/diskarray/tables/* ; sudo monit restart cassandra ;date ; echo restarted waiting for a while ; sleep 40; done (or just do a manual decom/bootstrap once, it triggers every time)
- Watch all nodes flap massively and not recover at all, or maybe after a long time.
I observed the flapping using a python script that every 5 second
(randomly spread out) asked for unreachable nodes from all nodes in
the cluster, and printed any nodes and their counts when they had
unreachables > 5. The cluster can be observed instantly going into
massive flapping when leaving/bootstrap is initiated. Script needs
Cassandra running with Jolokia enabled for http/json access to
JMX. Can provide scrit if needed after cleanup.
The phi conviction, based on logging I added, was legitimate. Using
the 10 ms interval the average heartbeat interval ends up being like 25
ms or something like that. As a result, a single ~ 2 second delay in
gossip stage is huge in comparison to those 25 ms, and so we go past
the phi conviction threshold. This is much more sensitive than in
production, but it's the same effect, even if it triggers less
easily for real.
The best work around currently internally is to memoize
calculatePendingRanges so that we don't re-calculate if token meta
data, list of moving, list of bootstrapping and list of leaving are
all the same as on prior calculation. It's not entirely clear at this
point whether there is a clean fix to avoid executing
calculatePendingRanges more than once per unique node in this state.
It should be noted though that even if that is fixed, it is not
acceptable to spend several seconds doing these calculations on a ~
200 node cluster and it needs to be made fundamentally more efficient.
Here is a dump of thoughts by me in an internal JIRA ticket (not
exhaustive, I just went as far as to show that there is an issue;
there might be worse things I missed, but worse than cubic is bad
enough that I stopped):
(Comment uses 0.8 source.)
Okay, so let's break down the computational complexity here.
Suppose ring size is n and number of bootstrapping/leaving tokens is m. One of two places that take time (by measurement) is this part of calculatePendingRanges():
I'll ignore stuff that's log or better.
The outer loops is O(m). The inner loop is O, making aggregate so far O(nm).
We have a call in there to updateNormalTokens() which implies a sorting, which his O(n log). So now we're at O(n log m).
Next up we call getAddressRanges() which immediately does another O(n log sort. we're still at O(n log m. It then iterates (linear) and:
- calls getPrimaryRangeFor() for each.
- calls calculateNaturalEndpoints for each.
The former ends up sorting again, so now we're at O(n log n log m (worse than quadratic).
NTS.calculateNaturalEndpoints starts by collecting token meta data for nodes in the DC, by using updateNormalToken, which implies sorting. Woha woha. Now we're at O(n log n log n log m).
I might have missed things that are even worse, but this is bad enough to warrant this ticket. To put into perspective, 168 ^ 3 is 4.7 million.
|Status||Open [ 1 ]||Patch Available [ 10002 ]|
|Fix Version/s||1.2 [ 12319262 ]|
|Fix Version/s||1.1 [ 12317615 ]|
|Fix Version/s||1.2 [ 12319262 ]|
|Priority||Critical [ 2 ]||Major [ 3 ]|
|Status||Patch Available [ 10002 ]||Resolved [ 5 ]|
|Resolution||Fixed [ 1 ]|
|Workflow||no-reopen-closed, patch-avail [ 12651556 ]||patch-available, re-open possible [ 12749230 ]|
|Workflow||patch-available, re-open possible [ 12749230 ]||reopen-resolved, no closed status, patch-avail, testing [ 12754125 ]|