Details
Type: Task
Status: Open
Priority: Low
Resolution: Unresolved
Description
Ignoring all the theory, the failure detector boils down to an
extremely simple algorithm. The FD keeps a sliding window (currently
of size 1000) of heartbeat intervals for a given host; in other words,
we have a track record of the last 1000 times we saw an updated
heartbeat for that host.
At any given moment, a host has a score which is simply the time since
the last heartbeat divided by the mean interval in the sliding
window. For historical reasons, a simple scaling factor is applied to
this prior to checking the phi conviction threshold.
(CASSANDRA-2597 has details, but thanks to Paul's work there it's now
trivial to understand what it does based on gut feeling)
So in effect, a host is considered down if we haven't heard from it
for significantly longer than the "average" interval at which we
expect to hear from it.
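The mechanism described above can be sketched roughly as follows. This is a simplified, hypothetical rendering, not Cassandra's actual code; the class name, the scaling factor of 1/ln(10) (per CASSANDRA-2597), and the threshold value of 8 are assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch of the FD scoring, one instance per tracked host.
class SimpleFailureDetector {
    static final int WINDOW_SIZE = 1000;                    // last 1000 heartbeat intervals
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);  // scaling factor (assumed, per CASSANDRA-2597)
    static final double PHI_THRESHOLD = 8.0;                // conviction threshold (assumed default)

    private final Deque<Double> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    // Called whenever we see an updated heartbeat for the host.
    void report(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            if (intervals.size() == WINDOW_SIZE)
                intervals.removeFirst();                    // slide the window
            intervals.addLast((double) (nowMillis - lastHeartbeatMillis));
        }
        lastHeartbeatMillis = nowMillis;
    }

    double mean() {
        if (intervals.isEmpty())
            return 0;
        double sum = 0;
        for (double d : intervals)
            sum += d;
        return sum / intervals.size();
    }

    // Score: time since last heartbeat over the mean interval, scaled.
    double phi(long nowMillis) {
        double m = mean();
        if (m == 0)
            return 0;
        return PHI_FACTOR * (nowMillis - lastHeartbeatMillis) / m;
    }

    boolean isConvictedDown(long nowMillis) {
        return phi(nowMillis) > PHI_THRESHOLD;
    }
}
```

With steady 1-second heartbeats, a silence of a few seconds scores well below the threshold, while a silence of tens of seconds convicts the host.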
This seems reasonable, but it does assume that under normal conditions
the average time between heartbeats does not change for reasons other
than those that would be plausible reasons to think a node is
unhealthy.
This assumption could be violated by the gossip-to-seed
feature. There is an argument for avoiding gossip-to-seed for other
reasons (see CASSANDRA-3829), but this is a concrete case in which
gossip-to-seed could cause a negative side-effect of the general kind
mentioned in CASSANDRA-3829 (see notes at the end about the no-seeds
case not being continuously tested). Normally, due to gossip-to-seed,
everyone essentially sees the latest information within very few
heartbeats (assuming only 2-3 seeds). But should all seeds be down, we
suddenly flip a switch and start relying on generalized propagation
in the gossip system, rather than the seed special case.
The potential problem I foresee here is that if the average propagation
time suddenly spikes when all seeds become unavailable, it could cause
bogus flapping of nodes into the down state.
In order to test this, I deployed a ~180 node cluster with a version
that logs heartbeat information on each interpret(), similar to:
INFO [GossipTasks:1] 2012-02-01 23:29:58,746 FailureDetector.java (line 187) ep /XXX.XXX.XXX.XXX is at phi 0.0019521638443084342, last interval 7.0, mean is 1557.2777777777778
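As a sanity check on the scoring formula, the logged phi can be reproduced from the logged interval and mean, assuming the scaling factor in play is 1/ln(10) as introduced by CASSANDRA-2597:

```java
public class PhiCheck {
    public static void main(String[] args) {
        double timeSinceLast = 7.0;               // "last interval" from the log line, ms
        double mean = 1557.2777777777778;         // "mean is" from the log line, ms
        double phiFactor = 1.0 / Math.log(10.0);  // assumed scaling factor
        double phi = phiFactor * timeSinceLast / mean;
        System.out.println(phi);                  // ≈ 0.00195, matching the logged phi
    }
}
```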
It turns out that, at least at 180 nodes, with 4 seed nodes, whether
or not seeds are running does not seem to matter significantly. In
both cases, the mean interval is around 1500 milliseconds.
I don't feel I have a good grasp of whether this is incidental or
guaranteed, and it would be good to at least empirically test
propagation time w/o seeds at different cluster sizes; it's supposed
to be unaffected by cluster size (RING_DELAY is static for this
reason, as I understand it). It would be nice to confirm that this is
the case.
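For anyone repeating the experiment, the per-host mean intervals can be pulled straight out of the log lines above and compared across runs (seeds up vs. seeds down, different cluster sizes). A hypothetical helper, assuming the log format shown earlier:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extract the mean heartbeat interval from a
// FailureDetector log line of the form shown above.
public class HeartbeatLogParser {
    private static final Pattern LINE = Pattern.compile(
        "ep (\\S+) is at phi (\\S+), last interval (\\S+), mean is (\\S+)");

    public static double meanInterval(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find())
            throw new IllegalArgumentException("not a heartbeat log line: " + logLine);
        return Double.parseDouble(m.group(4));
    }
}
```

Feeding each node's log through this and plotting the means per endpoint over time should make any propagation-time spike after a seed outage directly visible.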