This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done):
We have a observed a problem with gossip which, when you are bootstrapping a new node (or replacing using the replace_token support), any node in the cluster which is Down at the time the node is started, will be assumed to be Up and then never ever flapped back to Down until you restart the node.
This has at least two implications to replacing or bootstrapping new nodes when there are nodes down in the ring:
- If the new node happens to select a node listed as (UP but in reality is DOWN) as a stream source, streaming will sit there hanging forever.
- If that doesn't happen (by picking another host), it will instead finish bootstrapping correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus routing requests to them, generating timeouts.
The way to get out of this is to restart the node(s) that you bootstrapped.
I have tested and confirmed the symptom (that the bootstrapped node things other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all details below refer to 0.8 but are probably similar in 1.0.
Steps to reproduce:
- Bring up a cluster of >= 3 nodes. Ensure RF is < N, so that the cluster is operative with one node removed.
- Pick two random nodes A, and B. Shut them both off.
- Wait for everyone to realize they are both off (for good measure).
- Now, take node A and nuke it's data directories and re-start it, such that it comes up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it).
- Watch how node A starts up, all the while believing node B is down, even though all other nodes in the cluster agree that B is down and B is in fact still turned off.
The mechanism by which it initially goes into Up state is that the node receives a gossip response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally().
Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so the else statement at the end ("it's a new node") gets triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state (but "dead" here does not mean "not up", but refers to joining/hibernate etc).
So at this point the node is up in the mind of the node you just bootstrapped.
Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including the one falsly Up) and among other things, calls FailureDetector.interpret() on each node.
FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially convict it. However there is a short-circuit at the top, whereby if we do not yet have any arrival window for the node, we simply return immediately.
Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case because the initial endpoint state we added, which came from a remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector() will never call report()).
The result is that the node can never ever be convicted.
Now, let's ignore for a moment the problem that a node that is actually Down will be thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a fix to the more serious problem in this ticket - which is that is stays up forever.
- When interpret() gets called and there is no arrival window, we could add a faked arrival window far back in time to cause the node to have history and be marked down. This "works" in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular speed, it might take a significant time before we get any kind of confirmation from someone else that it's actually Up in cases where the node actually is Up, so it's not clear that this is a good idea.
- When interpret() gets called and there is no arrival window, we can simply convict it immediately. This has roughly similar behavior as the previous suggestion.
- When interpret() gets called and there is no arrival window, we can add a faked arrival window at the current time, which will allow it to be treated as Up until the usual time has passed before we exceed the Phi conviction threshold.
- When interpret() gets called and there is no arrival window, we can immediately convict it, and schedule it for immediate gossip on the next round in order to try to ensure they go Up quickly if they are indeed up. This has an effect of O gossip traffic, as a special case once during node start-up. While theoretically a problem, I personally thing we can ignore it for now since it won't be a significant problem any time soon. However, this is more complicated since the way we queue up messages is asynchronously to background connection attempts. We'd have to make sure the initial gossip message actually gets sent on an open TCP connection (I haven't confirmed whether this will be the case or not).
The first three are simple to implement, possibly the fourth. But in all cases, I am worried about potential negative consequences that I am not seeing.