Sorry this has dragged on, I got bogged down with a few other things...
The linked branch modifies StorageService::prepareToJoin to always perform a collision check when not replacing and adds the ability to replace without bootstrap. To summarise the points from the comments above, and how they're addressed in the branch:
- Seed nodes will perform a shadow round, which they're unable to exit when all seeds are started concurrently. The patch addresses this by modifying the response to a shadow digest syn. If a node receiving a shadow syn is itself in a shadow round, rather than ignoring it, it responds with a minimal ack with both the digest list and state map empty.This indicates to the node sending the syn that the seed is in its shadow round. The syn-sending node is permitted to exit its shadow round if it receives a regular ack (current behaviour) or if all seeds are found to be in a shadow round.
- Whilst this is beneficial when all seeds are started concurrently, making that a mandatory requirement is, I feel, going to be overly burdensome for operators. In a cluster with > 1 seed, when all seeds are down the only options are to restart all seeds concurrently or to modify the seed list on one, restart that, then bring the others up before finally restoring the first seed's list and restarting it again. To fix this, the patch uses Stefania's approach and ultimately makes the shadow round a best effort. Note that this is for nodes which appear in their own seed list only, any node which doesn't consider itself a seed with fail to startup if it cannot complete a shadow round. I've also added a property, cassandra.allow_unsafe_join to skip the shadow round completely for use in testing.
- The collision check itself needs to be extended, as it's no longer performed only for bootstrapping nodes. When a node is not bootstrapping, we need to verify that its address is not already associated with another host id. When the node is bootstrapping, behaviour remains the same as before. That is, any previous status for the endpoint is retrieved from the shadow round & tested against a blacklist of disallowed previous states.
- As Paulo Motta noted, a side effect is to disallow the unsafe, ghetto-replace approach to dealing with scenarios such as a JBOD disk failure. With that in mind patch decouples bootstrap and replacement, so that the combination of replace_address(_first_boot) and auto_bootstrap=false is permitted. As this is a "genuine (but unsafe)" scenario, I've added startup flag to ensure that operators are cognizant of the risk involved in doing so. A benefit of this over the documented approach of manually setting initial tokens in yaml is that the replacement tokens are retrieved from gossip, reducing the chance for operator error. Performing this non-bootstrapping replace requires -Dcassandra.allow_unsafe_replace=true at startup.
The best-effort-only approach to the shadow round for seed nodes clearly undermines the safety benefits of all this somewhat, raising the question of whether it's actually worth making the modifications to Syn & Ack handling. In my opinion, those changes are fairly minimal & don't make things significantly harder to reason about, so I'm ok with including those but I won't argue too strongly if dropping them is suggested.
One of the utest failures looks particularly suspect. My suspicion is that something is already racy with the test setup as I can find the exact same failure in recent-ish another run. So far I've been unable to figure out exactly what the problem is by inspection & I haven't been able to repro the failure in > 100 test runs locally.
There were also some dtest failures on the earlier runs, caused by a bug which I've since fixed & verified locally (CI is pending).
I've pushed a dtest branch containing Tyler Hobbs' original new test, plus another for the unsafe replace operation here. A follow up will be to identify which tests are adversely affected by the mandatory shadow round (I know that ttl_test definitely is), then have them skip it if appropriate.