Status: Patch Available
Fix Version/s: None
Component/s: Legacy/Distributed Metadata
I ran across an edge case when replacing a node with the same address. This issue results in the node(and its tokens) being unsafely removed from gossip.
Steps to replicate:
1. Create 3 node cluster.
2. Stop a node
3. Replace the stopped node with a node using the same address using the replace_address flag
4. Stop the node before it finishes bootstrapping
5. Remove the replace_address flag and restart the node to resume bootstrapping (if the data dir is also cleared at this point the node will also generate new tokens when it starts)
6. Stop the node before it finishes bootstrapping again
7. 30 Seconds later the node will be removed from gossip because it now matches the check for a FatClient
I think this is only an issue when replacing a node with the same address because other replacements now use STATUS_BOOTSTRAPPING_REPLACE and leave the dead node unchanged.
I believe the simplest fix for this is to add a check that prevents a non-bootstrapped node (without the replaces_address flag) starting if there is a gossip entry for the same address in the hibernate state.