[CASSANDRA-16364] Joining nodes simultaneously with auto_bootstrap:false can cause token collision - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: 4.0.x, 4.1.x, 5.0.x
Component/s: Cluster/Membership
Labels: None
Bug Category: Correctness - Consistency
Severity: Normal
Complexity: Normal
Discovered By: User Report
Platform: All
Impacts: None

Description

While bringing up a 6-node ccm cluster to test 4.0-beta4, two nodes chose the same tokens using the default allocate_tokens_for_local_rf. Both nevertheless completed bootstrap with colliding tokens.

We were familiar with this issue from ~~CASSANDRA-13701~~ and ~~CASSANDRA-16079~~, and the workaround is to avoid parallel bootstrap when using allocate_tokens_for_local_rf.
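To illustrate why parallel bootstrap can collide, here is a toy sketch in Python. It assumes the allocator is a deterministic function of the ring view a node currently sees. `allocate_tokens` below is a hypothetical stand-in that splits the largest gaps, not Cassandra's actual allocation algorithm.

```python
from typing import Iterable, List


def allocate_tokens(ring_view: Iterable[int], num_tokens: int) -> List[int]:
    """Hypothetical deterministic allocator: repeatedly split the largest gap.

    Needs at least two tokens in the view. Wrap-around is ignored for brevity.
    """
    tokens = sorted(ring_view)
    chosen = []
    for _ in range(num_tokens):
        # (gap size, gap start) for each pair of neighbouring tokens
        gaps = [(tokens[i + 1] - tokens[i], tokens[i]) for i in range(len(tokens) - 1)]
        size, start = max(gaps)
        mid = start + size // 2
        chosen.append(mid)
        tokens.append(mid)
        tokens.sort()
    return chosen


# Two nodes start at the same time, so neither sees the other's tokens yet.
shared_view = [0, 100, 400]
node_a = allocate_tokens(shared_view, 2)
node_b = allocate_tokens(shared_view, 2)
print(node_a == node_b)  # identical input gives identical tokens: a collision
```

Because both nodes compute from the same view, nothing in the allocation itself distinguishes them. Serializing the bootstraps changes the second node's view and so avoids the collision.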

However, because this is the default behavior, we should try to detect and prevent this situation where possible, as it can break users who rely on parallel bootstrap.

I think we could prevent this as follows:
1. announce intent to bootstrap via gossip (i.e. add the node to gossip without token information)
2. wait for gossip to settle for a longer period (i.e. ring delay)
3. allocate tokens (if multiple bootstrap attempts are detected, tie-break via node-id)
4. broadcast tokens and move on with bootstrap
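
The four steps above could be sketched roughly as follows. This is a toy Python model, not Cassandra code: `GossipState`, `allocate`, and `try_bootstrap` are hypothetical names, and the tie-break is assumed to mean "lowest node-id among pending joiners goes first". Step 2 (waiting for gossip to settle) is modeled by all nodes sharing one already-consistent state object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class GossipState:
    """Toy stand-in for settled gossip state (hypothetical)."""
    # node_id -> tokens; None means "intent announced, tokens not yet chosen"
    endpoints: Dict[str, Optional[List[int]]] = field(default_factory=dict)

    def announce_intent(self, node_id: str) -> None:
        # Step 1: join gossip without token information.
        self.endpoints.setdefault(node_id, None)

    def pending(self) -> List[str]:
        return sorted(n for n, t in self.endpoints.items() if t is None)

    def taken_tokens(self) -> Set[int]:
        return {t for toks in self.endpoints.values() if toks for t in toks}

    def broadcast_tokens(self, node_id: str, tokens: List[int]) -> None:
        # Step 4: publish the chosen tokens.
        self.endpoints[node_id] = tokens


def allocate(taken: Set[int], num_tokens: int) -> List[int]:
    """Placeholder deterministic allocator: smallest unused non-negative ints."""
    out, candidate = [], 0
    while len(out) < num_tokens:
        if candidate not in taken:
            out.append(candidate)
        candidate += 1
    return out


def try_bootstrap(gossip: GossipState, node_id: str, num_tokens: int) -> Optional[List[int]]:
    """Step 3: allocate only if this node wins the tie-break, else return None."""
    pending = gossip.pending()
    if node_id not in pending:
        raise ValueError(f"{node_id} has not announced intent to bootstrap")
    if pending[0] != node_id:
        return None  # another pending node has a lower id; wait and retry
    tokens = allocate(gossip.taken_tokens(), num_tokens)
    gossip.broadcast_tokens(node_id, tokens)
    return tokens


gossip = GossipState()
gossip.broadcast_tokens("node1", [0, 1])    # existing ring member
gossip.announce_intent("node3")
gossip.announce_intent("node2")
print(try_bootstrap(gossip, "node3", 2))    # loses the tie-break to node2
print(try_bootstrap(gossip, "node2", 2))
print(try_bootstrap(gossip, "node3", 2))    # now sees node2's tokens
```

In this model the loser simply retries after the winner has broadcast its tokens, so the second allocation runs against an updated view. A real implementation would also need to handle joiners that announce intent and then die before choosing tokens.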

Issue Links

is caused by

CASSANDRA-13701 Lower default num_tokens

Resolved

is duplicated by

CASSANDRA-19644 deterministic token allocation combined with slow gossip propogation can lead to data loss

Resolved

People

Assignee: Unassigned
Reporter: Paulo Motta
Reviewers: Michael Semb Wever
Votes: 0
Watchers: 11

Dates

Created: 19/Dec/20 22:53
Updated: 19/May/24 08:17