CASSANDRA-14559

Check for endpoint collision with hibernating nodes

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Labels:
      None
    • Severity:
      Normal

      Description

      I ran across an edge case when replacing a node with the same address. This issue results in the node (and its tokens) being unsafely removed from gossip.

      Steps to replicate:

      1. Create a 3-node cluster.
      2. Stop a node.
      3. Replace the stopped node with a new node at the same address, using the replace_address flag.
      4. Stop the node before it finishes bootstrapping.
      5. Remove the replace_address flag and restart the node to resume bootstrapping (if the data dir is also cleared at this point, the node will also generate new tokens when it starts).
      6. Stop the node again before it finishes bootstrapping.
      7. 30 seconds later, the node will be removed from gossip because it now matches the check for a FatClient.
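
      To illustrate why step 7 happens, here is a minimal sketch of a fat-client eviction rule. It is a simplified model, not Cassandra's actual code: the class and method names are hypothetical, the 30-second timeout is taken from the observation above, and it assumes that hibernate counts as a "dead" state while a plain bootstrapping state does not.

```java
import java.util.concurrent.TimeUnit;

// Simplified, hypothetical model of a fat-client eviction rule (not Cassandra's real code).
class FatClientCheck {
    // Assumed timeout, matching the 30 seconds observed in the report.
    static final long FAT_CLIENT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

    // A gossip-only member: known to gossip, not in a dead state, and owning no tokens in the ring.
    static boolean isGossipOnlyMember(boolean inDeadState, boolean ownsTokensInRing) {
        return !inDeadState && !ownsTokensInRing;
    }

    // Evict a gossip-only member once it has been silent for longer than the timeout.
    static boolean shouldEvict(boolean inDeadState, boolean ownsTokensInRing, long millisSinceLastUpdate) {
        return isGossipOnlyMember(inDeadState, ownsTokensInRing)
               && millisSinceLastUpdate > FAT_CLIENT_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        // After step 6: the node last gossiped a bootstrapping (non-dead) state and is not a ring member.
        System.out.println(shouldEvict(false, false, 31_000)); // true
        // After step 4: the node is still in hibernate (assumed dead state), so it is left alone.
        System.out.println(shouldEvict(true, false, 31_000));  // false
    }
}
```

      Under these assumptions, restarting without the replace_address flag swaps the hibernate state for a non-dead one, which is what makes the stopped node eligible for eviction.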

      I think this is only an issue when replacing a node with the same address because other replacements now use STATUS_BOOTSTRAPPING_REPLACE and leave the dead node unchanged.

      I believe the simplest fix for this is to add a check that prevents a non-bootstrapped node (without the replace_address flag) from starting if there is a gossip entry for the same address in the hibernate state.
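
      The proposed check could be sketched roughly as follows. This is only an illustration of the idea, not the attached patch: the class, method, and parameter names are hypothetical, and gossip state is modeled as a plain map from address to status string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed startup check (not the actual patch).
class HibernateCollisionCheck {
    // Assumed status string for the hibernate state.
    static final String HIBERNATE = "hibernate";

    // Refuse to start a non-bootstrapped node without replace_address
    // when gossip already holds a hibernating entry for the same address.
    static void check(String localAddress,
                      boolean bootstrapped,
                      boolean replaceAddressSet,
                      Map<String, String> gossipStatusByAddress) {
        if (bootstrapped || replaceAddressSet) {
            return; // normal restart or explicit replacement: nothing to check
        }
        if (HIBERNATE.equals(gossipStatusByAddress.get(localAddress))) {
            throw new IllegalStateException(
                "A hibernating node with address " + localAddress
                + " already exists in gossip; restart with replace_address to continue the replacement");
        }
    }

    public static void main(String[] args) {
        Map<String, String> gossip = new HashMap<>();
        gossip.put("10.0.0.3", HIBERNATE); // example address, left behind by the interrupted replacement

        try {
            check("10.0.0.3", false, false, gossip); // step 5 of the reproduction
        } catch (IllegalStateException e) {
            System.out.println("Startup refused: " + e.getMessage());
        }
    }
}
```

      With a check like this, step 5 would fail fast instead of overwriting the hibernate state, so the node would never become eligible for fat-client removal.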

      3.11 PoC

       

            People

            • Assignee:
              Vincent White
              Reporter:
              Vincent White
              Authors:
              Vincent White
            • Votes:
              0
              Watchers:
              4