
CASSANDRA-18319: Cassandra in Kubernetes: IP switch decommission issue


Details

    • Type: Bug
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: 5.x
    • Component/s: Cluster/Gossip
    • Labels: None
    • Bug Category: Correctness - Transient Incorrect Response
    • Severity: Normal
    • Complexity: Normal
    • Discovered By: User Report
    • Platform: All
    • Impacts: None

    Description

      We have recently encountered a recurring issue, in which a node’s old IP reappears in the ring, while testing decommissions on some of our Kubernetes Cassandra staging clusters.

      Issue Description

      In Kubernetes, a Cassandra node can change IP at each pod bounce. We have noticed that this behavior, combined with a decommission operation, can put the cluster into an erroneous state.

      Consider the following situation: a Cassandra node node1, with hostId1, owning 20.5% of the token ring, bounces and switches IP (old_IP → new_IP). After a couple of gossip iterations, all other nodes’ nodetool status output includes a new_IP UN entry owning 20.5% of the token ring and no old_IP entry.

      Shortly after the bounce, node1 gets decommissioned. Our cluster does not hold much data, and the decommission operation completes quickly. Logs on other nodes start showing acknowledgment that node1 has left, and soon nodetool status’ new_IP UL entry disappears. node1’s pod is then deleted.

      About a minute later, the cluster enters the erroneous state. An old_IP DN entry reappears in nodetool status, owning 20.5% of the token ring. No node owns this IP anymore, and according to the logs, old_IP is still associated with hostId1.
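
      For illustration, the erroneous state looks roughly like the following nodetool status output; the datacenter name, addresses, load, token counts, host IDs and rack are placeholders, not values from our clusters:

        Datacenter: dc1
        ===============
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address    Load   Tokens  Owns    Host ID    Rack
        UN  <peer_IP>  ...    16      79.5%   <hostId2>  rack1
        DN  <old_IP>   ?      16      20.5%   <hostId1>  rack1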

      Issue Root Cause

      By digging through Cassandra logs and re-testing this scenario repeatedly, we have reached the following conclusions:

      • Other nodes will continue exchanging gossip about old_IP, even after it becomes a fatClient.
      • The fatClient timeout and subsequent quarantine do not stop old_IP from reappearing in a node’s Gossip state once its quarantine is over. We believe that this is due to a misalignment across nodes of old_IP’s expiration time (see the sketch after this list).
      • Once new_IP has left the cluster and old_IP’s next gossip state message is received by a node, StorageService will no longer face a collision (or will, but with an even older IP) for hostId1 and its corresponding tokens. As a result, old_IP will regain ownership of 20.5% of the token ring.
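
      The misalignment can be pictured with the simplified sketch below: each node starts its own quarantine timer from its local clock at the moment it removes old_IP, so the point at which gossip about old_IP is accepted again differs from node to node. The class is purely illustrative (only the names QUARANTINE_DELAY and justRemovedEndpoints mirror Gossiper internals; the rest is ours):

        import java.net.InetAddress;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Illustrative model of per-node quarantine bookkeeping; not actual Gossiper code.
        public class QuarantineSketch
        {
            // In Gossiper the delay is derived from the ring delay; the exact value is irrelevant here.
            static final long QUARANTINE_DELAY_MS = 60_000;

            // endpoint -> local timestamp at which *this* node quarantined it
            private final Map<InetAddress, Long> justRemovedEndpoints = new ConcurrentHashMap<>();

            void quarantine(InetAddress endpoint)
            {
                // Each node records its own local time, so quarantine windows are not aligned cluster-wide.
                justRemovedEndpoints.put(endpoint, System.currentTimeMillis());
            }

            boolean acceptsGossipAbout(InetAddress endpoint)
            {
                Long quarantinedAt = justRemovedEndpoints.get(endpoint);
                // Once this node's locally computed window has elapsed, gossip about the endpoint is
                // accepted again; a peer whose window has not yet elapsed (or that never quarantined
                // the endpoint) can then re-propagate old_IP's state to everyone else.
                return quarantinedAt == null
                       || System.currentTimeMillis() - quarantinedAt > QUARANTINE_DELAY_MS;
            }
        }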

      Proposed fix

      Following the above investigation, we are considering the following fix:

      When a node receives a gossip status change with STATE_LEFT for a leaving endpoint new_IP, before evicting new_IP from the token ring, purge from Gossip (i.e. evictFromMembership) all endpoints that meet the following criteria:

      • endpointStateMap contains this endpoint
      • The endpoint is not currently a token owner (!tokenMetadata.isMember(endpoint))
      • The endpoint’s hostId matches the hostId of new_IP
      • The endpoint is older than new_IP (per Gossiper.instance.compareEndpointStartup)
      • The endpoint’s token range (from endpointStateMap) intersects with new_IP’s

      The intention of this modification is to force nodes to realign on old_IP’s expiration, and to expunge it from Gossip so it does not reappear after new_IP leaves the ring.
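
      A minimal sketch of the purge we have in mind, written as a helper that would be called from StorageService while handling the STATE_LEFT change, before new_IP is evicted from the token ring. The placement, the helper names purgeStaleAliasesOf and getTokensFromGossip, and the endpoint iteration accessor are assumptions for illustration; evictFromMembership (currently private in Gossiper), tokenMetadata.isMember and Gossiper.instance.compareEndpointStartup are the pieces referenced in the criteria above:

        // Sketch only; assumed to live in StorageService, next to the STATE_LEFT handling.
        private void purgeStaleAliasesOf(InetAddressAndPort leaving)
        {
            UUID leavingHostId = Gossiper.instance.getHostId(leaving);
            // Hypothetical helper: read the endpoint's tokens from the TOKENS state in endpointStateMap.
            Collection<Token> leavingTokens = getTokensFromGossip(leaving);

            // Hypothetical accessor over the keys of Gossiper's endpointStateMap.
            for (InetAddressAndPort candidate : Gossiper.instance.getEndpoints())
            {
                if (candidate.equals(leaving))
                    continue;                                                   // never purge the leaving endpoint itself
                if (tokenMetadata.isMember(candidate))
                    continue;                                                   // still a token owner
                if (!leavingHostId.equals(Gossiper.instance.getHostId(candidate)))
                    continue;                                                   // different host id
                if (Gossiper.instance.compareEndpointStartup(candidate, leaving) >= 0)
                    continue;                                                   // not older than the leaving endpoint
                if (Collections.disjoint(getTokensFromGossip(candidate), leavingTokens))
                    continue;                                                   // token ranges do not intersect

                // candidate is a stale former address of the leaving node (e.g. old_IP):
                // expunge it from Gossip so it cannot resurface once new_IP has left.
                Gossiper.instance.evictFromMembership(candidate);               // would need to be exposed by Gossiper
            }
        }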

      Another approach we have been considering is expunging old_IP at the moment StorageService resolves the hostId collision.
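
      As a rough sketch of that alternative, assuming the hook sits in StorageService.handleStateNormal where the hostId collision between the pre-existing old_IP (existing) and new_IP (endpoint) is resolved; the variable names and exact placement are assumptions:

        // Sketch only: when new_IP wins the host id collision, expunge the stale address from
        // Gossip entirely rather than relying on the (locally timed) quarantine.
        if (existing != null && !existing.equals(endpoint)
            && Gossiper.instance.compareEndpointStartup(endpoint, existing) > 0)
        {
            tokenMetadata.removeEndpoint(existing);
            Gossiper.instance.evictFromMembership(existing); // would need to be exposed, as above
        }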

      Attachments

        1. test_decommission_after_ip_change_logs.zip
          1.03 MB
          Raymond Huffman
        2. node1_gossipinfo.txt
          3 kB
          Raymond Huffman
        3. 3.11_gossipinfo.zip
          4 kB
          Raymond Huffman
        4. v4.0_1678853171792_test_decommission_after_ip_change.zip
          706 kB
          Raymond Huffman
        5. write_failure.txt
          4 kB
          Raymond Huffman


            People

              Assignee: Unassigned
              Reporter: Ines Potier (inespotier)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m