
CASSANDRA-18319: Cassandra in Kubernetes: IP switch decommission issue


Details

    • Type: Bug
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: 5.x
    • Component/s: Cluster/Gossip
    • Labels: None
    • Bug Category: Correctness - Transient Incorrect Response
    • Severity: Normal
    • Complexity: Normal
    • Discovered By: User Report
    • Platform: All
    • Impacts: None

    Description

      We have recently encountered a recurring issue, in which a node’s old IP reappears in the ring, while testing decommissions on some of our Kubernetes Cassandra staging clusters.

      Issue Description

      In Kubernetes, a Cassandra node can change IP at each pod bounce. We have noticed that this behavior, combined with a decommission operation, can put the cluster into an erroneous state.

      Consider the following situation: a Cassandra node node1, with hostId1, owning 20.5% of the token ring, bounces and switches IP (old_IP → new_IP). After a couple of gossip iterations, all other nodes’ nodetool status output includes a new_IP UN entry owning 20.5% of the token ring and no old_IP entry.

      Shortly after the bounce, node1 gets decommissioned. Our cluster does not hold much data, and the decommission operation completes quickly. Logs on other nodes start showing acknowledgment that node1 has left, and soon nodetool status’ new_IP UL entry disappears. node1’s pod is then deleted.

      About a minute later, the cluster enters the erroneous state. An old_IP DN entry reappears in nodetool status, owning 20.5% of the token ring. No node owns this IP anymore, and according to the logs, old_IP is still associated with hostId1.
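
      For illustration, the erroneous state looks roughly like the following nodetool status output; the datacenter name, addresses, load, token counts, host IDs and rack are placeholders, not values from our clusters:

        Datacenter: dc1
        ===============
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address    Load   Tokens  Owns    Host ID    Rack
        UN  <peer_IP>  ...    16      79.5%   <hostId2>  rack1
        DN  <old_IP>   ?      16      20.5%   <hostId1>  rack1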

      Issue Root Cause

      By digging through Cassandra logs and re-testing this scenario repeatedly, we have reached the following conclusions:

      • Other nodes will continue exchanging gossip about old_IP, even after it becomes a fatClient.
      • The fatClient timeout and subsequent quarantine do not stop old_IP from reappearing in a node’s Gossip state once its quarantine is over. We believe that this is due to a misalignment across nodes of old_IP’s expiration time (see the sketch after this list).
      • Once new_IP has left the cluster and old_IP’s next gossip state message is received by a node, StorageService will no longer face a collision (or will, but with an even older IP) for hostId1 and its corresponding tokens. As a result, old_IP will regain ownership of 20.5% of the token ring.
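
      The misalignment can be pictured with the simplified sketch below: each node starts its own quarantine timer from its local clock at the moment it removes old_IP, so the point at which gossip about old_IP is accepted again differs from node to node. The class is purely illustrative (only the names QUARANTINE_DELAY and justRemovedEndpoints mirror Gossiper internals; the rest is ours):

        import java.net.InetAddress;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Illustrative model of per-node quarantine bookkeeping; not actual Gossiper code.
        public class QuarantineSketch
        {
            // In Gossiper the delay is derived from the ring delay; the exact value is irrelevant here.
            static final long QUARANTINE_DELAY_MS = 60_000;

            // endpoint -> local timestamp at which *this* node quarantined it
            private final Map<InetAddress, Long> justRemovedEndpoints = new ConcurrentHashMap<>();

            void quarantine(InetAddress endpoint)
            {
                // Each node records its own local time, so quarantine windows are not aligned cluster-wide.
                justRemovedEndpoints.put(endpoint, System.currentTimeMillis());
            }

            boolean acceptsGossipAbout(InetAddress endpoint)
            {
                Long quarantinedAt = justRemovedEndpoints.get(endpoint);
                // Once this node's locally computed window has elapsed, gossip about the endpoint is
                // accepted again; a peer whose window has not yet elapsed (or that never quarantined
                // the endpoint) can then re-propagate old_IP's state to everyone else.
                return quarantinedAt == null
                       || System.currentTimeMillis() - quarantinedAt > QUARANTINE_DELAY_MS;
            }
        }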

      Proposed fix

      Following the above investigation, we are considering the following fix:

      When a node receives a gossip status change with STATE_LEFT for a leaving endpoint new_IP, before evicting new_IP from the token ring, purge from Gossip (i.e. evictFromMembership) all endpoints that meet the following criteria:

      • endpointStateMap contains this endpoint
      • The endpoint is not currently a token owner (!tokenMetadata.isMember(endpoint))
      • The endpoint’s hostId matches the hostId of new_IP
      • The endpoint is older than new_IP (per Gossiper.instance.compareEndpointStartup)
      • The endpoint’s token range (from endpointStateMap) intersects with new_IP’s

      The intention of this modification is to force nodes to realign on old_IP’s expiration, and to expunge it from Gossip so it does not reappear after new_IP leaves the ring.
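
      A minimal sketch of the purge we have in mind, written as a helper that would be called from StorageService while handling the STATE_LEFT change, before new_IP is evicted from the token ring. The placement, the helper names purgeStaleAliasesOf and getTokensFromGossip, and the endpoint iteration accessor are assumptions for illustration; evictFromMembership (currently private in Gossiper), tokenMetadata.isMember and Gossiper.instance.compareEndpointStartup are the pieces referenced in the criteria above:

        // Sketch only; assumed to live in StorageService, next to the STATE_LEFT handling.
        private void purgeStaleAliasesOf(InetAddressAndPort leaving)
        {
            UUID leavingHostId = Gossiper.instance.getHostId(leaving);
            // Hypothetical helper: read the endpoint's tokens from the TOKENS state in endpointStateMap.
            Collection<Token> leavingTokens = getTokensFromGossip(leaving);

            // Hypothetical accessor over the keys of Gossiper's endpointStateMap.
            for (InetAddressAndPort candidate : Gossiper.instance.getEndpoints())
            {
                if (candidate.equals(leaving))
                    continue;                                                   // never purge the leaving endpoint itself
                if (tokenMetadata.isMember(candidate))
                    continue;                                                   // still a token owner
                if (!leavingHostId.equals(Gossiper.instance.getHostId(candidate)))
                    continue;                                                   // different host id
                if (Gossiper.instance.compareEndpointStartup(candidate, leaving) >= 0)
                    continue;                                                   // not older than the leaving endpoint
                if (Collections.disjoint(getTokensFromGossip(candidate), leavingTokens))
                    continue;                                                   // token ranges do not intersect

                // candidate is a stale former address of the leaving node (e.g. old_IP):
                // expunge it from Gossip so it cannot resurface once new_IP has left.
                Gossiper.instance.evictFromMembership(candidate);               // would need to be exposed by Gossiper
            }
        }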

      Another approach we have been considering is expunging old_IP at the moment StorageService resolves the hostId collision.
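
      As a rough sketch of that alternative, assuming the hook sits in StorageService.handleStateNormal where the hostId collision between the pre-existing old_IP (existing) and new_IP (endpoint) is resolved; the variable names and exact placement are assumptions:

        // Sketch only: when new_IP wins the host id collision, expunge the stale address from
        // Gossip entirely rather than relying on the (locally timed) quarantine.
        if (existing != null && !existing.equals(endpoint)
            && Gossiper.instance.compareEndpointStartup(endpoint, existing) > 0)
        {
            tokenMetadata.removeEndpoint(existing);
            Gossiper.instance.evictFromMembership(existing); // would need to be exposed, as above
        }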

      Attachments

        1. test_decommission_after_ip_change_logs.zip
          1.03 MB
          Raymond Huffman
        2. node1_gossipinfo.txt
          3 kB
          Raymond Huffman
        3. 3.11_gossipinfo.zip
          4 kB
          Raymond Huffman
        4. v4.0_1678853171792_test_decommission_after_ip_change.zip
          706 kB
          Raymond Huffman
        5. write_failure.txt
          4 kB
          Raymond Huffman


            People

              Assignee: Unassigned
              Reporter: Ines Potier (inespotier)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m