[CASSANDRA-17572] Race condition when IP address changes for a node can cause reads/writes to route to the wrong node - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: 3.0.x, 3.11.x, 5.x
Component/s: Cluster/Membership
Labels: None
Bug Category: Correctness - Recoverable Corruption / Loss
Severity: Normal
Complexity: Normal
Discovered By: User Report
Platform: All
Impacts: None

Description

Hi,

We noticed a race condition in the trunk of the 3.x code, and confirmed that it’s present in 4.x as well, which results in incorrect reads and missed writes for a very short period of time.

The race condition came to our attention when we started noticing a couple of missed writes on our Cassandra clusters in Kubernetes. The Kubernetes part is relevant because IP changes are far more frequent there than in a traditional setup.

More concretely:

When a Cassandra node is turned off and then starts with a new IP address Z (formerly IP address X), it announces to the cluster (via gossip) that it has IP Z for Host ID Y.
If there are no conflicts, each node will decide to remove the old IP address associated with Host ID Y (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532) from the storage ring. This also causes us to invalidate our token ring cache (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/TokenMetadata.java#L488).
At this time, a new request could come in (read or write), and will re-calculate which endpoints to send the request to, as we’ve invalidated our token ring cache (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java#L88-L104).
However, at this time we’ve only removed the IP address X (former IP address), and have not re-added IP address Z.
As a result, we will choose a new host to route our request to. In our case, our keyspaces all run with NetworkTopologyStrategy, and so we simply choose the node with the next closest token in the same rack as host Y (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L149-L191).
Thus, the request is routed to a different host, rather than the host that has come back online.
However, shortly afterwards, we re-add the host (via its new endpoint) to the token ring (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2549).
This invalidates our cache again, after which requests are routed to the correct host.
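
The sequence above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `RingRaceSketch` class, its method names, the tokens, and the IP addresses are all hypothetical, it ignores racks and datacenters, and it does not use any Cassandra classes. It shows that a replica lookup made between the remove and the re-add picks a host that was never a replica for the key.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Toy token ring for illustration; this is not Cassandra's TokenMetadata. */
class RingRaceSketch {
    // token -> endpoint IP, kept sorted by token
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addEndpoint(int token, String ip) { ring.put(token, ip); }

    void removeEndpoint(String ip) { ring.values().removeIf(ip::equals); }

    /** Walk clockwise from the key's token and take the first rf distinct endpoints. */
    List<String> replicasFor(int keyToken, int rf) {
        List<String> ordered = new ArrayList<>(ring.tailMap(keyToken, true).values());
        ordered.addAll(ring.headMap(keyToken, false).values());
        Set<String> picked = new LinkedHashSet<>();
        for (String ip : ordered) {
            if (picked.size() == rf) break;
            picked.add(ip);
        }
        return new ArrayList<>(picked);
    }

    /** Returns the replica lists seen before, during, and after the IP change. */
    static List<List<String>> simulate() {
        RingRaceSketch r = new RingRaceSketch();
        r.addEndpoint(0, "10.0.0.1");
        r.addEndpoint(100, "10.0.0.2"); // host Y at its old IP X
        r.addEndpoint(200, "10.0.0.3");
        r.addEndpoint(300, "10.0.0.4");
        int keyToken = 50;
        int rf = 2;

        List<List<String>> snapshots = new ArrayList<>();
        snapshots.add(r.replicasFor(keyToken, rf)); // before: Y is a replica
        r.removeEndpoint("10.0.0.2");               // old IP X removed
        snapshots.add(r.replicasFor(keyToken, rf)); // window: Y is missing from the ring
        r.addEndpoint(100, "10.0.0.9");             // host Y re-added at new IP Z
        snapshots.add(r.replicasFor(keyToken, rf)); // after: Y is a replica again
        return snapshots;
    }

    public static void main(String[] args) {
        List<List<String>> s = simulate();
        System.out.println("before IP change: " + s.get(0));
        System.out.println("during window:    " + s.get(1));
        System.out.println("after re-add:     " + s.get(2));
    }
}
```

In this model, the lookup made during the window returns `10.0.0.4`, which did not own the key before and does not own it afterwards. That corresponds to the misrouted request described above.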

A couple of additional thoughts:

This doesn’t affect clusters where the number of nodes is <= RF with NetworkTopologyStrategy.
During this very brief period of time, the CL for all user queries is violated, but the queries are ACK’d as successful.
It’s easy to reproduce this race condition by simply adding a sleep here (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532).
If a cleanup is not run before any range movement, it’s possible for rows that were temporarily written to the wrong node to reappear.
We only tested that the race condition exists in our Cassandra 2.x fork (we're not on 3.x or 4.x). So it's possible that it only affects Cassandra 2.x, though that seems unlikely from reading the code.
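
The first point (clusters with nodes <= RF are unaffected) can be checked with a small sketch. It rests on the same simplifying assumptions as any toy ring: `SmallClusterSketch`, `strayReplicas`, the node names, and the tokens are hypothetical, and racks and datacenters are ignored. When every node is already a replica, removing one endpoint cannot bring a non-replica into the replica set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Toy check of the "nodes <= RF" observation; ignores racks and datacenters. */
class SmallClusterSketch {
    /** First rf distinct endpoints found walking clockwise from keyToken. */
    static Set<String> replicas(TreeMap<Integer, String> ring, int keyToken, int rf) {
        List<String> ordered = new ArrayList<>(ring.tailMap(keyToken, true).values());
        ordered.addAll(ring.headMap(keyToken, false).values());
        Set<String> picked = new LinkedHashSet<>();
        for (String ip : ordered) {
            if (picked.size() == rf) break;
            picked.add(ip);
        }
        return picked;
    }

    /** Endpoints chosen during the window that were not replicas beforehand. */
    static Set<String> strayReplicas(int nodeCount, int rf) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        for (int i = 0; i < nodeCount; i++) {
            ring.put(i * 100, "node-" + i);
        }
        int keyToken = 50; // first owner is node-1 (token 100)
        Set<String> before = replicas(ring, keyToken, rf);
        ring.remove(100);  // node-1's old IP is dropped, new IP not yet added
        Set<String> during = new LinkedHashSet<>(replicas(ring, keyToken, rf));
        during.removeAll(before);
        return during;
    }

    public static void main(String[] args) {
        System.out.println("3 nodes, RF=3: " + strayReplicas(3, 3)); // prints []
        System.out.println("4 nodes, RF=3: " + strayReplicas(4, 3)); // prints [node-0]
    }
}
```

With 3 nodes and RF=3 no stray endpoint appears, while with 4 nodes and RF=3 the lookup falls through to `node-0`, which was not a replica for the key.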

People

Assignee: Unassigned

Reporter: Sam Kramer

Votes: 0

Watchers: 4

Dates

Created: 21/Apr/22 16:34

Updated: 07/Mar/23 10:54