  Cassandra / CASSANDRA-17572

Race condition when IP address changes for a node can cause reads/writes to route to the wrong node

Details

    • Correctness - Recoverable Corruption / Loss
    • Normal
    • Normal
    • User Report
    • All
    • None

    Description

      Hi,

      We noticed a race condition in the 3.x code, and confirmed that it’s there in 4.x as well. It results in incorrect reads and missed writes for a very short period of time.

      The race condition came to our attention when we started noticing a couple of missed writes in our Cassandra clusters running on Kubernetes. The Kubernetes piece is interesting because IP changes are far more frequent there than in a traditional setup.

      More concretely:

      1. When a Cassandra node is turned off and then starts with a new IP address Z (former IP address X), it announces to the cluster (via gossip) that it has IP Z for Host ID Y.
      2. If there are no conflicts, each node will decide to remove the old IP address associated with Host ID Y (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532) from the storage ring. This also causes us to invalidate our token ring cache (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/TokenMetadata.java#L488 ).
      3. At this time, a new request could come in (read or write), and will re-calculate which endpoints to send the request to, as we’ve invalidated our token ring cache (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java#L88-L104).
      4. However, at this time we’ve only removed the IP address X (former IP address), and have not re-added IP address Z.
      5. As a result, we will choose a new host to route our request to. In our case, our keyspaces all run with NetworkTopologyStrategy, and so we simply choose the node with the next closest token in the same rack as host Y (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L149-L191).
      6. Thus, the request is routed to a different host, rather than the host that has come back online.
      7. However, shortly afterwards, we re-add the host (via its new endpoint) to the token ring (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2549).
      8. This will result in us invalidating our cache, and then again re-routing requests appropriately.
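
      The window in steps 2 to 7 can be illustrated with a minimal sketch. It is a toy model, not Cassandra's TokenMetadata or replication strategy code: the class RingRaceSketch, its methods, the token values, and the single-rack ring are all assumptions made for illustration. It replays the remove-then-add sequence one step at a time and records which replicas a request would be routed to at each point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy single-rack token ring (hypothetical; not Cassandra's real classes). */
public class RingRaceSketch {
    // token -> endpoint IP, kept sorted by token
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addEndpoint(long token, String ip) {
        ring.put(token, ip);
    }

    void removeEndpoint(String ip) {
        ring.values().removeIf(ip::equals);
    }

    /** Walk clockwise from the key's token and collect rf distinct endpoints. */
    List<String> replicasFor(long keyToken, int rf) {
        List<String> ordered = new ArrayList<>(ring.tailMap(keyToken, true).values());
        ordered.addAll(ring.headMap(keyToken, false).values()); // wrap around
        List<String> replicas = new ArrayList<>();
        for (String endpoint : ordered) {
            if (replicas.size() == rf) {
                break;
            }
            if (!replicas.contains(endpoint)) {
                replicas.add(endpoint);
            }
        }
        return replicas;
    }

    /** Returns the replica lists seen before, during, and after the IP change. */
    public static List<List<String>> simulate() {
        RingRaceSketch r = new RingRaceSketch();
        r.addEndpoint(0L, "A");
        r.addEndpoint(100L, "X"); // host Y, old IP X
        r.addEndpoint(200L, "B");
        r.addEndpoint(300L, "C");

        long keyToken = 50L; // assumed token of the requested partition
        int rf = 2;          // assumed replication factor

        List<List<String>> seen = new ArrayList<>();
        seen.add(r.replicasFor(keyToken, rf)); // before the change
        r.removeEndpoint("X");                 // step 2: old IP removed
        seen.add(r.replicasFor(keyToken, rf)); // steps 3-6: request arrives in the gap
        r.addEndpoint(100L, "Z");              // step 7: new IP added for host Y
        seen.add(r.replicasFor(keyToken, rf)); // step 8: routing is correct again
        return seen;
    }

    public static void main(String[] args) {
        List<List<String>> seen = simulate();
        System.out.println("before: " + seen.get(0));
        System.out.println("during: " + seen.get(1));
        System.out.println("after:  " + seen.get(2));
    }
}
```

      In this toy ring, a request that lands between the removal and the re-add is routed to B and C, even though C never owned the range; once Z is added, routing returns to host Y and B.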

      A couple of additional thoughts:

      • This doesn’t affect clusters where the number of nodes is <= RF with NetworkTopologyStrategy.
      • During this very brief period of time, the CL for all user queries is violated, yet the queries are ACK’d as successful.
      • It’s easy to reproduce this race condition by simply adding a sleep here (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532).
      • If a cleanup is not run before any range movement, it’s possible for rows that were temporarily written to the wrong node to re-appear.
      • We tested that the race condition exists in our Cassandra 2.x fork (we're not on 3.x or 4.x). So, there is a possibility that it's specific to Cassandra 2.x, however unlikely that seems from reading the code.
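
      The first point above (no impact when nodes <= RF) can be checked with a small sketch. It is a simplified model that ignores racks and datacenters, and the class ReplicaGapCheck and its methods are hypothetical names, not Cassandra APIs. It computes which endpoints become replicas only while the restarted node is missing from the ring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy check that ignores racks and datacenters (hypothetical; not Cassandra code). */
public class ReplicaGapCheck {

    /** First rf distinct endpoints of a ring given in token order from the key's position. */
    static Set<String> replicas(List<String> ringInTokenOrder, int rf) {
        Set<String> out = new LinkedHashSet<>();
        for (String endpoint : ringInTokenOrder) {
            if (out.size() == rf) {
                break;
            }
            out.add(endpoint);
        }
        return out;
    }

    /** Endpoints that are chosen as replicas only while the restarted node is absent. */
    public static Set<String> unexpectedReplicas(List<String> ring, String restarted, int rf) {
        Set<String> before = replicas(ring, rf);
        List<String> gapRing = new ArrayList<>(ring);
        gapRing.remove(restarted);          // old IP removed, new IP not yet added
        Set<String> during = replicas(gapRing, rf);
        during.removeAll(before);           // keep only endpoints that were not replicas before
        return during;
    }

    public static void main(String[] args) {
        // 3 nodes, RF = 3: every node is already a replica.
        System.out.println(unexpectedReplicas(Arrays.asList("X", "B", "C"), "X", 3));
        // 4 nodes, RF = 2: another node can take the missing node's place.
        System.out.println(unexpectedReplicas(Arrays.asList("X", "B", "C", "A"), "X", 2));
    }
}
```

      Under these assumptions, the three-node case yields no unexpected replica because every remaining node already holds the data, while the four-node case yields C, a node that did not own the range before the IP change.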

          People

            Assignee: Unassigned
            Reporter: Sam Kramer
            Votes: 0
            Watchers: 4
