Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Invalid
-
None
-
None
-
Challenging
-
Adhoc Test
-
All
-
None
Description
Hi,
I am a researcher working on finding scale issues in distributed systems. I have been analyzing Cassandra 4.0.0 and found a potential issue on the Gossip path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' (line 1958) holds the task lock across a call chain that can end up invoking getAddressReplicas, like this (format is [method][lineNumber]):
[org.apache.cassandra.gms.Gossiper.addLocalApplicationStates] [1958]
Type=EXPLICIT_LOCK, start=1960, end=1970 // Lock being held along these lines
[org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]
[org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]
[org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]
[org.apache.cassandra.service.StorageService.onChange][1551]
[org.apache.cassandra.service.StorageService.handleStateRemoving][2308]
[org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]
[org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]
[org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]
[org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][284]
[line=243, dimensions=[Peers * Tokens]] // Approx. Complexity of this loop
This seems to affect the decommission path. The complexity depends at least on the number of tokens and peers in the cluster, so when a node is decommissioned in a cluster with a large number of peers and tokens, this path will hold the Gossiper's task lock for a long time, which could cause flapping.
This likely affects other 4.x versions too.
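
To illustrate the pattern I mean, here is a minimal standalone sketch. It is not Cassandra code: the class TaskLockSketch, its methods, and the peer/token counts are all hypothetical, and the nested loop only stands in for the Peers * Tokens loop reported in the trace. It shows a change callback running while a lock is held, so the number of steps executed under the lock grows with peers times tokens.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TaskLockSketch {
    // Hypothetical stand-in for the Gossiper task lock (assumption, not the real field).
    static final ReentrantLock taskLock = new ReentrantLock();

    // Records whether the callback observed the lock as held by the calling thread.
    static boolean lockHeldDuringScan = false;

    // Stand-in for the onChange callback: performs peers * tokensPerPeer steps.
    static long onChange(int peers, int tokensPerPeer) {
        lockHeldDuringScan = taskLock.isHeldByCurrentThread();
        long steps = 0;
        for (int p = 0; p < peers; p++) {
            for (int t = 0; t < tokensPerPeer; t++) {
                steps++; // one unit of work per (peer, token) pair
            }
        }
        return steps;
    }

    // Stand-in for the locked entry point: the callback runs before the lock is released.
    static long addLocalState(int peers, int tokensPerPeer) {
        taskLock.lock();
        try {
            return onChange(peers, tokensPerPeer);
        } finally {
            taskLock.unlock();
        }
    }

    public static void main(String[] args) {
        // Steps executed while the lock is held, for two illustrative cluster sizes.
        System.out.println(addLocalState(100, 256));
        System.out.println(addLocalState(1000, 256));
    }
}
```

In this sketch, a tenfold increase in peers gives a tenfold increase in the work done before the lock is released. Any other thread waiting on the same lock is blocked for that whole time.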