[CASSANDRA-17691] Gossip/Decommission tasklock contention on large clusters - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Invalid
Fix Version/s: None
Component/s: Cluster/Gossip, Cluster/Membership
Labels: None
Complexity: Challenging
Discovered By: Adhoc Test
Platform: All
Impacts: None
Since Version: 4.0.0

Description

Hi,

I am a researcher working on finding scale issues in distributed systems. While analyzing Cassandra 4.0.0, I found a potential issue on the Gossip path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' (line 1958) holds the task lock across a call chain that can end in getAddressReplicas, like this (format is [method][lineNumber]):

[org.apache.cassandra.gms.Gossiper.addLocalApplicationStates] [1958]
Type=EXPLICIT_LOCK, start=1960, end=1970 // Lock being held along these lines
[org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]
[org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]
[org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]
[org.apache.cassandra.service.StorageService.onChange][1551]
[org.apache.cassandra.service.StorageService.handleStateRemoving][2308]
[org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]
[org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]
[org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]
[org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][284]
[line=243, dimensions=[Peers * Tokens]] // Approx. Complexity of this loop

This appears to affect the decommission path. The cost of the loop depends on at least the number of tokens and the number of peers in the cluster, so decommissioning a node in a cluster with many peers and tokens will hold the Gossiper's task lock for a long time, which could in turn cause flapping.

This is likely to affect other 4.x versions too.
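
To make the reported pattern concrete, here is a minimal, self-contained sketch. It is not Cassandra code: the class `TaskLockSketch`, the `taskLock` field, and both methods are hypothetical stand-ins. It assumes, as the report does, that the replica calculation costs roughly one step per (peer, token) pair and that other gossip work needs the same lock. The probe uses latches rather than timing, so it only shows that the lock is unavailable to other threads while the state-change thread holds it, not how long that lasts in a real cluster.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class TaskLockSketch {
    // Hypothetical stand-in for the Gossiper's task lock.
    static final ReentrantLock taskLock = new ReentrantLock();

    // Hypothetical stand-in for the replica calculation:
    // one step per (peer, token) pair, i.e. Peers * Tokens.
    static long replicaSteps(int peers, int tokensPerPeer) {
        long steps = 0;
        for (int p = 0; p < peers; p++) {
            for (int t = 0; t < tokensPerPeer; t++) {
                steps++;
            }
        }
        return steps;
    }

    // Returns true if a second thread can take the lock while the
    // state-change thread still holds it after running the calculation.
    static boolean gossipCanRunWhileLockHeld(int peers, int tokensPerPeer) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch probeDone = new CountDownLatch(1);
        Thread stateChange = new Thread(() -> {
            taskLock.lock();
            try {
                replicaSteps(peers, tokensPerPeer); // work done under the lock
                lockHeld.countDown();
                probeDone.await();                  // keep holding until probed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                taskLock.unlock();
            }
        });
        stateChange.start();
        try {
            lockHeld.await();
            boolean acquired = taskLock.tryLock();  // what a gossip round would need
            if (acquired) {
                taskLock.unlock();
            }
            probeDone.countDown();
            stateChange.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("steps for 1000 peers x 256 tokens = " + replicaSteps(1000, 256));
        System.out.println("gossip can run while lock held = " + gossipCanRunWhileLockHeld(1000, 256));
    }
}
```

With 1000 peers and 256 tokens per peer (illustrative numbers, not measurements), the stand-in loop performs 256,000 steps, all of them while the lock is held.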

People

Assignee: Unassigned

Reporter: BugFinder

Votes: 0

Watchers: 1

Dates

Created: 10/Jun/22 15:33

Updated: 10/Jun/22 16:00

Resolved: 10/Jun/22 15:41