  1. Apache Cassandra
  2. CASSANDRA-17691

Gossip/Decommission tasklock contention on large clusters

Details

    • Challenging
    • Adhoc Test
    • All
    • None

    Description

      Hi,

      I am a researcher working on finding scale issues in distributed systems. While analyzing Cassandra 4.0.0 I found a potential issue on the gossip path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' (line 1958) holds the task lock while making calls that can end up invoking getAddressReplicas, as in the following call chain (format is [method][lineNumber]):

      [org.apache.cassandra.gms.Gossiper.addLocalApplicationStates] [1958]
      Type=EXPLICIT_LOCK, start=1960, end=1970 // Lock being held along these lines
       [org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]
        [org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]
         [org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]
          [org.apache.cassandra.service.StorageService.onChange][1551]
           [org.apache.cassandra.service.StorageService.handleStateRemoving][2308]
            [org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]
             [org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]
              [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]
               [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][284]
               [line=243, dimensions=[Peers * Tokens]] // Approx. Complexity of this loop

       

      This appears to affect the decommission path. The cost of the loop depends at least on the number of peers and tokens in the cluster, so when a node is decommissioned in a cluster with many peers and tokens, this path holds the Gossiper's task lock for a long time, which could cause flapping.
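
      To illustrate the concern, here is a minimal toy model, not Cassandra code. All names in it (TaskLockSketch, visitPairs, gossipRoundRunsDuringStateChange) are hypothetical, and the nested loop only stands in for the Peers * Tokens work reported above. It assumes the task lock behaves like a plain ReentrantLock, and it probes the lock with tryLock() only to make the demonstration deterministic; a blocking acquire would simply wait until the loop finishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the reported contention; hypothetical names, not Cassandra code. */
final class TaskLockSketch {
    // Stands in for the Gossiper's task lock (assumed to be a plain ReentrantLock).
    static final ReentrantLock taskLock = new ReentrantLock();

    /** Stand-in for the nested loop: one unit of work per (peer, token) pair. */
    static long visitPairs(int peers, int tokensPerPeer) {
        long visited = 0;
        for (int p = 0; p < peers; p++) {
            for (int t = 0; t < tokensPerPeer; t++) {
                visited++;
            }
        }
        return visited;
    }

    /** Returns true if a gossip round could take the lock while the state change holds it. */
    static boolean gossipRoundRunsDuringStateChange(int peers, int tokensPerPeer) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch probeDone = new CountDownLatch(1);
        Thread stateChange = new Thread(() -> {
            taskLock.lock();
            try {
                visitPairs(peers, tokensPerPeer); // work done under the lock
                lockHeld.countDown();
                probeDone.await();                // keep holding until the probe has run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                taskLock.unlock();
            }
        });
        stateChange.start();
        try {
            lockHeld.await();
            boolean acquired = taskLock.tryLock(); // the "gossip round" probing the lock
            if (acquired) {
                taskLock.unlock();
            }
            probeDone.countDown();
            stateChange.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pairs visited under the lock: " + visitPairs(1000, 256));
        System.out.println("gossip round acquired lock: " + gossipRoundRunsDuringStateChange(1000, 256));
    }
}
```

      In this model the amount of work done under the lock grows as peers times tokens per peer, and any other thread that needs the lock cannot proceed until that work is finished.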

      This is likely to affect other 4.x versions too.

          People

            Unassigned Unassigned
            ucarebugfinder BugFinder