Cassandra / CASSANDRA-16159

Concurrent host replacements can exchange incomplete gossip state if one is a seed

    Details

    • Bug Category:
      Code - Bug - Unclear Impact
    • Severity:
      Low
    • Complexity:
      Low Hanging Fruit
    • Discovered By:
      User Report
    • Platform:
      All
    • Impacts:
      None

      Description

      Noticed the following error in the failure detector during a host replacement:

      java.lang.IllegalArgumentException: Unknown endpoint: 10.38.178.98:7000
      	at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
      	at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
      	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
      	at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
      	at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
      	at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
      	at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
      	at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
      	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
      	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
      	at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      

      This particular error looks benign: even when it occurs, the node continues to handle the BOOT_REPLACE state. Still, there are two things we might be able to do to improve FailureDetector#isAlive():

      1.) We don’t short-circuit when the endpoint in question is in quarantine after being removed. Checking for this would let us avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine works well when the gossip message comes from a quarantined endpoint, but in this case the message comes from the new/replacing node, not the old/replaced one.)

      2.) We can reduce the severity of the logging from ERROR to WARN and add context that helps the reader decide whether there is actually a problem (e.g., “If this occurs while trying to determine liveness for a node that is currently being replaced, it can be safely ignored.”).
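
The two suggestions above could be combined roughly as follows. This is a minimal, self-contained sketch and not the actual Cassandra code: the class LivenessSketch, its fields, and its helper methods are hypothetical stand-ins for FailureDetector and Gossiper state, and it assumes an unknown endpoint is simply reported as not alive.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

/** Hypothetical, simplified stand-in for a failure detector; not the Cassandra class. */
class LivenessSketch {
    private static final Logger logger = Logger.getLogger(LivenessSketch.class.getName());

    // Assumption: endpoints currently known to gossip, mapped to their liveness.
    private final Map<String, Boolean> knownEndpoints = new ConcurrentHashMap<>();
    // Assumption: endpoints that were removed recently and are still quarantined.
    private final Set<String> quarantined = ConcurrentHashMap.newKeySet();
    // Counts warnings so the behavior can be observed without parsing log output.
    private int warningCount = 0;

    void markAlive(String endpoint) {
        quarantined.remove(endpoint);
        knownEndpoints.put(endpoint, true);
    }

    /** Forget the endpoint's state and place it in quarantine. */
    void remove(String endpoint) {
        knownEndpoints.remove(endpoint);
        quarantined.add(endpoint);
    }

    int warningCount() {
        return warningCount;
    }

    boolean isAlive(String endpoint) {
        Boolean alive = knownEndpoints.get(endpoint);
        if (alive != null)
            return alive;

        // Suggestion 1: a quarantined endpoint was removed on purpose, so
        // report it as dead quietly instead of logging anything.
        if (quarantined.contains(endpoint))
            return false;

        // Suggestion 2: WARN rather than ERROR, with context for the operator.
        warningCount++;
        logger.warning("Unknown endpoint: " + endpoint
                       + ". If this occurs while trying to determine liveness for a node"
                       + " that is currently being replaced, it can be safely ignored.");
        return false;
    }
}
```

With this shape, a liveness check for a replaced node that is still in quarantine returns false without any log output, while a genuinely unknown endpoint produces a single WARN that carries the explanatory message.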

            People

            • Assignee:
              jmeredithco Jon Meredith
              Reporter:
              maedhroz Caleb Rackliffe
              Authors:
              Jon Meredith
              Reviewers:
              Caleb Rackliffe
