Cassandra / CASSANDRA-16159

Concurrent host replacements can exchange incomplete gossip state if one is a seed



    • Bug Category: Code - Bug - Unclear Impact
    • Severity: Low
    • Complexity: Low Hanging Fruit
    • Discovered By: User Report
    • Platform: All
    • Impacts: None


      Noticed the following error in the failure detector during a host replacement:

      java.lang.IllegalArgumentException: Unknown endpoint:
      	at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
      	at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
      	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
      	at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
      	at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
      	at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
      	at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
      	at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
      	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
      	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
      	at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)

      This particular error looks benign: even when it occurs, the node continues to handle the BOOT_REPLACE state. There are two things we might do to improve FailureDetector#isAlive(), though:

      1.) We don’t short-circuit when the endpoint in question is in quarantine after being removed. Checking for this would let us avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine works well when the gossip message comes from a quarantined endpoint, but in this case the sender would be the new/replacing node, not the old/replaced one.)

      2.) We can reduce the severity of the logging from ERROR to WARN and give better context for deciding whether there is actually a problem (e.g., “If this occurs while trying to determine liveness for a node that is currently being replaced, it can be safely ignored.”).
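
To make the two suggestions concrete, here is a minimal sketch of how the liveness check might behave with both applied. It is a simplified stand-in, not Cassandra's FailureDetector: the class LivenessCheckSketch, its in-memory maps, the string endpoints, and the captured log list are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the liveness check; NOT Cassandra's FailureDetector. */
class LivenessCheckSketch {
    // Assumed model of gossip state: endpoint -> alive flag.
    private final Map<String, Boolean> endpointAlive = new HashMap<>();
    // Assumed model of endpoints quarantined after removal.
    private final Set<String> quarantined = new HashSet<>();
    // Captured log lines, so the chosen severity is easy to inspect.
    final List<String> log = new ArrayList<>();

    void markKnown(String endpoint, boolean alive) {
        endpointAlive.put(endpoint, alive);
    }

    // Models a replaced node: its state is dropped and it enters quarantine.
    void removeAndQuarantine(String endpoint) {
        endpointAlive.remove(endpoint);
        quarantined.add(endpoint);
    }

    boolean isAlive(String endpoint) {
        Boolean alive = endpointAlive.get(endpoint);
        if (alive != null) {
            return alive;
        }
        if (quarantined.contains(endpoint)) {
            // Suggestion 1: short-circuit quietly for a removed, quarantined endpoint.
            log.add("DEBUG Endpoint " + endpoint + " is quarantined; treating as dead");
            return false;
        }
        // Suggestion 2: WARN rather than ERROR, with context for the operator.
        log.add("WARN Unknown endpoint: " + endpoint
                + ". If this occurs while trying to determine liveness for a node"
                + " that is currently being replaced, it can be safely ignored.");
        return false;
    }
}
```

In this sketch a replaced endpoint that is still in quarantine is reported dead without any WARN or ERROR, while a genuinely unknown endpoint still produces a message, at reduced severity and with guidance on when it can be ignored.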




            jmeredithco Jon Meredith
            maedhroz Caleb Rackliffe
            Jon Meredith
            Caleb Rackliffe