Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
4.0.12, 4.1.4, 5.0-alpha2, 5.0
-
None
-
Correctness - Unrecoverable Corruption / Loss
-
Critical
-
Low Hanging Fruit
-
User Report
-
All
-
None
-
Description
When an instance either disables gossip or shuts down we send a gossip shutdown message, peers ignore it if the endpoint isn’t known, else it mutates its local copy of the state to mark shutdown…
When an instance restarts it populates gossip with the endpoints found in peers, but the state is empty (not null)
So, there is a fun timing bug…
- stop node1
- start node1; at this point all known endpoints before exist in gossip but are empty
- node2 shutdown (gossip shutdown or node, doesn’t matter)
- node1 sees the shutdown before gossip messages, and gets corruptted
- node3 tries to join the cluster, fails due to node1 being corrupted
There are 2 different patterns the NPE can happen with, in this example node1 and node3 will have different stack traces
org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)} at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218) at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208) at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279) at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756) at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611) at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39) at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78) Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)} at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218) at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208) at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279) at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756) at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762) at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793) at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465) at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678) at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50) at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
Attachments
Issue Links
- is related to
-
CASSANDRA-19115 Avoid bad serializedSize when sending GOSSIP_SHUTDOWN message
- Resolved
-
CASSANDRA-20033 Shutdown message doesn't have generation check causing normal node considered shutdown by other nodes in cluster
- Resolved
- links to