[CASSANDRA-18913] Gossip NPE due to shutdown event corrupting empty statuses - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.0.12, 4.1.4, 5.0-alpha2, 5.0
Component/s: Cluster/Gossip, Cluster/Membership
Labels:
None

Bug Category:
Correctness - Unrecoverable Corruption / Loss
Severity:
Critical
Complexity:
Low Hanging Fruit
Discovered By:
User Report
Platform:

All
Impacts:

None
Since Version:

3.0.0
Source Control Link:

https://github.com/apache/cassandra/commit/2bab3f27ba1535203d61497abe6810cdcb4640d0
Test and Documentation Plan:

Hide

new tests

Show
new tests

Description

When an instance either disables gossip or shuts down we send a gossip shutdown message, peers ignore it if the endpoint isn’t known, else it mutates its local copy of the state to mark shutdown…
When an instance restarts it populates gossip with the endpoints found in peers, but the state is empty (not null)

So, there is a fun timing bug…

stop node1
start node1; at this point all known endpoints before exist in gossip but are empty
node2 shutdown (gossip shutdown or node, doesn’t matter)
node1 sees the shutdown before gossip messages, and gets corruptted
node3 tries to join the cluster, fails due to node1 being corrupted

There are 2 different patterns the NPE can happen with, in this example node1 and node3 will have different stack traces

org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
	Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
		at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
		at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
		at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
		at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
		at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
	Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
		at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
		at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
		at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
		at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
		at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
		at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
		at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
		at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)