Details
- Type: Bug
- Status: Resolved
- Priority: Urgent
- Resolution: Fixed
- Fix Version/s: None
- Bug Category: Correctness - Recoverable Corruption / Loss
- Severity: Critical
- Complexity: Challenging
- Discovered By: User Report
- Platform: All
- Impacts: None
Description
When adding a new node to a cluster, we see many existing nodes reporting the error below:
java.lang.NullPointerException: null
    at o.a.cassandra.gms.Gossiper.getHostId(Gossiper.java:1378)
    at o.a.cassandra.gms.Gossiper.getHostId(Gossiper.java:1373)
    at o.a.c.service.StorageService.handleStateBootstrap(StorageService.java:3088)
    at o.a.c.service.StorageService.onChange(StorageService.java:2783)
    at o.a.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1851)
    at o.a.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1816)
    at o.a.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1749)
    at o.a.c.g.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:81)
    at o.a.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:79)
    at o.a.cassandra.net.InboundSink.accept(InboundSink.java:98)
    at o.a.cassandra.net.InboundSink.accept(InboundSink.java:46)
    at o.a.c.n.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
    at o.a.c.c.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at i.n.u.c.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:829)
After some investigation of this issue, we found that the existing nodes of the cluster had removed the new node as a fat client. The reason is that the new node is busy with gossip: its gossip queue has a lot of tasks piling up, so from the other nodes' point of view its state stops updating. The gossip state for the new node on an existing host is (each entry is NAME:version:value):
/1.1.1.1 generation:1727479926 heartbeat:25 LOAD:20:31174.0 SCHEMA:16:59adb24e-f3cd-3e02-97f0-5b395827453f DC:12:dc1 RACK:14:0 RELEASE_VERSION:5:4.1.3 NET_VERSION:1:12 HOST_ID:2:b9cc4587-68f5-4bb6-a933-fd0c77a064dc INTERNAL_ADDRESS_AND_PORT:8:1.1.1.1:7000 NATIVE_ADDRESS_AND_PORT:3:1.1.1.1:9042 SSTABLE_VERSIONS:6:big-nb TOKENS: not present
Later, this endpoint is removed from the gossip endpoint state map because it is treated as a fat client (a node that owns no tokens; note that TOKENS is not present in the state above):
FatClient /1.1.1.1:7000 has been silent for 30000ms, removing from gossip
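For context: this eviction is done by the periodic status check in Gossiper (Gossiper.doStatusCheck()), which drops the entire endpoint state of a member that owns no tokens and has been quiet longer than the fat-client timeout (30 s by default, per the log line above). Below is a minimal, self-contained sketch of that behavior, with EndpointState modeled as a plain map; the real check also consults TokenMetadata and the quarantine of just-removed endpoints:

import java.util.HashMap;
import java.util.Map;

// Toy model of the fat-client eviction in Gossiper.doStatusCheck() (sketch,
// not Cassandra code): a member with no TOKENS that has not updated its
// gossip state within the timeout loses its whole endpoint state.
public class FatClientEvictionDemo
{
    static final long FAT_CLIENT_TIMEOUT_MS = 30_000;

    // endpoint -> (application state name -> value), a stand-in for EndpointState
    static final Map<String, Map<String, String>> endpointStates = new HashMap<>();
    // endpoint -> last time any of its gossip state was updated
    static final Map<String, Long> lastUpdateMillis = new HashMap<>();

    static void doStatusCheck(long nowMillis)
    {
        endpointStates.entrySet().removeIf(e -> {
            boolean fatClient = !e.getValue().containsKey("TOKENS");  // owns no tokens yet
            boolean silent = nowMillis - lastUpdateMillis.get(e.getKey()) > FAT_CLIENT_TIMEOUT_MS;
            if (fatClient && silent)
                System.out.printf("FatClient %s has been silent for %dms, removing from gossip%n",
                                  e.getKey(), FAT_CLIENT_TIMEOUT_MS);
            return fatClient && silent; // HOST_ID, DC, RACK, ... all dropped together
        });
    }

    public static void main(String[] args)
    {
        // The joining node: HOST_ID/DC/RACK known, but TOKENS not announced yet.
        endpointStates.put("/1.1.1.1:7000", new HashMap<>(Map.of(
            "HOST_ID", "b9cc4587-68f5-4bb6-a933-fd0c77a064dc", "DC", "dc1", "RACK", "0")));
        lastUpdateMillis.put("/1.1.1.1:7000", 0L);
        doStatusCheck(31_000);              // 31s of silence > 30s timeout
        System.out.println(endpointStates); // {} -- the state is gone entirely
    }
}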
But before the endpoint is removed from gossip, the existing node may have sent a gossip SYN message to the new node, asking for this endpoint's gossip info newer than version 20 in this example.
The new node's gossip queue has too many tasks to process, so it cannot handle this request immediately. By the time it sends the gossip ACK back to the existing node, that node has already removed the gossip info about the new node. The gossip state will then look like this on some existing nodes:
/1.1.1.1 generation:1727479926 heartbeat:229 LOAD:200:3.0 SCHEMA:203:59adb24e-f3cd-3e02-97f0-5b395827453f
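The reason only LOAD and SCHEMA come back is that a gossip ACK carries deltas, not the full state: the earlier SYN advertised that the sender already had this endpoint up to version 20, so the new node only ships values with a larger version (the filtering in Gossiper.getStateForVersionBiggerThan()). Because the receiver dropped its copy in the meantime, applyStateLocally() rebuilds the endpoint from the delta alone. A toy model of the exchange (sketch, not Cassandra code):

import java.util.HashMap;
import java.util.Map;

// Toy model of the gossip delta exchange: the real logic lives in
// Gossiper.getStateForVersionBiggerThan() on the sender and
// Gossiper.applyStateLocally() on the receiver.
public class GossipDeltaDemo
{
    public static void main(String[] args)
    {
        // Full state on the new node: application state name -> version.
        Map<String, Integer> newNodeState = new HashMap<>(Map.of(
            "HOST_ID", 2, "DC", 12, "RACK", 14, "LOAD", 200, "SCHEMA", 203));

        // The SYN said "I already have this endpoint up to version 20",
        // so the ACK only carries values with a larger version.
        int versionSeenBySender = 20;
        Map<String, Integer> ackDelta = new HashMap<>();
        newNodeState.forEach((state, version) -> {
            if (version > versionSeenBySender)
                ackDelta.put(state, version);
        });

        // The sender evicted the endpoint as a fat client in the meantime,
        // so it recreates the endpoint state from this delta alone:
        System.out.println(ackDelta); // {LOAD=200, SCHEMA=203} -- HOST_ID/DC/RACK are lost
    }
}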
All the information related to DC, rack, and host ID is gone.
Later, when the new node's gossip settles, it changes its local state to BOOT and decides its tokens. The existing node then receives the STATUS and TOKENS info, and the gossip state becomes:
/1.1.1.1 generation:1727479926 heartbeat:329 LOAD:300:3.0 SCHEMA:303:59adb24e-f3cd-3e02-97f0-5b395827453f STATUS_WITH_PORT:308:BOOT,-142070360466566106 TOKENS:309:<hidden>
When the existing node processes this bootstrap event, we hit the NPE above because HOST_ID is missing.
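Concretely, handleStateBootstrap() looks up the bootstrapping endpoint's host id via Gossiper.getHostId(), which reads the HOST_ID application state and dereferences it without a null check. A minimal, self-contained reproduction of that failure mode (a sketch with the endpoint's gossip state modeled as a map, not the actual Cassandra classes):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal reproduction of the failure mode: the BOOT status arrives for
// an endpoint whose HOST_ID was lost in the race described above.
public class BootstrapNpeDemo
{
    // Gossip state rebuilt from the delta: no HOST_ID, DC, or RACK.
    static final Map<String, String> applicationStates = new HashMap<>(Map.of(
        "LOAD", "3.0",
        "SCHEMA", "59adb24e-f3cd-3e02-97f0-5b395827453f",
        "STATUS_WITH_PORT", "BOOT,-142070360466566106"));

    // Stand-in for Gossiper.getHostId(): the HOST_ID value is parsed
    // without a null check, so a missing entry throws NullPointerException.
    static UUID getHostId()
    {
        return UUID.fromString(applicationStates.get("HOST_ID"));
    }

    public static void main(String[] args)
    {
        String status = applicationStates.get("STATUS_WITH_PORT");
        if (status.startsWith("BOOT"))  // StorageService.onChange -> handleStateBootstrap
            System.out.println("host id: " + getHostId()); // java.lang.NullPointerException: null
    }
}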
This issue also creates consistency problems: on large clusters, many nodes will treat the joining node as a remote-DC node when its DC info is missing.
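That is because the snitch resolves a peer's DC from its gossip state: GossipingPropertyFileSnitch falls back to a default DC ("UNKNOWN_DC" in recent versions) when the DC application state is absent, so replica placement and LOCAL_* consistency levels stop counting the joining node as local. A toy sketch of that fallback (not the exact Cassandra code):

import java.util.HashMap;
import java.util.Map;

// Toy sketch of the snitch fallback: a peer's DC comes from its gossip
// state; when the DC application state is missing, a default is returned
// and the peer no longer looks local to anyone.
public class SnitchFallbackDemo
{
    static final String DEFAULT_DC = "UNKNOWN_DC";
    static final Map<String, String> dcFromGossip = new HashMap<>();

    static String getDatacenter(String endpoint)
    {
        return dcFromGossip.getOrDefault(endpoint, DEFAULT_DC);
    }

    public static void main(String[] args)
    {
        dcFromGossip.put("/2.2.2.2:7000", "dc1"); // healthy peer
        // The joining node lost DC:12:dc1 in the race above:
        System.out.println(getDatacenter("/1.1.1.1:7000")); // UNKNOWN_DC != dc1
    }
}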