Cassandra / CASSANDRA-19983

Gossip issue with gossip-only nodes (fat clients) leads to missing DC/Rack/Host ID endpoint state


Details

    Description

      When adding a new node to a cluster, many existing nodes report the error below:

      java.lang.NullPointerException: null
          at o.a.cassandra.gms.Gossiper.getHostId(Gossiper.java:1378)
          at o.a.cassandra.gms.Gossiper.getHostId(Gossiper.java:1373)
          at o.a.c.service.StorageService.handleStateBootstrap(StorageService.java:3088)
          at o.a.c.service.StorageService.onChange(StorageService.java:2783)
          at o.a.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1851)
          at o.a.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1816)
          at o.a.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1749)
          at o.a.c.g.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:81)
          at o.a.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:79)
          at o.a.cassandra.net.InboundSink.accept(InboundSink.java:98)
          at o.a.cassandra.net.InboundSink.accept(InboundSink.java:46)
          at o.a.c.n.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
          at o.a.c.c.ExecutionFailure$1.run(ExecutionFailure.java:133)
          at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at i.n.u.c.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:829)
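The trace bottoms out in Gossiper.getHostId, which dereferences the HOST_ID application state without a null check. A minimal sketch of that failure mode, using illustrative stand-in classes rather than Cassandra's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Simplified sketch of why a getHostId-style lookup NPEs when the HOST_ID
// application state is missing from an endpoint's gossip state.
// These names are stand-ins for illustration, not Cassandra's real classes.
public class HostIdNpeSketch {
    // Maps application-state name (e.g. "HOST_ID", "DC") to its raw value.
    static Map<String, String> endpointState = new HashMap<>();

    // Mirrors the shape of the failing call: the HOST_ID value is
    // dereferenced with no null check.
    static UUID getHostId() {
        String raw = endpointState.get("HOST_ID"); // null if the state was lost
        return UUID.fromString(raw);               // NullPointerException here
    }

    static boolean throwsNpeWithoutHostId() {
        try {
            getHostId();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

Once the HOST_ID state is present again, the same lookup succeeds, which matches the observation that only endpoints whose state was stripped hit the NPE.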

      After some investigation, we found that the existing nodes of the cluster had removed the new node as a fat client. The reason is that the new node is busy with gossip and its gossip queue has a lot of tasks piling up. The gossip state for the new node on an existing host is:

      /1.1.1.1
        generation:1727479926
        heartbeat:25
        LOAD:20:31174.0
        SCHEMA:16:59adb24e-f3cd-3e02-97f0-5b395827453f
        DC:12:dc1
        RACK:14:0
        RELEASE_VERSION:5:4.1.3
        NET_VERSION:1:12
        HOST_ID:2:b9cc4587-68f5-4bb6-a933-fd0c77a064dc
        INTERNAL_ADDRESS_AND_PORT:8:1.1.1.1:7000
        NATIVE_ADDRESS_AND_PORT:3:1.1.1.1:9042
        SSTABLE_VERSIONS:6:big-nb
        TOKENS: not present 

      Later, this endpoint is removed from the gossip endpoint state map because it is treated as a fat client:

      FatClient /1.1.1.1:7000 has been silent for 30000ms, removing from gossip 
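The eviction rule behind this log line can be sketched as follows; the predicate name and the 30000 ms timeout here are assumptions read off the log message, not Cassandra's exact implementation:

```java
// Minimal sketch of fat-client eviction: an endpoint that owns no tokens
// (i.e. is not a ring member) and has been silent longer than the timeout
// is dropped from the gossip endpoint-state map entirely.
public class FatClientSketch {
    // Assumed from the log line "has been silent for 30000ms".
    static final long FAT_CLIENT_TIMEOUT_MS = 30_000;

    static boolean shouldEvict(boolean hasTokens, long silentForMs) {
        return !hasTokens && silentForMs > FAT_CLIENT_TIMEOUT_MS;
    }
}
```

Because the joining node has not yet published TOKENS (see the state dump above), it satisfies the "no tokens" condition and is evicted even though it is a real node mid-bootstrap.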

      But before it is removed from gossip, an existing node may have sent a gossip SYNC message to the new node, asking for the new node's gossip info with heartbeat version larger than 20 in this example.

      The new node's gossip queue has too many tasks to process, so it cannot handle this request immediately. By the time it sends the gossip ACK back, the existing node has already removed the gossip info about the new node. So the gossip state will look like the following on some existing nodes:

      /1.1.1.1 
        generation:1727479926 
        heartbeat:229 
        LOAD:200:3.0 
        SCHEMA:203:59adb24e-f3cd-3e02-97f0-5b395827453f 

      All the information related to DC, Rack, and Host ID is gone.
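The reason only the high-version states survive is gossip's version-filtered delta: the ACK carries only application states newer than the highest version the requester advertised. Since the requester last saw heartbeat version 20, low-version entries such as DC:12, RACK:14, and HOST_ID:2 are never re-sent. A simplified model of that filter (not the actual Gossiper code):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of the version-filtered delta an ACK carries: only application
// states whose version exceeds the highest version the requester already
// saw are included. If the requester deleted its copy after advertising
// version 20, states stamped with lower versions (DC, RACK, HOST_ID) are
// never transmitted again. Simplified model, not Cassandra's Gossiper.
public class GossipDeltaSketch {
    // key = application-state name, value = version it was last updated at
    static Map<String, Integer> statesNewerThan(Map<String, Integer> all, int maxSeenVersion) {
        return all.entrySet().stream()
                  .filter(e -> e.getValue() > maxSeenVersion)
                  .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                            (a, b) -> a, TreeMap::new));
    }
}
```

Feeding in the versions from the first state dump with maxSeenVersion = 20 yields only LOAD and SCHEMA, matching the truncated state shown above.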

      When the new node's gossip later settles, it changes its local state to BOOT and decides its tokens. The existing node then receives the STATUS and TOKENS info, and the gossip state becomes:

      /1.1.1.1 
        generation:1727479926 
        heartbeat:329 
        LOAD:300:3.0 
        SCHEMA:303:59adb24e-f3cd-3e02-97f0-5b395827453f
        STATUS_WITH_PORT:308:BOOT,-142070360466566106
        TOKENS:309:<hidden>

      When the existing node processes this bootstrap event, it hits the NPE because HOST_ID is missing.

      This issue also creates consistency problems: in large clusters, many nodes will consider the joining node a remote-DC node if its DC info is missing.
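The misplacement can be sketched as a fallback lookup: with no DC entry in the peer's gossip state, a snitch-style resolution has to fall back to some default, which will not match the joining node's real DC. The names below are illustrative, not Cassandra's snitch API:

```java
import java.util.Map;

// Sketch of why a missing DC entry causes other nodes to classify the
// joining node as remote: datacenter resolution falls back to a default
// when the gossip state carries no DC. Illustrative names only.
public class DcFallbackSketch {
    // Hypothetical placeholder used when the DC state is absent.
    static final String DEFAULT_DC = "UNKNOWN_DC";

    static String datacenterOf(Map<String, String> gossipState) {
        return gossipState.getOrDefault("DC", DEFAULT_DC);
    }

    static boolean isRemoteDc(String localDc, Map<String, String> peerState) {
        return !localDc.equals(datacenterOf(peerState));
    }
}
```

With the stripped state from this bug, every node in dc1 would classify the joining node as remote, which is the consistency hazard described above.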

      Attachments

        1. ci_driftx_CASSANDRA-19983-5.0_128_summary.html
          1.90 MB
          Michael Semb Wever
        2. results_details_driftx_CASSANDRA-19983-5.0_128.tar.xz
          2.33 MB
          Michael Semb Wever

          People

            Runtian Liu (curlylrt)
            Brandon Williams, Michael Semb Wever
            Votes: 0
            Watchers: 3
