Cassandra / CASSANDRA-19361

fix node info NPE when ClusterMetadata is null


Details


    Description

      How

       
      I created a cluster with 3 nodes (it worked well), then added a fourth node to join the cluster.
      When I executed nodetool info, I got the following exception:

      ➜  bin ./nodetool info
      
      java.lang.NullPointerException
          at org.apache.cassandra.service.StorageService.operationMode(StorageService.java:3744)
          at org.apache.cassandra.service.StorageService.isBootstrapFailed(StorageService.java:3810)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
      
      ➜  bin ./nodetool info 
      
      WARN  [InternalResponseStage:152] 2024-02-02 11:45:15,731 RemoteProcessor.java:213 - Got error from /127.0.0.4:7000: TIMEOUT when sending TCM_COMMIT_REQ, retrying on CandidateIterator{candidates=[/127.0.0.4:7000], checkLive=true}
      error: null
      -- StackTrace --
      java.lang.NullPointerException
          at org.apache.cassandra.service.StorageService.getLocalHostId(StorageService.java:1904)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
          at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:260)

      Server 1 cannot run nodetool info or open a CQL shell, while servers 2 and 3 can. I also tried querying the system-prefixed tables and attached the resulting error stack log for further debugging. I could not find a way to recover: only after deleting the data (losing all of it) and restarting did everything work again.

      ➜  bin ./nodetool status
      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address    Load  Tokens  Owns (effective)  Host ID                               Rack
      UN  127.0.0.2  ?     16      51.2%             6d194555-f6eb-41d0-c000-000000000002  rack1
      DN  127.0.0.4  ?     16      48.8%             6d194555-f6eb-41d0-c000-000000000001  rack1

      When

       
      The regression was introduced by the CEP-21 patch. Regardless of the root cause, a null check is needed to keep the NPE from propagating to callers.

      Implementation of Transactional Cluster Metadata as described in CEP-21
      Hash: ae084237
       
      code diff:
       
          public String getLocalHostId()
           {
      -        UUID id = getLocalHostUUID();
      -        return id != null ? id.toString() : null;
      +        return getLocalHostUUID().toString();
           }
       
           public UUID getLocalHostUUID()
           {
      -        UUID id = getTokenMetadata().getHostId(FBUtilities.getBroadcastAddressAndPort());
      -        if (id != null)
      -            return id;
      -        // this condition is to prevent accessing the tables when the node is not started yet, and in particular,
      -        // when it is not going to be started at all (e.g. when running some unit tests or client tools).
      -        else if ((DatabaseDescriptor.isDaemonInitialized() || DatabaseDescriptor.isToolInitialized()) && CommitLog.instance.isStarted())
      -            return SystemKeyspace.getLocalHostId();
      -
      -        return null;
      +        // Metadata collector requires using local host id, and flush of IndexInfo may race with
      +        // creation and initialization of cluster metadata service. Metadata collector does accept
      +        // null localhost ID values, it's just that TokenMetadata was created earlier.
      +        ClusterMetadata metadata = ClusterMetadata.currentNullable();
      +        if (metadata == null || metadata.directory.peerId(getBroadcastAddressAndPort()) == null)
      +            return null;
      +        return metadata.directory.peerId(getBroadcastAddressAndPort()).toUUID();
           } 
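A minimal sketch of the null check the description asks for, under stated assumptions: `NodeId`, `Directory`, `ClusterMetadata`, the `current` field (standing in for `ClusterMetadata.currentNullable()`) and the `BROADCAST_ADDRESS` constant are hypothetical, simplified stand-ins, not Cassandra's real classes or API. The sketch shows `getLocalHostId()` restoring the pre-CEP-21 contract of returning `null` instead of throwing when cluster metadata is not yet initialized or the local node is not yet registered in the directory.

```java
import java.util.Map;
import java.util.UUID;

public class LocalHostIdSketch
{
    // Hypothetical stand-ins: they model only what the null checks need,
    // NOT the real org.apache.cassandra.tcm classes.
    record NodeId(UUID uuid)
    {
        UUID toUUID() { return uuid; }
    }

    record Directory(Map<String, NodeId> peers)
    {
        // Returns null when the endpoint has not been registered yet.
        NodeId peerId(String endpoint) { return peers.get(endpoint); }
    }

    record ClusterMetadata(Directory directory) {}

    // Hypothetical stand-in for ClusterMetadata.currentNullable():
    // stays null until the metadata service has been initialized.
    static volatile ClusterMetadata current = null;

    // Hypothetical local broadcast endpoint.
    static final String BROADCAST_ADDRESS = "127.0.0.4:7000";

    static UUID getLocalHostUUID()
    {
        ClusterMetadata metadata = current; // read the volatile field once
        if (metadata == null)
            return null;
        NodeId id = metadata.directory().peerId(BROADCAST_ADDRESS);
        return id == null ? null : id.toUUID();
    }

    // Null-safe: returns null instead of calling toString() on a null UUID.
    static String getLocalHostId()
    {
        UUID id = getLocalHostUUID();
        return id != null ? id.toString() : null;
    }

    public static void main(String[] args)
    {
        System.out.println(getLocalHostId()); // prints null: metadata not initialized yet
        UUID uuid = UUID.fromString("6d194555-f6eb-41d0-c000-000000000001");
        current = new ClusterMetadata(new Directory(Map.of(BROADCAST_ADDRESS, new NodeId(uuid))));
        System.out.println(getLocalHostId());
    }
}
```

With this shape, callers such as the JMX path used by nodetool info would still have to tolerate a `null` host ID while the node is joining; the sketch only guarantees that no NPE escapes `getLocalHostId()` itself.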


      People
        Ling Mao (maoling)
        Votes: 0
        Watchers: 3


      Time Tracking
        Original Estimate: Not Specified
        Remaining Estimate: 0h
        Time Spent: 10m