[CASSANDRA-19361] fix node info NPE when ClusterMetadata is null - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Cannot Reproduce
Fix Version/s: 5.x
Component/s: Tool/nodetool, Transactional Cluster Metadata
Labels: None
Platform: All
Impacts: None

Description

How

I created a cluster with 3 nodes (it worked well), then added a fourth node to join it.
When executing nodetool info, I got the following exceptions:

➜  bin ./nodetool info

java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.operationMode(StorageService.java:3744)
    at org.apache.cassandra.service.StorageService.isBootstrapFailed(StorageService.java:3810)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)

➜  bin ./nodetool info 

WARN  [InternalResponseStage:152] 2024-02-02 11:45:15,731 RemoteProcessor.java:213 - Got error from /127.0.0.4:7000: TIMEOUT when sending TCM_COMMIT_REQ, retrying on CandidateIterator{candidates=[/127.0.0.4:7000], checkLive=true}
error: null
-- StackTrace --
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.getLocalHostId(StorageService.java:1904)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:260)

Server 1 cannot run nodetool info or the CQL shell, while servers 2 and 3 can. I tried to query the system-prefixed tables; the stack error log is attached for further debugging. I could not find a way to recover. After deleting the data (losing all of it) and restarting, everything was OK. The nodetool status output at the time of the failure:

➜  bin ./nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load  Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.2  ?     16      51.2%             6d194555-f6eb-41d0-c000-000000000002  rack1
DN  127.0.0.4  ?     16      48.8%             6d194555-f6eb-41d0-c000-000000000001  rack1

When

It was introduced by the patch for CEP-21. In any case, a null check is needed to keep the NPE from propagating to callers.

Implementation of Transactional Cluster Metadata as described in CEP-21
Hash: ae084237
 
code diff:
 
     public String getLocalHostId()
     {
-        UUID id = getLocalHostUUID();
-        return id != null ? id.toString() : null;
+        return getLocalHostUUID().toString();
     }
 
     public UUID getLocalHostUUID()
     {
-        UUID id = getTokenMetadata().getHostId(FBUtilities.getBroadcastAddressAndPort());
-        if (id != null)
-            return id;
-        // this condition is to prevent accessing the tables when the node is not started yet, and in particular,
-        // when it is not going to be started at all (e.g. when running some unit tests or client tools).
-        else if ((DatabaseDescriptor.isDaemonInitialized() || DatabaseDescriptor.isToolInitialized()) && CommitLog.instance.isStarted())
-            return SystemKeyspace.getLocalHostId();
-
-        return null;
+        // Metadata collector requires using local host id, and flush of IndexInfo may race with
+        // creation and initialization of cluster metadata service. Metadata collector does accept
+        // null localhost ID values, it's just that TokenMetadata was created earlier.
+        ClusterMetadata metadata = ClusterMetadata.currentNullable();
+        if (metadata == null || metadata.directory.peerId(getBroadcastAddressAndPort()) == null)
+            return null;
+        return metadata.directory.peerId(getBroadcastAddressAndPort()).toUUID();
     }
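
The removed lines of getLocalHostId() guarded against a null UUID, while the added line calls toString() directly on a value that getLocalHostUUID() can still return as null. A minimal sketch of that guard, in isolation, might look like the following. This is not the change from the linked pull request: HostIdSketch, localHostId, and the Supplier parameter are hypothetical stand-ins for the StorageService methods, used only so the example runs on its own.

```java
import java.util.UUID;
import java.util.function.Supplier;

/** Illustrative only: HostIdSketch is a hypothetical class, not part of Cassandra. */
public class HostIdSketch
{
    /**
     * Mirrors the shape of the removed getLocalHostId() lines: tolerate a null
     * UUID instead of calling toString() on it. The supplier stands in for
     * getLocalHostUUID(), which returns null when cluster metadata is unavailable.
     */
    public static String localHostId(Supplier<UUID> localHostUuid)
    {
        UUID id = localHostUuid.get();
        return id != null ? id.toString() : null;
    }

    public static void main(String[] args)
    {
        // Cluster metadata not yet available: the lookup yields null, and the
        // caller receives null rather than a NullPointerException.
        System.out.println(localHostId(() -> null));

        // Normal case: the UUID is rendered as a string.
        UUID known = UUID.fromString("6d194555-f6eb-41d0-c000-000000000002");
        System.out.println(localHostId(() -> known));
    }
}
```

Returning null here only moves the decision to the caller; whether nodetool info should print a placeholder or fail with a clearer message is a separate design choice that this sketch does not address.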

Attachments

CASSANDRA-19361-stack-error.txt
05/Feb/24 03:37
9 kB
Ling Mao

Issue Links

links to

GitHub Pull Request #3084

People

Assignee: Ling Mao

Reporter: Ling Mao

Authors: Ling Mao

Votes: 0

Watchers: 3

Dates

Created: 02/Feb/24 08:00

Updated: 19/Jul/24 05:40

Resolved: 18/Jul/24 16:30

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m