Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.14.0
Description
During a full network split in which all members are shut down by the partition, one of the servers continually fails to reconnect due to a NullPointerException. When using persistent regions, this also prevents the remaining members from starting up correctly, as they might be waiting for the stuck member to recover the latest data.
The issue was introduced by the fix for GEODE-8901: when the new implementation of GMSJoinLeave.processNetworkPartitionMessage runs during the reconnect phase, no currentView is installed yet (getView() == null), and the following is shown in the logs:
[fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected exception while booting membership services
java.lang.NullPointerException
    at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
    at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
    at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
    at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
    at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
    at java.base/java.lang.Thread.run(Thread.java:834)

[error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected problem starting up membership services
java.lang.NullPointerException
    at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
    at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
    at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
    at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
    at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
    at java.base/java.lang.Thread.run(Thread.java:834)

[warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Caught SystemConnectException in reconnect
org.apache.geode.SystemConnectException: Problem starting up membership services: null. Consult log file for more details
    at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
    at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
    at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
    at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
    at java.base/java.lang.Thread.run(Thread.java:834)

[info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect attempt
The above keeps happening during further reconnect attempts and the server member can't re-join the distributed system.
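The failure mode described above (a network-partition message being handled before any membership view has been installed) can be illustrated with a minimal, self-contained Java sketch. This is not the Geode source and not necessarily the fix that was applied: the class names PartitionMessageSketch, JoinLeave and View, the method names, and the assumption that the handler consults the view's member list are all hypothetical, chosen only to show why a null view makes the handler throw and how a null guard avoids it.

```java
import java.util.List;

public class PartitionMessageSketch {

    /** Hypothetical stand-in for a membership view (not Geode's view class). */
    static final class View {
        final int viewId;
        final List<String> members;

        View(int viewId, List<String> members) {
            this.viewId = viewId;
            this.members = members;
        }
    }

    /** Hypothetical stand-in for the join/leave service. */
    static final class JoinLeave {
        // Stays null until a view is installed, e.g. while a member is reconnecting.
        private volatile View currentView;

        View getView() {
            return currentView;
        }

        void installView(View view) {
            currentView = view;
        }

        /** Unguarded handler: dereferences the view and throws a
         *  NullPointerException when no view has been installed yet. */
        boolean processPartitionMessageUnguarded(String sender) {
            return getView().members.contains(sender);
        }

        /** Guarded handler: reads the view once and ignores the message
         *  when no view is installed. Returns true only if the message
         *  was accepted (sender is a member of the current view). */
        boolean processPartitionMessage(String sender) {
            View view = getView(); // single read avoids a race with installView
            if (view == null) {
                return false; // assumed policy: drop the message during reconnect
            }
            return view.members.contains(sender);
        }
    }

    public static void main(String[] args) {
        JoinLeave joinLeave = new JoinLeave();

        // Reconnect phase: no view yet, so the guarded handler ignores the message.
        System.out.println("accepted before view: "
            + joinLeave.processPartitionMessage("server-1"));

        try {
            joinLeave.processPartitionMessageUnguarded("server-1");
        } catch (NullPointerException e) {
            System.out.println("unguarded handler threw NullPointerException");
        }

        // Once a view is installed, messages from view members are accepted.
        joinLeave.installView(new View(1, List.of("server-0", "server-1")));
        System.out.println("accepted after view: "
            + joinLeave.processPartitionMessage("server-1"));
    }
}
```

Whether a message that arrives before a view is installed should be dropped, queued, or handled some other way is a design decision for the membership layer; the sketch simply drops it.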
Issue Links
- is caused by GEODE-8901 Surviving side server forcefully disconnected after network drop (Closed)
- links to