GEODE-9000

NPE During Reconnect After Network Split

Details

    Description

      During a full network split in which all members are shut down by the partition, one of the servers continually fails to reconnect due to a NullPointerException. With persistent regions, this also prevents the remaining members from starting up correctly, as they might be waiting for the stuck member to recover the latest data.
      The issue was introduced by the fix for GEODE-8901: the new implementation of GMSJoinLeave.processNetworkPartitionMessage runs during the reconnect phase, when no currentView is installed yet (getView() == null), and the following is shown in the logs:

      [fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected exception while booting membership services
      java.lang.NullPointerException
      	at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
      	at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
      	at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
      	at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
      	at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
      	at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      
      [error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected problem starting up membership services
      java.lang.NullPointerException
      	at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
      	at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
      	at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
      	at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
      	at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
      	at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      
      [warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Caught SystemConnectException in reconnect
      org.apache.geode.SystemConnectException: Problem starting up membership services: null.  Consult log file for more details
      	at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
      	at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      
      [info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect attempt
      

      The above keeps happening during further reconnect attempts and the server member can't re-join the distributed system.
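
The failure mode described above, and one possible way to guard against it, can be sketched as follows. This is a minimal illustration under stated assumptions, not Geode's code and not necessarily the fix that was applied: `View`, `JoinLeave`, `processUnguarded`, and `processGuarded` are hypothetical stand-ins, and the message handling is reduced to a membership check. The only detail taken from the report is that `getView()` can return null while a reconnect is in progress.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class ReconnectViewGuardSketch {

  /** Hypothetical stand-in for a membership view; not Geode's real view class. */
  static final class View {
    private final List<String> members;

    View(List<String> members) {
      this.members = members;
    }

    boolean contains(String member) {
      return members.contains(member);
    }
  }

  /** Hypothetical stand-in for the join/leave service. */
  static final class JoinLeave {
    // Stays null until a view is installed, as during the reconnect phase.
    private final AtomicReference<View> currentView = new AtomicReference<>();

    View getView() {
      return currentView.get();
    }

    void installView(View view) {
      currentView.set(view);
    }

    /** Dereferences the view directly; throws NullPointerException when no view is installed. */
    boolean processUnguarded(String sender) {
      return getView().contains(sender);
    }

    /** Reads the view once and ignores the message when no view is installed. */
    boolean processGuarded(String sender) {
      View view = getView();
      if (view == null) {
        return false; // assumption: a message arriving before any view exists can be ignored
      }
      return view.contains(sender);
    }
  }

  public static void main(String[] args) {
    JoinLeave joinLeave = new JoinLeave();

    // No view installed yet: the guarded path returns instead of throwing.
    System.out.println("guarded, no view: " + joinLeave.processGuarded("server-0"));

    try {
      joinLeave.processUnguarded("server-0");
    } catch (NullPointerException e) {
      System.out.println("unguarded, no view: NullPointerException");
    }

    // Once a view is installed, the guarded path behaves like the unguarded one.
    joinLeave.installView(new View(List.of("server-0")));
    System.out.println("guarded, with view: " + joinLeave.processGuarded("server-0"));
  }
}
```

Reading the view into a local variable before the null check matters here: calling `getView()` twice would leave a window in which another thread changes the view between the check and the use.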

            People

              echobravo Ernest Burghardt
              jjramos Juan Ramos