Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-9402

Automatic Reconnect Failure: Address already in use

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: membership
    • Labels:
      None

      Description

      There are 2 locators and 4 servers during the test, once they're all up and running the test drops the network connectivity between all members to generate a full network partition and cause all members to shutdown and go into reconnect mode. Upon reaching the mentioned state, the test automatically restores the network connectivity and expects all members to automatically go up again and re-form the distributed system.
      This works fine most of the time, and we see every member successfully reconnecting to the distributed system:

      [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> tid=0x87] Reconnect completed.
      
      [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> tid=0x86] Reconnect completed.
      
      [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x94] Reconnect completed.
      
      [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x96] Reconnect completed.
      
      [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> tid=0x97] Reconnect completed.
      
      [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> tid=0x95] Reconnect completed.
      

      In some rare occasions, though, one of the servers fails during the reconnect phase with the following exception:

      [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
      org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
      	at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
      	at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
      	at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
      	at org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
      	at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.net.BindException: Address already in use (Bind failed)
      	at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
      	at java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
      	at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
      	at org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
      	at org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
      	at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
      	at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
      	at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
      	... 14 more
      

      It seems that the server is trying to bind the port before the old instance has finished shutting down and cleaning up resources, causing the reconnect process to halt and stop re-trying, and leaving the cluster with one less member.

      We've been able to reproduce the problem only twice in the past few weeks, I've attached the two set of artefacts to the ticket:

      • cluster_logs_pks_121: the member that throws the BindException during reconnect is gemfire-cluster-server-1.
      • cluster_logs_gke_latest_54: the member that throws the BindException during reconnect is gemfire-cluster-server-0.

        Attachments

        1. cluster_logs_gke_latest_54.zip
          791 kB
          Juan Ramos
        2. cluster_logs_pks_121.zip
          788 kB
          Juan Ramos

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jjramos Juan Ramos
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: