Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-8652

member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()

    XMLWordPrintableJSON

Details

    Description

      An application encountered the following hang in a TLS-enabled cluster.

      Let's call the cluster members ds3 -> ds1.

      ds3 sends a PutAllPRMessage to ds1 and is stuck in SocketChannel.read() waiting for the acknowledgement.

      ClusterDistributionManager.uncleanShutdown() is invoked on ds3 to shut the member down. That thread blocks trying to acquire a lock on the NioSslEngine held by the first thread (the one doing waiting for the ack to the put-all.)

      Somehow the shutdown thread must be allowed to proceed.

      Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) trying to shut down the member but it's stuck waiting for the monitor on the NioSslEngine:

      "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 tid=0x00007fdb9c011000 nid=0x2fcc waiting for monitor entry [0x00007fdb6f4b7000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804)
      	- waiting to lock <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine)
      	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350)
      	at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278)
      	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
      	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
      	at org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661)
      	- locked <0x00000000f2678cf8> (a java.util.ArrayList)
      	- locked <0x00000000f1187348> (a java.util.concurrent.ConcurrentHashMap)
      	at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487)
      	at org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644)
      	- locked <0x00000000f11867a8> (a org.apache.geode.distributed.internal.direct.DirectChannel)
      	at org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631)
      	at org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82)
      	at org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908)
      	at org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847)
      	at org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291)
      

      That thread is waiting on a lock held by this thread (in ds3) which is waiting on an acknowledgement to a PutAllPRMessage sent to ds1.

      "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 tid=0x00007fdb9c030800 nid=0x30d1 runnable [0x00007fdb732f0000]
         java.lang.Thread.State: RUNNABLE
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:192)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378)
              - locked <0x00000000f2643380> (a java.lang.Object)
              at org.apache.geode.internal.net.NioSslEngine.readAtLeast(NioSslEngine.java:330)
              at org.apache.geode.internal.tcp.MsgReader.readAtLeast(MsgReader.java:129)
              at org.apache.geode.internal.tcp.MsgReader.readHeader(MsgReader.java:58)
      ==>     - locked <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine)
              at org.apache.geode.internal.tcp.Connection.readAck(Connection.java:2652)
              at org.apache.geode.distributed.internal.direct.DirectChannel.readAcks(DirectChannel.java:392)
              at org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:342)
              at org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182)
              at org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511)
              at org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346)
              at org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083)
              at org.apache.geode.internal.cache.partitioned.PutAllPRMessage.send(PutAllPRMessage.java:201)
              at org.apache.geode.internal.cache.PartitionedRegion.tryToSendOnePutAllMessage(PartitionedRegion.java:2839)
              at org.apache.geode.internal.cache.PartitionedRegion.sendMsgByBucket(PartitionedRegion.java:2621)
              at org.apache.geode.internal.cache.PartitionedRegion.postPutAllSend(PartitionedRegion.java:2392)
              at org.apache.geode.internal.cache.LocalRegionDataView.postPutAll(LocalRegionDataView.java:361)
              at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9154)
              at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:8903)
      

      What we see is that the MsgReader in the second thread is not letting the first thread close the socket. Until the socket is closed, the second thread will be stuck in SocketChannel.read().

      But why is the second thread stuck in SocketChannelImpl.read? That may be due to GEODE-8651!

      Attachments

        Issue Links

          Activity

            People

              burcham Bill Burcham
              burcham Bill Burcham
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: