Ignite / IGNITE-10484

Activate/deactivate cluster suite hangs sporadically

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.8

    Description

      I saw the following thread dump on TC (only relevant parts are kept):

      "exchange-worker-#10918%cache.IgniteClusterActivateDeactivateTest0%" #13121 prio=5 os_prio=0 tid=0x00007f0720137800 nid=0xbcf runnable [0x00007f0b46f66000]
         java.lang.Thread.State: RUNNABLE
      	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
      	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
      	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
      	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
      	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
      	- locked <0x00000000df6b3f88> (a java.lang.Object)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3676)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3323)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2991)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2872)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2715)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2674)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1655)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1706)
      	at org.apache.ignite.internal.processors.cluster.ClusterProcessor.sendDiagnosticMessage(ClusterProcessor.java:614)
      	at org.apache.ignite.internal.processors.cluster.ClusterProcessor.requestDiagnosticInfo(ClusterProcessor.java:556)
      	at org.apache.ignite.internal.IgniteDiagnosticPrepareContext.send(IgniteDiagnosticPrepareContext.java:131)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.dumpDebugInfo(GridCachePartitionExchangeManager.java:1914)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2914)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2721)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
      	at java.lang.Thread.run(Thread.java:748)
      
      ...
      
      "start-node-3" #13223 prio=5 os_prio=0 tid=0x00007f08a8001800 nid=0xc30 waiting on condition [0x00007f0a577f5000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
      	at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
      	at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
      	at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1099)
      	at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2040)
      	at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1732)
      	- locked <0x00000000959ae1d0> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
      	at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
      	at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
      	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:959)
      	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:900)
      	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:888)
      	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:854)
      	at org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTest.lambda$testConcurrentJoinAndActivate$4(IgniteClusterActivateDeactivateTest.java:601)
      	at org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTest$$Lambda$183/933337479.call(Unknown Source)
      	at org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:84)
      
      ...
      
      "grid-nio-worker-tcp-comm-3-#11059%cache.IgniteClusterActivateDeactivateTest5%" #13297 prio=5 os_prio=0 tid=0x00007f08f809f000 nid=0xc83 waiting on condition [0x00007f0a4688d000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x0000000095e1f5c0> (a java.util.concurrent.CountDownLatch$Sync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
      	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
      	at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7470)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.getSpiContext(TcpCommunicationSpi.java:2268)
      	at org.apache.ignite.spi.IgniteSpiAdapter.getLocalNode(IgniteSpiAdapter.java:155)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeLocalNodeId(TcpCommunicationSpi.java:4016)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.nodeIdMessage(TcpCommunicationSpi.java:4009)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$300(TcpCommunicationSpi.java:271)
      	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2.onConnected(TcpCommunicationSpi.java:415)
      	at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionOpened(GridNioFilterChain.java:251)
      	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionOpened(GridNioFilterAdapter.java:88)
      	at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionOpened(GridNioCodecFilter.java:67)
      	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionOpened(GridNioFilterAdapter.java:88)
      	at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onSessionOpened(GridConnectionBytesVerifyFilter.java:58)
      	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionOpened(GridNioFilterAdapter.java:88)
      	at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionOpened(GridNioServer.java:3503)
      	at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionOpened(GridNioFilterChain.java:139)
      	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.register(GridNioServer.java:2616)
      	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1974)
      	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1795)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
      	at java.lang.Thread.run(Thread.java:748)
      

      The hang is a cyclic wait. The joining node is waiting for the state transition, but one of the existing nodes cannot complete the exchange: it cannot establish a connection with the new node, because the new node's local node ID is not available yet (its NIO worker is parked in TcpCommunicationSpi.getSpiContext, see the last stack trace above).
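
      The wait cycle described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Ignite code: the class name, the latches and the queue standing in for the socket are all hypothetical, and it assumes that the SPI context becomes available only after the start thread stops waiting, which is how the dump reads. Timeouts are used so that the demo terminates instead of hanging.

```java
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the hang; none of these names exist in Ignite. */
class HandshakeHangSketch {
    /**
     * @param idNeedsSpiContext If true, the node ID is sent only after the SPI context latch is released.
     * @return true if the join got stuck (timed out), false if it completed.
     */
    static boolean simulate(boolean idNeedsSpiContext, long timeoutMs) throws Exception {
        UUID locNodeId = UUID.randomUUID();                  // generated before any component starts
        CountDownLatch spiCtxLatch = new CountDownLatch(1);  // released when the SPI context is initialized
        CountDownLatch exchangeDone = new CountDownLatch(1); // released when the exchange completes
        SynchronousQueue<UUID> handshake = new SynchronousQueue<>(); // stands in for the socket

        ExecutorService pool = Executors.newFixedThreadPool(3);

        try {
            // Existing node, exchange worker: blocks reading the handshake (cf. safeTcpHandshake).
            pool.submit(() -> {
                UUID rmtId = handshake.poll(timeoutMs, TimeUnit.MILLISECONDS);

                if (rmtId != null)
                    exchangeDone.countDown();

                return null;
            });

            // Joining node, NIO worker: must send its node ID on connect (cf. onConnected).
            pool.submit(() -> {
                if (idNeedsSpiContext && !spiCtxLatch.await(timeoutMs, TimeUnit.MILLISECONDS))
                    return null; // gave up waiting for the SPI context

                handshake.offer(locNodeId, timeoutMs, TimeUnit.MILLISECONDS);

                return null;
            });

            // Joining node, start thread: waits for the exchange, and only then
            // (by assumption) lets the SPI context become available.
            Future<Boolean> start = pool.submit(() -> {
                boolean done = exchangeDone.await(timeoutMs, TimeUnit.MILLISECONDS);

                spiCtxLatch.countDown();

                return done;
            });

            return !start.get();
        }
        finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stuck when ID waits for SPI context: " + simulate(true, 300));
        System.out.println("stuck when ID is available early:    " + simulate(false, 300));
    }
}
```

      In this model the first call can only finish by timing out, because each of the three threads waits on something that another one releases later; the second call completes as soon as the two workers meet on the queue.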

      A separate question is why we need to wait for SPI context initialization to obtain the local node ID at all, given that the local node ID is generated long before the components start.
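
      One way to picture the alternative raised here is an SPI that keeps the node ID in its own field as soon as the ID exists, so that reading it never touches the context latch. This is a hypothetical sketch only; the class and method names are invented for illustration, and it is not a description of TcpCommunicationSpi or of the change that resolved this issue.

```java
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical SPI-like holder; none of these methods exist in Ignite. */
class EarlyNodeIdSketch {
    /** Released once the (simulated) SPI context is initialized. */
    private final CountDownLatch ctxLatch = new CountDownLatch(1);

    /** Set as soon as the ID is generated, independently of the context. */
    private volatile UUID locNodeId;

    void onNodeIdGenerated(UUID id) {
        locNodeId = id;
    }

    void onContextInitialized() {
        ctxLatch.countDown();
    }

    /** Latch-guarded access: returns null if the context is not ready within the timeout. */
    UUID localNodeIdViaContext(long timeoutMs) throws InterruptedException {
        return ctxLatch.await(timeoutMs, TimeUnit.MILLISECONDS) ? locNodeId : null;
    }

    /** Direct access: never blocks, fails fast if the ID truly does not exist yet. */
    UUID localNodeIdDirect() {
        UUID id = locNodeId;

        if (id == null)
            throw new IllegalStateException("Local node ID has not been generated yet");

        return id;
    }
}
```

      With such a layout a handshake handler could call the direct accessor while the node is still starting, and the latch would only guard state that genuinely depends on the context.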

            People

              Assignee: Alexey Goncharuk (agoncharuk)
              Reporter: Alexey Goncharuk (agoncharuk)