Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-16843

Timeout while thin client connection

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • Docs Required, Release Notes Required

    Description

      In usecases with several active thin clients producing noticable load to cluster new thin clients can fail to connect with the "ClientConnectionException: Channel is closed" error in the TcpClientChannel::handshake() method.

      On server side warning "Unable to perform handshake within timeout [timeout=10000" is logged.

      The problem can be easily reproduced by several large putAlls invoked in parallel from several or single thin client.  Espesially for the TRANSACTIONAL caches.  But ATOMIC caches are also affected - the only difference is that for ATOMIC caches more parallelizm factor and larger batches for putAlls are needed.


      The reason of the problem is a fact that a single queue is used in the ignite node to serve all thin client related requests (queue in the GridThinClientExecutor). Both working requests like putAll and control ones like handshake which are used for connection establishment. 

      Working requests can live indefenitely in a queue (or at least longer then handshake).  On the other hand a special watchdog timer is scheduled in the ignite node to check if a handshake request is processed in a timely manner (ClientListenerNioListener::scheduleHandshakeTimeout).  By default the 10 seconds timeout is used (ClientConnectorConfiguration::handshakeTimeout). If timeout expires the client session is closed forcibly.

      So, if one or several thin clients fill queue with long operations new clients can not connect. 


      The real usecase reveals the problem is as follows.

      • 4 nodes cluster, 64 cpu, 32Gb heap, 512Gb off-heap  each
      • One TRANSACTIONAL cache with backups=1
      • About 30Gb of data on each node
      • Several (upto 75 at the same time) thin clients loading data using putAlls in 50000 records batches. Client connects, loads 3 batches and disconnects (spark jobs in fact). In other words clients connect and disconnect constantly.
      • Default handshakeTimeout (10 secs) and threadPoolSize(8) in ClientConnectorConfiguration

      Two ducktests were created to reproduce and isolate  the problem (see the ignitetest/tests/thin_client_concurrency in the attached pull request).

      Note that asserts in testcases done in a way that test PASS means that problem IS reproduced.  Tests check that at least one thin client fails with the "Channel is closed" error and that server node log contains warning about the handshake timeout.

      ThinClientConcurrencyTest::test_many_thin_clients

      Mimics the above real life usecase. Several thin client processes invoke putAlls in several threads. There are two sets of parameters - one for TRANSACTIONAL and one for ATOMIC cache.

      ThinClientConcurrencyTest::test_one_greedy_thin_client

      Minimal test shows that a single thin client can produce such a load that another one can not connect.

      On the attached metrics screenshot the behaviour of the test is shown in details:

      1. The first thin client invoked and started data load with putAlls in several threads
      2. The second thin client is invoked once the queue is filled
      3. After 10 seconds the session of the second client is closed
      4. Executor takes the handshake request from queue and (erroneously?) increases the client_connector_AcceptedSessions metric (note that _ActiveSessions _ wasn't increased).

      test_one_greedy_thin_client.png

       


      The following full stack trace is logged on the client side:

      org.apache.ignite.client.ClientConnectionException: Channel is closed
          at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339) ~[classes/:?]
          at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?]
          at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181) ~[ignite-ducktests-2.
      14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT]
          at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:138) [ignite-ducktests-2.14.0-S
      NAPSHOT.jar:2.14.0-SNAPSHOT]
          at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:111) [ignite-ducktests-2.14.0-S
      NAPSHOT.jar:2.14.0-SNAPSHOT]
          at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_322]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
          at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
      Caused by: org.apache.ignite.IgniteCheckedException: Channel is closed
          at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7863) ~[classes/:?]
          at org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260) ~[classes/:?]
          at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191) ~[classes/:?]
          at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:591) ~[classes/:?]
          ... 18 more
      Caused by: org.apache.ignite.client.ClientConnectionException: Channel is closed
          at org.apache.ignite.internal.client.thin.TcpClientChannel.close(TcpClientChannel.java:213) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.TcpClientChannel.onDisconnected(TcpClientChannel.java:199) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientConnection.onDisconnected(GridNioClientConnection.java:95) ~[classes/:?]
          at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientListener.onDisconnected(GridNioClientListener.java:42) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionClosed(GridNioFilterChain.java:256) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionClosed(GridNioCodecFilter.java:71) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3690) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2877) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2794) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1189) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2508) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2273) ~[classes/:?]
          at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910) ~[classes/:?]
          at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[classes/:?]
          ... 1 more
      

      Attachments

        1. test_one_greedy_thin_client.png
          200 kB
          Sergey Korotkov

        Issue Links

          Activity

            People

              Unassigned Unassigned
              serge.korotkov Sergey Korotkov
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 10m
                  10m