Details
Description
In usecases with several active thin clients producing noticable load to cluster new thin clients can fail to connect with the "ClientConnectionException: Channel is closed" error in the TcpClientChannel::handshake() method.
On server side warning "Unable to perform handshake within timeout [timeout=10000" is logged.
The problem can be easily reproduced by several large putAlls invoked in parallel from several or single thin client. Espesially for the TRANSACTIONAL caches. But ATOMIC caches are also affected - the only difference is that for ATOMIC caches more parallelizm factor and larger batches for putAlls are needed.
The reason of the problem is a fact that a single queue is used in the ignite node to serve all thin client related requests (queue in the GridThinClientExecutor). Both working requests like putAll and control ones like handshake which are used for connection establishment.
Working requests can live indefenitely in a queue (or at least longer then handshake). On the other hand a special watchdog timer is scheduled in the ignite node to check if a handshake request is processed in a timely manner (ClientListenerNioListener::scheduleHandshakeTimeout). By default the 10 seconds timeout is used (ClientConnectorConfiguration::handshakeTimeout). If timeout expires the client session is closed forcibly.
So, if one or several thin clients fill queue with long operations new clients can not connect.
The real usecase reveals the problem is as follows.
- 4 nodes cluster, 64 cpu, 32Gb heap, 512Gb off-heap each
- One TRANSACTIONAL cache with backups=1
- About 30Gb of data on each node
- Several (upto 75 at the same time) thin clients loading data using putAlls in 50000 records batches. Client connects, loads 3 batches and disconnects (spark jobs in fact). In other words clients connect and disconnect constantly.
- Default handshakeTimeout (10 secs) and threadPoolSize(8) in ClientConnectorConfiguration
Two ducktests were created to reproduce and isolate the problem (see the ignitetest/tests/thin_client_concurrency in the attached pull request).
Note that asserts in testcases done in a way that test PASS means that problem IS reproduced. Tests check that at least one thin client fails with the "Channel is closed" error and that server node log contains warning about the handshake timeout.
ThinClientConcurrencyTest::test_many_thin_clients
Mimics the above real life usecase. Several thin client processes invoke putAlls in several threads. There are two sets of parameters - one for TRANSACTIONAL and one for ATOMIC cache.
ThinClientConcurrencyTest::test_one_greedy_thin_client
Minimal test shows that a single thin client can produce such a load that another one can not connect.
On the attached metrics screenshot the behaviour of the test is shown in details:
1. The first thin client invoked and started data load with putAlls in several threads
2. The second thin client is invoked once the queue is filled
3. After 10 seconds the session of the second client is closed
4. Executor takes the handshake request from queue and (erroneously?) increases the client_connector_AcceptedSessions metric (note that _ActiveSessions _ wasn't increased).
test_one_greedy_thin_client.png
The following full stack trace is logged on the client side:
org.apache.ignite.client.ClientConnectionException: Channel is closed at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339) ~[classes/:?] at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181) ~[ignite-ducktests-2. 14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:138) [ignite-ducktests-2.14.0-S NAPSHOT.jar:2.14.0-SNAPSHOT] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:111) [ignite-ducktests-2.14.0-S NAPSHOT.jar:2.14.0-SNAPSHOT] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_322] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322] Caused by: org.apache.ignite.IgniteCheckedException: Channel is closed at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7863) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:591) ~[classes/:?] ... 18 more Caused by: org.apache.ignite.client.ClientConnectionException: Channel is closed at org.apache.ignite.internal.client.thin.TcpClientChannel.close(TcpClientChannel.java:213) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.onDisconnected(TcpClientChannel.java:199) ~[classes/:?] at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientConnection.onDisconnected(GridNioClientConnection.java:95) ~[classes/:?] at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientListener.onDisconnected(GridNioClientListener.java:42) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionClosed(GridNioFilterChain.java:256) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionClosed(GridNioCodecFilter.java:71) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3690) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2877) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2794) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1189) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2508) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2273) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910) ~[classes/:?] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[classes/:?] ... 1 more
Attachments
Attachments
Issue Links
- supercedes
-
IGNITE-15169 Client mass reconnection test
- Patch Available
- links to