TcpCommunicationSpi does not close TCP connections after they have been idle for longer than the timeout configured in TcpCommunicationSpi#idleConnTimeout (default is 10 minutes).
There are environments where idle TCP connections become unusable: connections remain ESTABLISHED while the data to be sent piles up in Send-Q (according to netstat). Because of this, the Ignite stack does not recognize a communication problem for a considerable amount of time (~10-15 minutes) and does not begin its reconnection procedure (heartbeats use different TCP connections that are not idle and don't have this issue).
I've discovered, though, that there is logic in the Ignite code to detect and close idle connections. However, due to a problem in the code, it does not work reliably.
The attached test sometimes reproduces the problem:
ignite_idle_test.zip - full test project
GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java - just test code
2.6.0.txt - mvn clean install logs for test with Ignite 2.6.0
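For reference, the behavior expected from an idle timeout can be sketched roughly as follows. This is not Ignite code: IdleConnectionTracker, onActivity and collectIdle are hypothetical names used only for illustration. The sketch assumes that only real reads and writes refresh the last-activity timestamp, so a periodic check can reliably find connections that exceeded the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Ignite): tracks the last activity time per connection. */
class IdleConnectionTracker {
    /** Idle timeout in milliseconds. */
    private final long idleTimeoutMs;

    /** Connection id -> timestamp of the last real read or write. */
    private final Map<String, Long> lastActivityMs = new ConcurrentHashMap<>();

    IdleConnectionTracker(long idleTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
    }

    /** Records real traffic on a connection. Only this method refreshes the timestamp. */
    void onActivity(String connId, long nowMs) {
        lastActivityMs.put(connId, nowMs);
    }

    /**
     * Removes and returns the connections that have been idle for at least the timeout.
     * The caller is expected to close them.
     */
    List<String> collectIdle(long nowMs) {
        List<String> idle = new ArrayList<>();

        for (Map.Entry<String, Long> e : lastActivityMs.entrySet()) {
            boolean expired = nowMs - e.getValue() >= idleTimeoutMs;

            // Conditional remove: skip the entry if activity was recorded concurrently.
            if (expired && lastActivityMs.remove(e.getKey(), e.getValue()))
                idle.add(e.getKey());
        }

        return idle;
    }
}
```

Passing the current time as a parameter keeps the check deterministic in a test. The key property is that no background loop touches the timestamp unless actual data was transferred.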
What's the problem in the Ignite code?
There are two loops in the Ignite code that have a chance to close idle connections:
1) org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.CommunicationWorker#processIdle - this one is executed every IdleConnectionTimeout milliseconds. It can close idle connections, but it typically concludes that a connection is not idle because of the second loop.
2) org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#bodyInternal -> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#checkIdle - this loop executes:
To wrap up, maybe the whole approach should be reviewed:
- is it ok not to track message delivery time?
- is it ok not to do heartbeating using the same connections as for get/put/... commands?
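To make the first question more concrete, here is a minimal sketch of what tracking message delivery time could look like. It is not taken from Ignite: DeliveryTimeTracker and its methods are hypothetical, and the sketch assumes that the receiver acknowledges messages in the order they were sent. A connection is reported as stalled when its oldest unacknowledged message is older than a configured limit, regardless of whether the socket is still ESTABLISHED.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch (not Ignite code): flags a connection whose messages are not being delivered. */
class DeliveryTimeTracker {
    /** Maximum acceptable time between sending a message and receiving its acknowledgment. */
    private final long maxDeliveryMs;

    /** Send timestamps of messages that have not been acknowledged yet, oldest first. */
    private final Deque<Long> pendingSendTimesMs = new ArrayDeque<>();

    DeliveryTimeTracker(long maxDeliveryMs) {
        this.maxDeliveryMs = maxDeliveryMs;
    }

    /** Called when a message is handed to the connection. */
    synchronized void onSend(long nowMs) {
        pendingSendTimesMs.addLast(nowMs);
    }

    /** Called when an acknowledgment arrives. Assumes acknowledgments arrive in send order. */
    synchronized void onAck() {
        pendingSendTimesMs.pollFirst();
    }

    /** Returns true if the oldest unacknowledged message has exceeded the delivery limit. */
    synchronized boolean isStalled(long nowMs) {
        Long oldest = pendingSendTimesMs.peekFirst();

        return oldest != null && nowMs - oldest > maxDeliveryMs;
    }
}
```

With such a check, data piling up in Send-Q would surface as a stalled connection after maxDeliveryMs instead of after the 10-15 minutes observed above. Whether this fits the Ignite communication design is exactly the open question.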