  1. Ignite
  2. IGNITE-10469

TcpCommunicationSpi does not break tcp connection after IdleConnectionTimeout seconds of inactivity


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.5, 2.6
    • Fix Version/s: 2.7
    • Component/s: cache
    • Labels: None

    Description

      TcpCommunicationSpi does not close TCP connections after they have been idle for longer than the time configured in TcpCommunicationSpi#idleConnTimeout (the default is 10 minutes).

      There are environments where idle TCP connections become unusable: the connections remain ESTABLISHED while data waiting to be sent piles up in Send-Q (according to netstat). As a result, the Ignite stack does not recognize the communication problem for a considerable time (roughly 10-15 minutes) and does not begin its reconnection procedure. Heartbeats use different TCP connections that are not idle and do not have this issue.

      I've discovered, though, that there is logic in the Ignite code to detect and close idle connections, but due to a problem in the code it does not work reliably.

      This is a test that sometimes reproduces the problem.
      ignite_idle_test.zip - full test project
      GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java - just test code
      2.6.0.txt - mvn clean install logs for test with Ignite 2.6.0

      What's the problem in the Ignite code?

      There are two loops in the Ignite code that have a chance to close idle connections:
      1) org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.CommunicationWorker#processIdle - this one is executed every IdleConnectionTimeout milliseconds. It can close idle connections, but it typically concludes that the connection is not idle because the second loop has already reset the idle timer.
      2) org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#bodyInternal -> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#checkIdle - this loop executes:

      filterChain.onSessionIdleTimeout(ses); // does not actually close an idle connection
      // Update timestamp to avoid multiple notifications within one timeout interval.
      ses.resetSendScheduleTime(); // resets the idle timer
      ses.bytesReceived(0);
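
      The interaction between the two loops can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Ignite code: the class IdleCheckRace, the Session stand-in, and the methods nioCheckIdle, processIdle and simulate are hypothetical names, the clock is simulated, and the NIO check is assumed to always run just before the closing check. In a real node the ordering depends on thread timing, which would be consistent with the test reproducing the problem only sometimes.

```java
/** Simplified, hypothetical model of the two idle-check loops; this is not Ignite code. */
public class IdleCheckRace {
    /** Stand-in for a NIO session that only tracks the last send-schedule time. */
    static final class Session {
        long lastSendScheduleTime;
        boolean closed;

        Session(long now) {
            lastSendScheduleTime = now;
        }

        long idleTime(long now) {
            return now - lastSendScheduleTime;
        }
    }

    /** Models the NIO worker check: it reports idleness, then resets the timer without closing. */
    static boolean nioCheckIdle(Session ses, long now, long idleTimeout) {
        if (ses.idleTime(now) >= idleTimeout) {
            // The notification itself does not close the connection.
            ses.lastSendScheduleTime = now; // Reset to avoid repeated notifications.
            return true;
        }
        return false;
    }

    /** Models the communication worker check: it closes the session if it looks idle. */
    static boolean processIdle(Session ses, long now, long idleTimeout) {
        if (!ses.closed && ses.idleTime(now) >= idleTimeout) {
            ses.closed = true;
            return true;
        }
        return false;
    }

    /**
     * Advances a simulated clock in steps of tickMillis. The closing check runs once per
     * idleTimeout; the NIO check, if enabled, runs on every tick just before it (an assumption).
     */
    static boolean simulate(boolean nioCheckEnabled, long idleTimeout, long tickMillis, long durationMillis) {
        Session ses = new Session(0);
        for (long now = tickMillis; now <= durationMillis; now += tickMillis) {
            if (nioCheckEnabled)
                nioCheckIdle(ses, now, idleTimeout);
            if (now % idleTimeout == 0)
                processIdle(ses, now, idleTimeout);
        }
        return ses.closed;
    }

    public static void main(String[] args) {
        long idleTimeout = 1_000, tick = 100, duration = 10_000;
        System.out.println("closed without NIO reset: " + simulate(false, idleTimeout, tick, duration));
        System.out.println("closed with NIO reset:    " + simulate(true, idleTimeout, tick, duration));
    }
}
```

      In this model the session is closed when only the closing check runs, but never when the NIO check resets the timer first, because the closing check then always observes an idle time of zero.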
      


      To sum up, maybe the whole approach should be reviewed:

      • is it ok not to track message delivery time?
      • is it ok not to do heartbeating using the same connections as for get/put/... commands?

      Attachments

        1. ignite_idle_test.zip
          5 kB
          Igor Kamyshnikov
        2. GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java
          7 kB
          Igor Kamyshnikov
        3. GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java
          7 kB
          Evgeny Stanilovsky
        4. 2.6.0.txt
          93 kB
          Igor Kamyshnikov

          People

            Assignee: Evgeny Stanilovsky (zstan)
            Reporter: Igor Kamyshnikov (kamyshnikov)
            Votes: 0
            Watchers: 3
