ActiveMQ / AMQ-1146

InactivityMonitor: Transport connection disconnected "Channel was inactive for too long."

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1.0
    • Fix Version/s: 5.0.0
    • Component/s: Broker
    • Labels:
      None
    • Environment:

      Windows Server 2003 running broker and Windows XP running client. Java 1.5.0_08

    • Patch Info:
      Patch Available

      Description

      Sometimes (when the transport is idle for a long time) the transport gets disconnected with the message "Channel was inactive for too long."

      Apparently this is same as reported in http://www.nabble.com/Channel-was-inactive-for-too-long-t1463069.html#a3971744 and http://www.nabble.com/Durable-consumer-reconnect-problem-tf1716817.html#a6147014

      Version: 4.1

      Background:

      The class org.apache.activemq.transport.InactivityMonitor runs monitoring threads to check reads and writes on the transport (TCP, for example). If no message has been sent during a given period, it sends a KeepAliveInfo to verify connectivity. If no message has been received within a given duration, it disconnects the transport.

      By default, the maximumInactivityDuration is 30000 milliseconds. This is the interval for the read check: if one end of the transport does not receive any message within this period, it closes the connection. The write check runs at half of maximumInactivityDuration, i.e. every 15000 milliseconds: if no message was sent during that interval, a KeepAliveInfo is sent to keep the connection alive.
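
      For reference, the timeout is negotiated per connection and can be tuned from the client side through the transport URI. Below is a minimal sketch, assuming the wireFormat.maxInactivityDuration option of the TCP transport; the broker address is a placeholder:

          import javax.jms.Connection;
          import org.apache.activemq.ActiveMQConnectionFactory;

          public class InactivityConfigExample {
              public static void main(String[] args) throws Exception {
                  // The inactivity timeout is negotiated per connection, in milliseconds.
                  ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                          "tcp://localhost:61616?wireFormat.maxInactivityDuration=30000");
                  Connection connection = factory.createConnection();
                  connection.start();
                  // ... use the connection; idle periods longer than the negotiated
                  // duration without a KeepAliveInfo will close the transport ...
                  connection.close();
              }
          }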

      Problem and the recommended fix:

      The InactivityMonitor class uses the inReceive and commandReceived flags in the readCheck and onCommand methods; typically these two methods are called from different threads. Likewise, the inSend and commandSent flags are used in the oneway and writeCheck methods, which are also called from different threads. However, access to these flags was not synchronized, which appears to be a likely cause of the problem. Synchronization has therefore been added around these flag accesses.

      The same InactivityMonitor class is used on both the client and the server side. Previously, the write check (which sends the KeepAliveInfo) ran twice for every read check. Since the client and server machines may differ in performance, to be safer the write check now runs three times for every read check. A simplified sketch of the patched checks follows.
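
      A minimal sketch of the patched checks, simplified for illustration (a single shared lock and hypothetical helper names, not the attached patch itself):

          import java.util.concurrent.atomic.AtomicBoolean;

          public class InactivityChecksSketch {
              private final Object monitor = new Object();
              private boolean commandReceived = true; // written by onCommand, read by readCheck
              private boolean commandSent = true;     // written by oneway, read by writeCheck
              private final AtomicBoolean inReceive = new AtomicBoolean();

              // Runs once per maximumInactivityDuration.
              void readCheck() {
                  synchronized (monitor) {
                      if (inReceive.get()) {
                          return; // a receive is in progress, so the channel is not idle
                      }
                      if (!commandReceived) {
                          onInactivity(); // "Channel was inactive for too long."
                      }
                      commandReceived = false; // must be set again before the next check
                  }
              }

              // With the patch this runs three times per read check instead of twice.
              void writeCheck() {
                  synchronized (monitor) {
                      if (!commandSent) {
                          sendKeepAlive(); // prompts the peer, resetting its read check
                      }
                      commandSent = false;
                  }
              }

              void onCommand() {
                  synchronized (monitor) {
                      commandReceived = true;
                  }
              }

              void oneway() {
                  synchronized (monitor) {
                      commandSent = true;
                  }
              }

              void onInactivity() { /* close the transport */ }
              void sendKeepAlive() { /* transport-specific */ }
          }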

      The original and the changed source files are attached here.

      http://www.nabble.com/file/5958/InactivityMonitor_changed.java

      http://www.nabble.com/file/5957/InactivityMonitor_Original.java

        Activity

        james strachan added a comment -

        Patch applied with thanks!

        Manuel Teira added a comment - edited

        I'm having some issues where slow performance in the backend database seems to lead to InactivityIOExceptions. My opinion is that the fix for this issue is causing those problems:

        Bearing in mind that the design of the readCheck/writeCheck monitors is that there should always be a command between two readChecks (produced either by normal activity or by the response to a KeepAlive injected by the writeCheck between two readChecks, if needed), synchronizing on the readChecker variable seems a bad idea:

        If the chain of transportListener.onCommand() calls takes longer than the negotiated MaxInactivityDuration, we end up with two readChecks executing one after the other, without a writeCheck between them. The first readCheck succeeds and sets commandReceived to false, and the second one is always going to fail, since no new command has been processed and no writeCheck has been executed since the last readCheck.

        I don't see the need to synchronize on the readChecker/writeChecker variables. These are AtomicBooleans and should be safe to access from different threads (this seemed to be the motivation of the fix). Also, synchronizing readCheck/writeCheck actually delays the execution of the check, and we can no longer guarantee the alternation of writeCheck and readCheck execution that the solution needs in order to work.

        Also, that synchronization prevents this code in readCheck() from being reached:

             
        synchronized (readChecker) {
            if (inReceive.get()) {
                LOG.trace("A receive is in progress");
                return;
            }
            // ...
        }

        because inReceive is only set during onCommand, which is also synchronized on readChecker. Furthermore, the right behaviour here is to return when an onCommand is in progress at the same time, not to wait, so as not to break the sequence of readCheck/writeCheck invocations. A sketch of that non-blocking style follows.
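
        A minimal sketch of the non-blocking check described above, assuming AtomicBoolean flags as in the snippet; the getAndSet detail is illustrative, not the actual 4.1 code:

            import java.util.concurrent.atomic.AtomicBoolean;

            public class NonBlockingReadCheckSketch {
                private final AtomicBoolean inReceive = new AtomicBoolean();
                private final AtomicBoolean commandReceived = new AtomicBoolean(true);

                void readCheck() {
                    if (inReceive.get()) {
                        // A receive is in progress: treat it as activity and return
                        // at once, keeping the readCheck/writeCheck cadence intact.
                        return;
                    }
                    // Test and reset the flag in a single atomic step.
                    if (!commandReceived.getAndSet(false)) {
                        onInactivity();
                    }
                }

                void onCommand() {
                    inReceive.set(true);
                    try {
                        // ... dispatch the command to the transport listener ...
                    } finally {
                        commandReceived.set(true);
                        inReceive.set(false);
                    }
                }

                void onInactivity() { /* close the transport, e.g. InactivityIOException */ }
            }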

        In short, I think this change breaks the InactivityMonitor, and my opinion is that it should be rolled back, also in the 4.1 subtask that I opened some time ago.

        Any comments?

        Manuel Teira added a comment -

        After discussing the problem with Rob Davies on the IRC channel, I'm reopening this issue to roll back the synchronization of readCheck/writeCheck.

        Hiram Chirino added a comment -

        Tightened up the synchronization in trunk. Now we only synchronize to prevent two concurrent sends going to the next transport, or to prevent two concurrent commands/exceptions from being delivered to the transport listener. A sketch of that narrowed locking follows.
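
        A minimal sketch of that narrowed locking; the lock and method names are assumptions for illustration, not the trunk code:

            public class NarrowedLockingSketch {
                private final Object sendLock = new Object();
                private final Object listenerLock = new Object();

                void oneway(Object command) {
                    synchronized (sendLock) {
                        // only one send at a time reaches the next transport
                        nextTransportSend(command);
                    }
                }

                void onCommand(Object command) {
                    synchronized (listenerLock) {
                        // only one command or exception at a time reaches the listener
                        listenerOnCommand(command);
                    }
                }

                void onException(Throwable error) {
                    synchronized (listenerLock) {
                        listenerOnException(error);
                    }
                }

                void nextTransportSend(Object command) { /* ... */ }
                void listenerOnCommand(Object command) { /* ... */ }
                void listenerOnException(Throwable error) { /* ... */ }
            }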


          People

          • Assignee:
            Hiram Chirino
          • Reporter:
            Mari
          • Votes:
            2
          • Watchers:
            6


              Time Tracking

              • Original Estimate: 1h
              • Remaining Estimate: 1h
              • Time Spent: Not Specified
