Qpid
  1. Qpid
  2. QPID-3259

Deadlock on Java client side while closing session when topic operation is unauthorized

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Java Client
    • Labels:
    • Environment:
    1. QPID-3259-SampleClient
      1 kB
      Danushka Menikkumbura
    2. QPID-3259-ThreadDump
      6 kB
      Danushka Menikkumbura

      Activity

      Hide
      Keith Wall added a comment -

      Could you update this JIRA with the version number(s) of the broker/client that are affected by the problem? Would you also have a test case that demonstrates the defect?

      Kind regards, Keith.

      Show
      Keith Wall added a comment - Could you update this JIRA with the version number(s) of the broker/client that are affected by the problem? Would you also have a test case that demonstrates the defect? Kind regards, Keith.
      Hide
      Rajith Attapattu added a comment -

      From what you described, I am reasonably sure that this is a known issue - see QPID-3214
      This is already fixed on trunk.

      Can you please post the trace from a thread dump so we know its the same issue.
      If it is, then I will close this as a duplicate of QPID-3214.

      Show
      Rajith Attapattu added a comment - From what you described, I am reasonably sure that this is a known issue - see QPID-3214 This is already fixed on trunk. Can you please post the trace from a thread dump so we know its the same issue. If it is, then I will close this as a duplicate of QPID-3214 .
      Hide
      Rajith Attapattu added a comment -

      "The other issue in this case is that AMQException.isHardError always returns true and hence the connection tried to close all sessions inside exceptionReceived method. I think there is something wrong here as an unauthorized operation in one session should not lead to closing all other sessions."

      I also have mixed feelings about the above and have raised that issue on the dev list.
      However after speaking to Robert Godfrey I agree that in some cases the only way to know a session error is if it's notified via the Connection Listener.
      (In JMS there is no way to report a session exception if the client is in a passive mode, ex using a message listener).

      Therefore the alternative is to notify such an error via the Connection Listener. In the case where multiple sessions are used, this does have the unintended result of all sessions for that connection being closed. But in a case where there is only one session, a session exception is as fatal as a connection level exception. If this goes unreported when a client is say using a Message listener, the client maybe waiting for messages that will never arrive bcos the underlying session is closed. So considering all these facts I believe that the connection being closed is a necessary inconvenience to avoid a more serious issue.

      Show
      Rajith Attapattu added a comment - "The other issue in this case is that AMQException.isHardError always returns true and hence the connection tried to close all sessions inside exceptionReceived method. I think there is something wrong here as an unauthorized operation in one session should not lead to closing all other sessions." I also have mixed feelings about the above and have raised that issue on the dev list. However after speaking to Robert Godfrey I agree that in some cases the only way to know a session error is if it's notified via the Connection Listener. (In JMS there is no way to report a session exception if the client is in a passive mode, ex using a message listener). Therefore the alternative is to notify such an error via the Connection Listener. In the case where multiple sessions are used, this does have the unintended result of all sessions for that connection being closed. But in a case where there is only one session, a session exception is as fatal as a connection level exception. If this goes unreported when a client is say using a Message listener, the client maybe waiting for messages that will never arrive bcos the underlying session is closed. So considering all these facts I believe that the connection being closed is a necessary inconvenience to avoid a more serious issue.
      Hide
      Danushka Menikkumbura added a comment -

      Hi Keith,

      Please find the attached thread dump (QPID-3259-ThreadDump) and sample client (QPID-3259-SampleClient). You can reproduce this issue easily with the latest trunk (rev. 1124557) using the sample client programme. You need to set permission so that the user guest can not publish to topic MyTopic. I think you can simply do this (a shortcut ) by changing the execution path in ServerSessionDelegate.java(307) to trigger an exception even when the operation is authorized.

      Thanks,
      Danushka

      Show
      Danushka Menikkumbura added a comment - Hi Keith, Please find the attached thread dump ( QPID-3259 -ThreadDump) and sample client ( QPID-3259 -SampleClient). You can reproduce this issue easily with the latest trunk (rev. 1124557) using the sample client programme. You need to set permission so that the user guest can not publish to topic MyTopic. I think you can simply do this (a shortcut ) by changing the execution path in ServerSessionDelegate.java(307) to trigger an exception even when the operation is authorized. Thanks, Danushka
      Hide
      Danushka Menikkumbura added a comment -

      Hi Rajith,

      I think we need to fix AMQException.isHardError() so that it correctly figures out the level of severity rather than returning "true" all the time so that we know when to close all sessions and eventually the connection.

      The normal usecase is that we create one connection and multiple sessions out of it. Therefore an unauthorized operation in one of them should not cause others to die.

      Thanks,
      Danushka

      Show
      Danushka Menikkumbura added a comment - Hi Rajith, I think we need to fix AMQException.isHardError() so that it correctly figures out the level of severity rather than returning "true" all the time so that we know when to close all sessions and eventually the connection. The normal usecase is that we create one connection and multiple sessions out of it. Therefore an unauthorized operation in one of them should not cause others to die. Thanks, Danushka
      Hide
      Rajith Attapattu added a comment -

      Danushka see QPID-3234 for the reasons why isHardError is implemented the way it is.
      If you feel strongly about it, you could continue the discussion there.

      Show
      Rajith Attapattu added a comment - Danushka see QPID-3234 for the reasons why isHardError is implemented the way it is. If you feel strongly about it, you could continue the discussion there.
      Hide
      Rajith Attapattu added a comment -

      This is not a deadlock, all though it does give the illusion of one. If you let it sit for a while, you will see a timeout like the following.
      "Caused by: org.apache.qpid.AMQException: timed out waiting for sync: complete = 0, point = 4 [error code 541: internal error]"

      (There is also a race condition here which will determine if this quasi deadlock happens or not - which I have described under reproducibility).

      Root Cause
      -------------
      This quasi deadlock happens due to the following.
      When the queue capacity exceeds the broker closes the session and issues an Execution Exception, which results in the client trying to notify it via "AMQConnection.exceptionReceived". Inside this method it needs to acquire the "FailoverMutex" to proceed.

      Meanwhile the test code has invoked session.close() which has already obtained the FailoverMutex, and has issued a session close and has called sync() waiting for the broker to respond. The broker will not respond bcos that session is already closed.

      Meanwhile the IO Thread which drives the setting of the exception is blocked waiting for the FailoverMutex (in side "AMQConnection.exceptionReceived).

      This gives the illusion of a deadlock. Eventually the sync call times out and you will see the above error message.

      Reproducibility
      ----------------
      As mentioned at the beginning there is a race condition which determines if this issue happens or not.
      If the execution exception is notified before the session close is invoked this will not happen.

      Ex. If you put in a delay in the test code (say Thread.sleep(1000) before session.close()) you will see that this issue does not arise.

      Fix


      It's not trivial to fix this issue. As mentioned previously the FailoverMutex (which I honestly believes is the root cause of all evil) is used to protect many operations. Simply rearranging the code to side step the issue could result in unexpected behaviour.

      Both failover and error handling logic may need some tweaking to fix these issues properly. I suspect there maybe more issues like this.

      Show
      Rajith Attapattu added a comment - This is not a deadlock, all though it does give the illusion of one. If you let it sit for a while, you will see a timeout like the following. "Caused by: org.apache.qpid.AMQException: timed out waiting for sync: complete = 0, point = 4 [error code 541: internal error] " (There is also a race condition here which will determine if this quasi deadlock happens or not - which I have described under reproducibility). Root Cause ------------- This quasi deadlock happens due to the following. When the queue capacity exceeds the broker closes the session and issues an Execution Exception, which results in the client trying to notify it via "AMQConnection.exceptionReceived". Inside this method it needs to acquire the "FailoverMutex" to proceed. Meanwhile the test code has invoked session.close() which has already obtained the FailoverMutex, and has issued a session close and has called sync() waiting for the broker to respond. The broker will not respond bcos that session is already closed. Meanwhile the IO Thread which drives the setting of the exception is blocked waiting for the FailoverMutex (in side "AMQConnection.exceptionReceived). This gives the illusion of a deadlock. Eventually the sync call times out and you will see the above error message. Reproducibility ---------------- As mentioned at the beginning there is a race condition which determines if this issue happens or not. If the execution exception is notified before the session close is invoked this will not happen. Ex. If you put in a delay in the test code (say Thread.sleep(1000) before session.close()) you will see that this issue does not arise. Fix It's not trivial to fix this issue. As mentioned previously the FailoverMutex (which I honestly believes is the root cause of all evil) is used to protect many operations. Simply rearranging the code to side step the issue could result in unexpected behaviour. Both failover and error handling logic may need some tweaking to fix these issues properly. I suspect there maybe more issues like this.
      Hide
      Danushka Menikkumbura added a comment -

      Hi Rajith,

      This is actually a deadlock, session lock times out after some time though.

      Sadly Qpid is full of such fundamental issue. I just unveiled a similar issue when I tried to close session after sending large number of messages in a row.

      I think we need to get into hackathon mode and fix them. Qpid is a great product but if you try to use it in real production environment you will end up with all sorts of weird behaviours.

      Danushka

      Show
      Danushka Menikkumbura added a comment - Hi Rajith, This is actually a deadlock, session lock times out after some time though. Sadly Qpid is full of such fundamental issue. I just unveiled a similar issue when I tried to close session after sending large number of messages in a row. I think we need to get into hackathon mode and fix them. Qpid is a great product but if you try to use it in real production environment you will end up with all sorts of weird behaviours. Danushka
      Hide
      Rajith Attapattu added a comment -

      IMO a deadlock is a condition where the threads involved cannot recover from it (i.e blocked forever).
      In this case there is a timeout, which will eventually release the lock.

      Technicalities aside, I believe the way to fix this is to untangle the error handling code.
      This can happen in any operation that executes synchronously and wait for an error.

      If we invoke an operation synchronously we always look for an error and then try to notify it to the connection as well.
      On the other hand we also try to notify it directly to the connection.
      These two code paths will compete with each other for the locks along the way (most notably the failover mutex).

      IMO if we get an error during a synchronous operation, we should just throw a JMS Exception that says the session is closed and not worry about reporting it to the connection (bcos the application already knows about it). This will also prevent the undesirable effect of closing all the other sessions for this connection even when there is a clear way of notifying the session is closed.

      If this error occurred while being in a passive mode (Ex using a message listener), then we should definitely notify via the exception listener.

      For a connection exception we do have a separate code path to take care of notifying the connection, so the above proposed change will not have an effect.

      Show
      Rajith Attapattu added a comment - IMO a deadlock is a condition where the threads involved cannot recover from it (i.e blocked forever). In this case there is a timeout, which will eventually release the lock. Technicalities aside, I believe the way to fix this is to untangle the error handling code. This can happen in any operation that executes synchronously and wait for an error. If we invoke an operation synchronously we always look for an error and then try to notify it to the connection as well. On the other hand we also try to notify it directly to the connection. These two code paths will compete with each other for the locks along the way (most notably the failover mutex). IMO if we get an error during a synchronous operation, we should just throw a JMS Exception that says the session is closed and not worry about reporting it to the connection (bcos the application already knows about it). This will also prevent the undesirable effect of closing all the other sessions for this connection even when there is a clear way of notifying the session is closed. If this error occurred while being in a passive mode (Ex using a message listener), then we should definitely notify via the exception listener. For a connection exception we do have a separate code path to take care of notifying the connection, so the above proposed change will not have an effect.

        People

        • Assignee:
          Rajith Attapattu
          Reporter:
          Danushka Menikkumbura
        • Votes:
          0 Vote for this issue
          Watchers:
          0 Start watching this issue

          Dates

          • Created:
            Updated:

            Development