QPID-2994

Transaction atomicity violated by 'transparent' failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6, 0.7, 0.8
    • Fix Version/s: 0.10
    • Component/s: Java Client
    • Labels:
      None

      Description

      The messages published within a batch at the point the connection fails over appear to be replayed outside of any transaction.

      Steps to Reproduce:
      1. Start a transactional session on a failover-enabled connection.
      2. Send batches of messages in transactions.
      3. Kill the cluster node the client is connected to, so that failover is triggered mid-transaction.

      This happens because the lower layer replays unacked messages when the connection is resumed.
      Message replay should not happen on a transacted session, as there is no benefit in doing so.

      1. QPID-2994.patch
        4 kB
        Rajith Attapattu

        Activity

        Gordon Sim added a comment -

        We should also ensure that on failover an exception indicating the automatic rollback of the transaction is thrown, so that it can be caught and used to redo the transaction.
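
The redo pattern suggested in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `FailoverRollbackException`, `TxSession`, `sendBatch` and `FakeBroker` are hypothetical names invented here, not Qpid or JMS API. The point is that the whole batch is redone on a fresh session once the rollback is signalled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// All names here are hypothetical stand-ins, not Qpid or JMS API.
class TxRetrySketch {
    /** Hypothetical exception: failover rolled the open transaction back. */
    static class FailoverRollbackException extends RuntimeException {
        FailoverRollbackException(String message) { super(message); }
    }

    /** Minimal transacted session used only by this sketch. */
    interface TxSession {
        void send(String message);
        void commit();
    }

    /**
     * Sends the whole batch in one transaction. When the transaction is
     * rolled back by failover, the failed session is dropped and the complete
     * batch is redone on a fresh session. Returns the number of attempts used.
     */
    static int sendBatch(Supplier<TxSession> newSession, List<String> batch, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        FailoverRollbackException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TxSession session = newSession.get(); // never reuse the failed session
            try {
                for (String message : batch) session.send(message);
                session.commit();
                return attempt;
            } catch (FailoverRollbackException e) {
                last = e; // nothing from this attempt was committed; retry
            }
        }
        throw last;
    }

    /** Test double: a queue whose next `failovers` commits are rolled back. */
    static class FakeBroker {
        final List<String> queue = new ArrayList<>();
        private int failovers;

        FakeBroker(int failovers) { this.failovers = failovers; }

        TxSession newSession() {
            final List<String> pending = new ArrayList<>();
            return new TxSession() {
                @Override public void send(String message) { pending.add(message); }
                @Override public void commit() {
                    if (failovers > 0) {
                        failovers--;
                        pending.clear(); // rolled back: nothing reaches the queue
                        throw new FailoverRollbackException("failover during transaction");
                    }
                    queue.addAll(pending);
                    pending.clear();
                }
            };
        }
    }
}
```

The retry is only safe if a rolled-back transaction leaves nothing on the queue, which is exactly the atomicity property this issue is about.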

        Rajith Attapattu added a comment - edited

        The commit made in rev 1057460 uncovered a deeper issue that violates the atomicity of a transaction disrupted by failover.
        The symptom was that one or two messages seemed to get onto the queue outside of the transaction boundaries.
        Upon closer inspection, these were messages that belonged to the failed transaction. If the application retries the failed transaction, the result is duplicates, further complicating the issue.

        The underlying root cause is as follows.
        1. When a message-transfer reaches the invoke method in Session.java while the session state is detached, the thread waits until the session is OPEN or CLOSED.

        2. If failover completes within the wait period, the session is resumed and marked OPEN, and the message transfer in progress simply continues and reaches the broker.

        3. At this point the session is still not marked transactional (and there is no logic in place to ever issue a txSelect after failover), so the message is enqueued immediately.

        4. In the meantime, the JMS session used by the application learns that failover has happened, is marked dirty, and the application receives an exception.

        5. If the application chooses to keep using the session (ignoring the exception), subsequent message transfers will reach the queue on the broker, but the session will be closed once it sends a commit (or a rollback), because the broker will complain that the session is not transactional.

        6. If the application chooses to create a new session, it will send subsequent messages within transaction boundaries and work as expected. However, the queue will still hold the extra one or two messages that sneaked in when the old session was reopened. If the application retries the aborted transaction, those sneaked-in messages end up as duplicates.

        A reasonable solution to this issue is to:
        1) Close a session marked transactional immediately when the session detaches, i.e. a transactional session is never resumed and a new session must be created to continue.

        2) Document that behavior clearly.
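
The root cause and the proposed fix can be illustrated with a small single-threaded toy model. This is a sketch only: `ToySession` and its fields are hypothetical and do not mirror the real Session.java. A transfer that blocks while the session is detached is represented by an entry in `blockedTransfers`, and the `closeTransactionalOnDetach` flag switches the proposed behavior on or off.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model with hypothetical names; it is not the Qpid Session class.
class DetachPolicySketch {
    enum State { OPEN, DETACHED, CLOSED }

    static class ToySession {
        final boolean transactional;
        final boolean closeTransactionalOnDetach; // the proposed fix, switchable
        State state = State.OPEN;
        final List<String> sentWhileOpen = new ArrayList<>();     // normal transfers
        final List<String> blockedTransfers = new ArrayList<>();  // waiting while DETACHED
        final List<String> enqueuedOutsideTx = new ArrayList<>(); // leaked onto the queue

        ToySession(boolean transactional, boolean closeTransactionalOnDetach) {
            this.transactional = transactional;
            this.closeTransactionalOnDetach = closeTransactionalOnDetach;
        }

        /** The transport went away (start of failover). */
        void onDetach() {
            if (transactional && closeTransactionalOnDetach) {
                state = State.CLOSED; // never resumed; the application must create a new session
            } else {
                state = State.DETACHED;
            }
        }

        /** A message transfer issued by the application. */
        void transfer(String message) {
            switch (state) {
                case OPEN:
                    sentWhileOpen.add(message);
                    break;
                case DETACHED:
                    blockedTransfers.add(message); // stands in for the waiting thread
                    break;
                default:
                    throw new IllegalStateException("session closed by failover; create a new session");
            }
        }

        /** Failover finished; a detached session is resumed and marked OPEN. */
        void onFailoverComplete() {
            if (state != State.DETACHED) return;
            state = State.OPEN;
            // No txSelect has been issued on the resumed session, so the
            // blocked transfers are enqueued outside any transaction.
            enqueuedOutsideTx.addAll(blockedTransfers);
            blockedTransfers.clear();
        }
    }
}
```

With the flag off, a transfer caught by failover leaks onto the queue. With the flag on, the same transfer is rejected and nothing is enqueued.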

        Also, during the investigation I found a race condition where an application could create a new session (recreating one after an exception, or creating a completely new session in the midst of failover) before the connection is open.
        This results in a session attach being sent before the connection negotiation is completed. Although the connect method and the createSession method in Connection.java contend for the same lock, the connect method, which acquires it first, releases the lock while it waits (until the connection reaches the OPEN state), so the createSession method waiting on the lock acquires it and continues. This actually exposed a bug in the C++ broker; see QPID-3033.
        We need to ensure that the createSession method is not executed until the connection reaches the OPEN state. I will open a separate JIRA for this.
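
A minimal sketch of the guard described here, using plain Java monitors. `ToyConnection`, `markOpen` and the timeout parameter are assumptions made for illustration and are not taken from Connection.java. Because `wait()` releases the monitor, holding the lock alone cannot keep createSession out, so the state is checked explicitly.

```java
// Toy model with hypothetical names; it is not the Qpid Connection class.
class OpenGateSketch {
    enum State { NEW, OPEN }

    static class ToyConnection {
        private final Object lock = new Object();
        private State state = State.NEW;
        private int sessionsAttached = 0;

        /** Called once connection negotiation has completed. */
        void markOpen() {
            synchronized (lock) {
                state = State.OPEN;
                lock.notifyAll(); // wake any createSession callers
            }
        }

        /**
         * Blocks until the connection is OPEN, then "attaches" a session and
         * returns its ordinal. Fails if OPEN is not reached within the timeout.
         */
        int createSession(long timeoutMillis) {
            synchronized (lock) {
                long deadline = System.currentTimeMillis() + timeoutMillis;
                while (state != State.OPEN) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        throw new IllegalStateException("connection did not reach OPEN in time");
                    }
                    try {
                        lock.wait(remaining); // releases the lock while waiting
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting for OPEN", e);
                    }
                }
                return ++sessionsAttached; // the session attach would be sent here
            }
        }
    }
}
```

The loop around `wait` also covers spurious wakeups, since the state is rechecked every time the thread wakes.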

        (*) Another race condition: if a session is created after the connection is set up and marked OPEN, but before the resume method (in Connection.java) is called, the new session is attached a second time by resume. This could result in unnecessary duplication of messages.
        We need to ensure that the createSession method is not executed until the resume method has completed. Again, I will open a separate JIRA for this.

        Rajith Attapattu added a comment -

        The patch contains fixes for QPID-2994, QPID-3042 and QPID-3043.

        1. For QPID-2994 - If a transactional session gets detached, the session is now removed and an exception is thrown.

        2. For QPID-3042 - The createSession method now waits until the connection state == OPEN before it issues a session attach.

        3. For QPID-3043 - A failover-lock is now used to ensure that session.create does not proceed while the 'resume' method is in progress. However, we should also consider the race condition where sessionCreate is called before session resume has even started. For that case this fix is incomplete.
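
The failover-lock idea, and the gap that remains, can be sketched as follows. Everything in this block is a hypothetical model (`ToyConnection`, `onReconnected`, `attachCount`) and not the code in the patch. It counts how many attaches each session sends on the current transport, so a count above one means a duplicate attach.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Toy model with hypothetical names; it is not the code from the patch.
class FailoverLockSketch {
    static class ToyConnection {
        private final ReentrantLock failoverLock = new ReentrantLock();
        // session name -> attaches sent on the current transport
        private final Map<String, Integer> attaches = new LinkedHashMap<>();

        /** A new transport is up and OPEN; nothing is attached on it yet. */
        void onReconnected() {
            failoverLock.lock();
            try {
                attaches.replaceAll((name, count) -> 0);
            } finally {
                failoverLock.unlock();
            }
        }

        /** Reattaches every known session; createSession cannot interleave. */
        void resume() {
            failoverLock.lock();
            try {
                attaches.replaceAll((name, count) -> count + 1);
            } finally {
                failoverLock.unlock();
            }
        }

        /** Creates a session and sends its first attach. */
        void createSession(String name) {
            failoverLock.lock();
            try {
                attaches.put(name, 1);
            } finally {
                failoverLock.unlock();
            }
        }

        int attachCount(String name) {
            failoverLock.lock();
            try {
                return attaches.getOrDefault(name, 0);
            } finally {
                failoverLock.unlock();
            }
        }
    }
}
```

In this model the lock keeps createSession and resume from interleaving, but a session created after `onReconnected()` and before `resume()` still ends up with two attaches, which is the incomplete case noted above.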

        Rajith Attapattu added a comment -

        We do throw an exception indicating that failover has happened and the session is closed.
        However, I think we need to state explicitly that the transaction has been rolled back.

        We also need to document:
        1. Which exceptions destroy a session.
        2. Which exceptions can be handled while continuing with the session.

        Rajith Attapattu added a comment -

        Fixed along with the related issues QPID-3042 and QPID-3043.
        Tested manually and added a test case to the test harness in testkit.py.


          People

          • Assignee:
            Rajith Attapattu
            Reporter:
            Rajith Attapattu