Qpid
  1. Qpid
  2. QPID-1871

During Rollback Client Rejects Message after sending TxRollback

    Details

      Description

      Summary:
      See QPID-1864 for annotated log output.

      The log output is from a run with the Java broker, but highlights that the client dispatcher thread is not synchronized with the main thread during rollback.
      As a result the main thread sends the TxRollback before the Dispatcher has sent its Reject message. This results, on the java broker, in the unrejected message being redelivered, which may be out of order depending on what other messages have been released on the message queue.

      If we are to continue to rely on the dispatcher thread rejecting/releasing the message it is currently processing (i.e. the message that is neither in the _queue preDispatchQueue nor the _synchronousQueue for receiver delivery) then we will need to synchronize with the main thread's rollback/recover calls so that the dispatcher can finish processing its message before the rollback/recover completes.

      The message that the Dispatcher thread has can be seen in AMQSession L:2866:dispatchMessage().
      On Rollback we stop the dispatcher (L2763) which can result in the dispatcher thread stopping on L2877 and holding on the the message it is in the middle of delivery. More likely during recover the dispatcher will block on the lock L2870.

      When the dispatcher is restarted (L2792) it is then free to reject its message. However, the thread that restarted the dispatcher's next call is to send the rollback command(L1553) Which is where the race condition occurs.

      Potential Fix:
      Message Rejection should be performed BEFORE we stop the dispatcher.
      On L:2825 we remove the message from the _queue (preDispatchQueue) and then potentailly sit on the message L:2877 when we get stopped.

      If the reject call in L:2888 was before the wait then we could reject the message rather than sit on it.

      Note: Now that I look at this a bit more the rollback (L2754) code looks to be over synchronized. I'm not sure the dispatcher will actually ever stop on the wait L2877 during rollback as the dispatcher is stopped and started again inside the one syncronisation which would prevent the dispatcher getting to the wait. So will more likely block on the sync L2870
      Moving the setConnectionStopped calls out of the sync block along and ensuring that the _rollbackMark is updated before the connection is stopped then we should ok.

      1. AckBug.txt
        898 kB
        Martin Ritchie
      2. RollbackOrderTest.patch
        5 kB
        Martin Ritchie

        Activity

        Hide
        Martin Ritchie added a comment -

        SImply waiting for the dispatcher thread to finish processing sounds good but things become more complicated if you are calling rollback from the dispatcher thread.. such as from onMessage.

        Show
        Martin Ritchie added a comment - SImply waiting for the dispatcher thread to finish processing sounds good but things become more complicated if you are calling rollback from the dispatcher thread.. such as from onMessage.
        Hide
        Martin Ritchie added a comment -

        What I was thinking at the end there was that simply signallying to the dispatcher to stop is not enough. This is partially the problem we have just now.

        We need to know that it as finished processing and hit the wait L2877 before we can safely complete rollback.

        I think the best thing to do would be to push all the rollback code in to the dispatcher and have the dispatcher thread process the rollback. As the Rollback has session wide effect it makes sence to me that the dispatcher thread is the right place to do this work. Otherwise we have to ensure that we have stopped the dispatcher before proceeding. The logic around that stopping will become complex when we have to handle the onMessage case.
        Making the dispatcher thread responsible should make the locking far simpler. It should also be clearer that a call from a different thread should request rollback from the dispatcher and block whilst the dispatcher thread (from onMessage) can perform the rollback.

        To ensure message order is preserved we will need to have a method in to the Transport Layer to know that all buffers have been purged. This will then allow the dispatcher to reject all the messages with confidence that it will not miss one.

        The dispatcher will know when it has procesed all the messages on the session and can safely send the TxRollback.

        Show
        Martin Ritchie added a comment - What I was thinking at the end there was that simply signallying to the dispatcher to stop is not enough. This is partially the problem we have just now. We need to know that it as finished processing and hit the wait L2877 before we can safely complete rollback. I think the best thing to do would be to push all the rollback code in to the dispatcher and have the dispatcher thread process the rollback. As the Rollback has session wide effect it makes sence to me that the dispatcher thread is the right place to do this work. Otherwise we have to ensure that we have stopped the dispatcher before proceeding. The logic around that stopping will become complex when we have to handle the onMessage case. Making the dispatcher thread responsible should make the locking far simpler. It should also be clearer that a call from a different thread should request rollback from the dispatcher and block whilst the dispatcher thread (from onMessage) can perform the rollback. To ensure message order is preserved we will need to have a method in to the Transport Layer to know that all buffers have been purged. This will then allow the dispatcher to reject all the messages with confidence that it will not miss one. The dispatcher will know when it has procesed all the messages on the session and can safely send the TxRollback.
        Hide
        Martin Ritchie added a comment - - edited

        === QPID-1871 Spec ===

        During Rollback Client Rejects Message after sending TxRollback

        Problem:

        Broker does not obey Flow=false when the client calls TxRollback. In the TxRollbackHandler we rollback
        any TxOps then resend any unacknowledged messages.

        The question is who is wrong here.

        • Is the client wrong in requsting TxRollback whilst in a Flowed state.
        • Is the broker wrong for sending messages when the client has requested flow.false.

        Either way the client is erroneously requesting TxRollback before it has fully verified
        that it no longer has any messages in the session.

        The problem occurs here as the Dispatcher has the ability to hold on to a message so when the rollback
        process is beleived to have completed the Dispatcher then rejects the final message AFTER the TxRollback
        so that one message gets sent ahead of the other messages. The reject is dropped as the message has been
        resent.

        Analysis of current process:

        session.rollback()

        0-8 : Client Side

        • Suspend the Channel
        • for ClientAck && Tx
        • Reject all the delivered messages, Reject (requeue=true)
        • if we have a dispatcher (.rollback)
        • verify connection is stopped
        • loops all active consumers
        • rolling them back, if they are Consumers
            • Consumer Rollback
        • clearing their recieve Queue, if they are Browsers
        • If we are in a Transacted Session we hae recorded all the consumers that have been closed.
        • we then rollback and remove each of these recorded consumers from the list.
            • Consumer Rollback
        • restore the connection state(i.e. if we were stopped stay stopped)
        • Send Rollback
        • Mark Session Clean
        • Unsuspend the channel

        Consumer Rollback

        • Not sure why the recorded sessions would need to rollback as the first check in this method is
          to check if there are any messages pending for this consumer.
        • All deregistered consumers will have an empty queue... I would have thought.
        • The contents of the synchronouseQueue are then processed.
        • Any AbstractJMSMessages are rejected (requeue=true)
        • Any thing else is dropped from the queue.

        Java Broker Side

        • all TxOps are rollback
        • TxRollbackOk is sent then
        • channel.resend is called which resends all currently unacked messages

        Other Issues:
        Java doc wrong: rollback does NOT commit all messages!


        Solution Approach:

        I believe there are two possibilities.

        1) If we do not reject the delivered messages then the channel.resend process will resend all the messages
        back to the client. The order of receipt will be identical to the first time.

        2) We can correct the dispatcher flaw so that all messages are correctly rejected so the resend will have
        nothing to do. This means that the order of received messages may change but will be in line with the JMS API.

        • Expired messages will expire
        • Priority messages will be expodited.
        • Also those messages will be made available to other consumers of that queue (Ordering on that client is not defined by JMS)

        Whilst 1 is simpist I beleive performing 2 is the correct thing to do.

        Show
        Martin Ritchie added a comment - - edited === QPID-1871 Spec === During Rollback Client Rejects Message after sending TxRollback Problem: Broker does not obey Flow=false when the client calls TxRollback. In the TxRollbackHandler we rollback any TxOps then resend any unacknowledged messages. The question is who is wrong here. Is the client wrong in requsting TxRollback whilst in a Flowed state. Is the broker wrong for sending messages when the client has requested flow.false. Either way the client is erroneously requesting TxRollback before it has fully verified that it no longer has any messages in the session. The problem occurs here as the Dispatcher has the ability to hold on to a message so when the rollback process is beleived to have completed the Dispatcher then rejects the final message AFTER the TxRollback so that one message gets sent ahead of the other messages. The reject is dropped as the message has been resent. Analysis of current process: session.rollback() 0-8 : Client Side Suspend the Channel for ClientAck && Tx Reject all the delivered messages, Reject (requeue=true) if we have a dispatcher (.rollback) verify connection is stopped loops all active consumers rolling them back, if they are Consumers Consumer Rollback clearing their recieve Queue, if they are Browsers If we are in a Transacted Session we hae recorded all the consumers that have been closed. we then rollback and remove each of these recorded consumers from the list. Consumer Rollback restore the connection state(i.e. if we were stopped stay stopped) Send Rollback Mark Session Clean Unsuspend the channel — Consumer Rollback Not sure why the recorded sessions would need to rollback as the first check in this method is to check if there are any messages pending for this consumer. All deregistered consumers will have an empty queue... I would have thought. The contents of the synchronouseQueue are then processed. Any AbstractJMSMessages are rejected (requeue=true) Any thing else is dropped from the queue. Java Broker Side all TxOps are rollback TxRollbackOk is sent then channel.resend is called which resends all currently unacked messages Other Issues: Java doc wrong: rollback does NOT commit all messages! Solution Approach: I believe there are two possibilities. 1) If we do not reject the delivered messages then the channel.resend process will resend all the messages back to the client. The order of receipt will be identical to the first time. 2) We can correct the dispatcher flaw so that all messages are correctly rejected so the resend will have nothing to do. This means that the order of received messages may change but will be in line with the JMS API. Expired messages will expire Priority messages will be expodited. Also those messages will be made available to other consumers of that queue (Ordering on that client is not defined by JMS) Whilst 1 is simpist I beleive performing 2 is the correct thing to do.
        Hide
        Martin Ritchie added a comment -

        In answering one of the questions above about who is 'wrong'. Nobody. The client must request TxRollback whilst suspsended or it will continue to receive new messages on the queue. The broker is also correct in sending the messages even though the channel is suspended as if the channel is not suspended then new messages on the queue will be received ahead of the redelivered messages.

        Show
        Martin Ritchie added a comment - In answering one of the questions above about who is 'wrong'. Nobody. The client must request TxRollback whilst suspsended or it will continue to receive new messages on the queue. The broker is also correct in sending the messages even though the channel is suspended as if the channel is not suspended then new messages on the queue will be received ahead of the redelivered messages.
        Hide
        Martin Ritchie added a comment -

        Running the modified RollbackOrderTest has highlighted another race condition bug.

        The error the broker reports is that the client is acking an unknown delivery tag. Which suggests that the client has somehow still got a hold of a message from before the rollback, which it then acks but as the unacked map has been cleared(individually rejected, then redelivered) the deilveryTag is no longer valid.

        On the revised version of the RollbackOrderTest this occurs regularly.

        Show
        Martin Ritchie added a comment - Running the modified RollbackOrderTest has highlighted another race condition bug. The error the broker reports is that the client is acking an unknown delivery tag. Which suggests that the client has somehow still got a hold of a message from before the rollback, which it then acks but as the unacked map has been cleared(individually rejected, then redelivered) the deilveryTag is no longer valid. On the revised version of the RollbackOrderTest this occurs regularly.
        Hide
        Martin Ritchie added a comment -

        Current patch in local test.

        Show
        Martin Ritchie added a comment - Current patch in local test.
        Show
        Martin Ritchie added a comment - Change design posted here: http://cwiki.apache.org/confluence/display/qpid/0.6+Java+Client+Dispatcher+Changes
        Hide
        Robbie Gemmell added a comment -

        Closing issue out as part of JIRA cleanup. Issue may already be resolved, may be invalid, or may never be fixed. See QPID-3469 for further details.

        Show
        Robbie Gemmell added a comment - Closing issue out as part of JIRA cleanup. Issue may already be resolved, may be invalid, or may never be fixed. See QPID-3469 for further details.

          People

          • Assignee:
            Unassigned
            Reporter:
            Martin Ritchie
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development