Qpid
  1. Qpid
  2. QPID-2900

CommitRollbackTest#testGetThenRollback and CommitRollbackTest#testSend2ThenRollback fail on java.0.10 profile

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9
    • Component/s: Java Broker, Java Client
    • Labels:
      None

      Description

      org.apache.qpid.test.unit.transacted.CommitRollbackTest#testGetThenRollback
      org.apache.qpid.test.unit.transacted.CommitRollbackTest#testSend2ThenRollback

      These two tests fail very regularly (but not always) on the java.0.10 test profile:

      TestName: testGetThenRollback Duration: 6.882
      test message was consumed and rolled back, but is gone

      junit.framework.AssertionFailedError: test message was consumed and rolled back, but is gone
      at org.apache.qpid.test.unit.transacted.CommitRollbackTest.testGetThenRollback(CommitRollbackTest.java:273)
      at org.apache.qpid.test.utils.QpidBrokerTestCase.runBare(QpidBrokerTestCase.java:232)
      at org.apache.qpid.test.utils.QpidTestCase.run(QpidTestCase.java:120)

      TestName: testSend2ThenRollback Duration: 6.879
      Didn't receive redelivered message one

      junit.framework.AssertionFailedError: Didn't receive redelivered message one
      at org.apache.qpid.test.unit.transacted.CommitRollbackTest.verifyMessages(CommitRollbackTest.java:379)
      at org.apache.qpid.test.unit.transacted.CommitRollbackTest.verifyMessages(CommitRollbackTest.java:415)
      at org.apache.qpid.test.unit.transacted.CommitRollbackTest.testSend2ThenRollback(CommitRollbackTest.java:357)
      at org.apache.qpid.test.utils.QpidBrokerTestCase.runBare(QpidBrokerTestCase.java:232)
      at org.apache.qpid.test.utils.QpidTestCase.run(QpidTestCase.java:120)

        Activity

        Hide
        Robbie Gemmell added a comment -

        Deferring to 0.9

        Show
        Robbie Gemmell added a comment - Deferring to 0.9
        Hide
        Keith Wall added a comment -

        I've traced the problem down to a race condition - broker side - within SimpleAMQQueue.

        The nub of the problem is the update of the atomic QueueContext._releasedEntry. There is a circumstance where _releasedEntry is overwritten prematurely and therefore the broker fails to resend the message to the client.

        Here's how it happens:

        In the cases where the tests fail, the thread (SubFlushRunner or QueueRunner) executing SimpleAMQQueue.deliverMessage() is yielded part way through the method. The thread has successfully sent the message to the client (MESSAGE_TRANSFER etc), but the JVM yields the thread before it reaches setLastSeen(). (SetLastSeen is responsible for advancing the lastSeen pointer, and resetting _releasedEntry if necessary).

        Meanwhile, the client receives the message, calls JMS rollback, which causes MESSASE_RELEASE (etc) command to be sent to the broker, before it calls JMS receive(). The broker processes the MESSAGE_RELEASE by invoking SimpleAMQ.requeue() which in turn invokes updateSubRequeueEntry() to update _releaseEntry with the QueueEntry to be resent.

        Now, the JVM wakes-up the thread (SubFlushRunner or QueueRunner) that was part way through processing SimpleAMQQueue .deliverMessage() for the initial send of the message. The setLastSeen() method is invoked, which **incorrectly** resets _releasedEntry to null because _releaseEntry now is the same as the entry being processed. The broker goes around the loop, calls getNextAvailableEntry which returns null because _releaseEntry has been prematurely cleared.

        I've illustrated the success and failure cases as sequence diagrams (attached to this JIRA).

        Show
        Keith Wall added a comment - I've traced the problem down to a race condition - broker side - within SimpleAMQQueue. The nub of the problem is the update of the atomic QueueContext._releasedEntry. There is a circumstance where _releasedEntry is overwritten prematurely and therefore the broker fails to resend the message to the client. Here's how it happens: In the cases where the tests fail, the thread (SubFlushRunner or QueueRunner) executing SimpleAMQQueue.deliverMessage() is yielded part way through the method. The thread has successfully sent the message to the client (MESSAGE_TRANSFER etc), but the JVM yields the thread before it reaches setLastSeen(). (SetLastSeen is responsible for advancing the lastSeen pointer, and resetting _releasedEntry if necessary). Meanwhile, the client receives the message, calls JMS rollback, which causes MESSASE_RELEASE (etc) command to be sent to the broker, before it calls JMS receive(). The broker processes the MESSAGE_RELEASE by invoking SimpleAMQ.requeue() which in turn invokes updateSubRequeueEntry() to update _releaseEntry with the QueueEntry to be resent. Now, the JVM wakes-up the thread (SubFlushRunner or QueueRunner) that was part way through processing SimpleAMQQueue .deliverMessage() for the initial send of the message. The setLastSeen() method is invoked, which ** incorrectly ** resets _releasedEntry to null because _releaseEntry now is the same as the entry being processed. The broker goes around the loop, calls getNextAvailableEntry which returns null because _releaseEntry has been prematurely cleared. I've illustrated the success and failure cases as sequence diagrams (attached to this JIRA).
        Hide
        Keith Wall added a comment -

        Two sequence diagrams to illustrate the issue.

        Show
        Keith Wall added a comment - Two sequence diagrams to illustrate the issue.
        Hide
        Keith Wall added a comment -

        Patch to address this JIRA.

        SimpleAMQQueue changed to eliminate race condition.

        Bolstered unit tests around the area of the change.

        Changed TestableMemoryMessageStore to avoid a NPE. Previously this was causing a NPE to be logged during tearDown of some broker unit tests (including SimpleAMQQueueTest) but not causing test failures.

        Show
        Keith Wall added a comment - Patch to address this JIRA. SimpleAMQQueue changed to eliminate race condition. Bolstered unit tests around the area of the change. Changed TestableMemoryMessageStore to avoid a NPE. Previously this was causing a NPE to be logged during tearDown of some broker unit tests (including SimpleAMQQueueTest) but not causing test failures.
        Hide
        Keith Wall added a comment -

        Hi Robbie - can you review this patch please?

        Show
        Keith Wall added a comment - Hi Robbie - can you review this patch please?
        Hide
        Robbie Gemmell added a comment -

        Patch applied.

        Show
        Robbie Gemmell added a comment - Patch applied.

          People

          • Assignee:
            Robbie Gemmell
            Reporter:
            Robbie Gemmell
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development