Qpid / QPID-3803

System tests SortedQueueTest#test<ACK MODE>SortedQueue fail occasionally on slow machines

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15
    • Fix Version/s: 0.15
    • Component/s: Java Tests
    • Labels: None

      Description

      Testcase: testClientAckSortedQueue took 9.829 sec
      FAILED
      Incorrect number of messages received expected:<200> but was:<42>
      junit.framework.AssertionFailedError: Incorrect number of messages received expected:<200> but was:<42>
      at org.apache.qpid.server.queue.SortedQueueTest.runThroughSortedQueueForSessionMode(SortedQueueTest.java:154)
      at org.apache.qpid.server.queue.SortedQueueTest.testClientAckSortedQueue(SortedQueueTest.java:122)
      at org.apache.qpid.test.utils.QpidBrokerTestCase.runBare(QpidBrokerTestCase.java:237)
      at org.apache.qpid.test.utils.QpidTestCase.run(QpidTestCase.java:138)

        Activity

        Alex Rudyy added a comment -

        Attached a patch with the fix

        Alex Rudyy added a comment -

        Robbie,

        Could you please review and commit the patch?

        Robbie Gemmell added a comment -

        Looking at the changes I would say the increase from 5 to 50 seconds would be better wrapped in an 'if persistent else' so that the extra time is only allotted on persistent runs (which is where the problem occurred due to slow disks), as I think taking over 5 seconds for 200 transient messages is something we would want to know about.

        In the changes to the while loop, it now appears that the loop can spin while failing to receive messages within the 1 second timeout. I would probably remove that possibility and just allow more time for an individual receive to succeed (again probably selectively allocating a longer time on persistent runs, e.g. 3 seconds, as it would be interesting to know that it took longer than a second on a transient run).
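
        For illustration, a minimal sketch of the kind of loop change being suggested; the helper name, the isPersistent flag and the timeout values below are assumptions, not taken from the attached patch:

        import javax.jms.JMSException;
        import javax.jms.Message;
        import javax.jms.MessageConsumer;

        // Sketch only: isPersistent would be derived from the test profile; the
        // test's existing per-message ordering assertions would go inside the loop.
        private int drainQueue(final MessageConsumer consumer, final boolean isPersistent)
                throws JMSException
        {
            // Longer per-receive timeout on persistent runs (slow disks); keep the
            // transient timeout tight so an unexpectedly slow transient run still fails.
            final long receiveTimeout = isPersistent ? 3000L : 1000L;

            int received = 0;
            Message message;
            // A null return now means the full timeout genuinely expired, so the
            // loop cannot spin while repeatedly failing to receive messages.
            while ((message = consumer.receive(receiveTimeout)) != null)
            {
                received++;
            }
            return received;
        }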

        Alex Rudyy added a comment -

        Uploaded a new patch

        Alex Rudyy added a comment -

        Robbie,

        Could you please apply an updated patch?

        I only changed the receive and join intervals.
        Different values are used for these intervals depending on whether the broker is persistent or transient.
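
        As a rough illustration of the shape of that change (the method names, the profile flag and the constant values are assumptions, not copied from the patch):

        // Sketch only: profile-dependent intervals, with illustrative values.
        private long getReceiveInterval(final boolean isBrokerPersistent)
        {
            return isBrokerPersistent ? 3000L : 1000L;
        }

        private long getJoinInterval(final boolean isBrokerPersistent)
        {
            return isBrokerPersistent ? 50000L : 5000L;
        }

        // The consumer thread would then be waited on with the profile-dependent
        // interval, e.g. consumerThread.join(getJoinInterval(isBrokerPersistent));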

        Robbie Gemmell added a comment -

        Patch applied.

        Keith Wall added a comment -

        We are still seeing the occasional failure on one CI box (on bdb profiles).

        Looking at the logs from failing runs, I can see that the consumer did complete within the allowed time (9.6s in the case in front of me), but the assertion about the number of messages consumed still failed. This can only be caused by a publication problem with the _consumed member: the test observes _consumed in a thread-safe manner (on the main thread), but the update is not thread-safe (on the consumer thread).
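
        One conventional way to make that hand-off safe is an AtomicInteger (a volatile int would also do); the sketch below only mirrors the field name used in the description above and is not the actual patch:

        import java.util.concurrent.atomic.AtomicInteger;

        public class ConsumedCounter
        {
            // AtomicInteger gives the consumer thread's increments a happens-before
            // relationship with the main thread's read after it joins the consumer.
            private final AtomicInteger _consumed = new AtomicInteger();

            public void messageConsumed()   // called on the consumer thread
            {
                _consumed.incrementAndGet();
            }

            public int getConsumed()        // read by the main thread for the assertion
            {
                return _consumed.get();
            }
        }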

        Keith Wall added a comment -

        Patch applied.

        Robbie Gemmell added a comment -

        This failed yet again on the slow machine in question. Looking over the logs showed that the publisher sent the 200 messages but did not complete doing so until well after the consumer had apparently failed the test. This suggests the consumer timed out on receive() while waiting for the publisher (which is writing to a slow RAID6 array). This is probably partly the fault of the consumer being started before the publisher, which is then always trying to keep up, although I believe it is the intent of the test for publication and consumption to occur in parallel so as to exercise the SortedQueue more intensively than a fill-then-drain would, so this is somewhat necessary. I have increased all the receive timeouts when running on persistent profiles in an attempt to compensate for such publication shortfalls.
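
        A sketch of the parallel publish/consume shape described here, with a deliberately generous receive timeout on persistent runs to ride out publisher stalls; all names and values are illustrative, not from the actual test:

        // Assumed to run inside a test method that declares throws Exception.
        // Publisher runs concurrently with the consumer, as the test intends.
        final Thread publisherThread = new Thread(new Runnable()
        {
            public void run()
            {
                publishTestMessages(200);   // placeholder for the test's publisher
            }
        });
        publisherThread.start();

        // A stalled publisher (slow disks) should not make the consumer give up early.
        final long receiveTimeout = isBrokerPersistent ? 10000L : 1000L;

        int consumed = 0;
        Message message;
        while (consumed < 200 && (message = consumer.receive(receiveTimeout)) != null)
        {
            consumed++;
        }
        publisherThread.join();
        assertEquals("Incorrect number of messages received", 200, consumed);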

        Robbie Gemmell added a comment -

        The tests don't seem to have failed since the last change, closing out.


          People

          • Assignee: Robbie Gemmell
          • Reporter: Alex Rudyy
          • Votes: 0
          • Watchers: 0
