QPID-3890: Improve performance by reducing the use of the Queue's message lock

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.16
    • Fix Version/s: Future
    • Component/s: C++ Broker
    • Labels: None

      Description

      The mutex that protects the Queue's message container is held longer than it needs to be (e.g. across statistics updates).

      Reducing the amount of code executed while the lock is held will improve the broker's performance in a multi-processor environment.
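
      To illustrate the idea (a generic sketch, not the actual patch; the class and member names below are invented for the example), the pattern is to hold the lock only for the container update and do statistics/management bookkeeping after releasing it:

      #include <atomic>
      #include <cstddef>
      #include <cstdint>
      #include <deque>
      #include <mutex>

      struct Message { std::size_t size; };

      class SketchQueue {
          std::mutex messageLock;                   // protects 'messages' only
          std::deque<Message> messages;
          std::atomic<uint64_t> enqueueCount{0};    // stats kept as atomics so they
          std::atomic<uint64_t> enqueueBytes{0};    // can be updated outside the lock

      public:
          void push(const Message& m) {
              {
                  std::lock_guard<std::mutex> guard(messageLock);  // narrow critical section
                  messages.push_back(m);
              }
              // statistics/management updates no longer extend the lock hold time
              enqueueCount.fetch_add(1, std::memory_order_relaxed);
              enqueueBytes.fetch_add(m.size, std::memory_order_relaxed);
          }
      };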

        Activity

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review6116
        -----------------------------------------------------------

        Ship it!

        I would still like to see a comment in Queue.h that states explicitly which member variables are covered by which locks - while it's fresh in your mind.

        • Alan
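
        (For illustration only - a hedged sketch of the kind of lock-coverage comment being requested; the wording is invented, not the actual Queue.h text:)

        // Sketch of a possible Queue.h lock-coverage comment:
        //
        //   /** messageLock protects:
        //    *    - the message container
        //    *    - the listener set (consumers waiting for messages)
        //    *    - observer notification (performed under the lock to preserve event ordering)
        //    *  Management/statistics counters are updated outside this lock.
        //    */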

        On 2012-03-19 22:22:42, Kenneth Giusti wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/
        -----------------------------------------------------------

        (Updated 2012-03-19 22:22:42)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary
        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8-core Xeon X5550 2.67GHz box. I used qpid-send and qpid-receive to generate traffic. All traffic was generated in parallel.
        All numbers are in messages/second, sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1)    tp(m/s), patched (S1 :: S2 :: R1)
        73144 :: 72524 :: 112914           91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905           91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241           83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372           83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949           85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1 :: R2)        tp(m/s), patched (S1 :: S2 :: R1 :: R2)
        52826 :: 52364 :: 52386 :: 52275             64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213             64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366             64023 :: 63832 :: 64092 :: 63748

        Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d
        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add exchange fanout dstX1
        qpid-config -b $DST_HOST bind dstX1 dstQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2
        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1)    tp(m/s), patched (S1 :: S2 :: R1)
        86150 :: 84146 :: 87665            90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469            87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455            89790 :: 88573 :: 104119

        Still TBD:
        - fix message group errors
        - verify/fix cluster

        This addresses bug qpid-3890.
        https://issues.apache.org/jira/browse/qpid-3890

        Diffs
        -----
        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1302689
        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1302689
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1302689
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1302689

        Diff: https://reviews.apache.org/r/4232/diff

        Testing
        -------
        unit test (non-cluster)
        performance tests as described above.

        Thanks,
        Kenneth

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/
        -----------------------------------------------------------

        (Updated 2012-03-19 22:22:42.524210)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Changes
        -------

        I've applied most of the review recommendations, reverting some of the changes that led to race conditions. The result is a bit slower than the original, but shows promise, especially in the multi-consumer/multi-producer case.

        Rerun of the tests with the latest changes (against latest trunk):

        Test1 - tp(m/s), S1 :: S2 :: R1
        trunk                          qpid-3890
        70907 :: 70776 :: 112450       81417 :: 75767 :: 115527
        73190 :: 71805 :: 107750       81132 :: 75067 :: 114962
        67628 :: 66340 :: 106461       75638 :: 73906 :: 110000
        64801 :: 64055 :: 104864       77083 :: 75184 :: 106368
        69654 :: 66441 :: 103480       72933 :: 72518 :: 105773

        Test2 - tp(m/s), S1 :: S2 :: R1 :: R2
        trunk                                     qpid-3890
        49921 :: 49761 :: 49724 :: 49674          60263 :: 60234 :: 60211 :: 60111
        49672 :: 49384 :: 49372 :: 49279          59781 :: 59545 :: 59770 :: 59529
        48138 :: 48107 :: 48097 :: 48056          59792 :: 59506 :: 59571 :: 59426
        48228 :: 48216 :: 48061 :: 48009          59002 :: 58847 :: 58972 :: 58779

        Test3 - tp(m/s), S1 :: S2 :: R1
        trunk                          qpid-3890
        84372 :: 84320 :: 92689        81960 :: 80475 :: 96154
        88671 :: 84633 :: 85234        83841 :: 82777 :: 86037
        81302 :: 81073 :: 79898        83740 :: 81476 :: 85572

        I'd like to put this on trunk now that we've branched - the earlier the better.

        Opinions?

        Summary
        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8-core Xeon X5550 2.67GHz box. I used qpid-send and qpid-receive to generate traffic. All traffic was generated in parallel.
        All numbers are in messages/second, sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1)    tp(m/s), patched (S1 :: S2 :: R1)
        73144 :: 72524 :: 112914           91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905           91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241           83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372           83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949           85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1 :: R2)        tp(m/s), patched (S1 :: S2 :: R1 :: R2)
        52826 :: 52364 :: 52386 :: 52275             64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213             64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366             64023 :: 63832 :: 64092 :: 63748

        Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d
        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add exchange fanout dstX1
        qpid-config -b $DST_HOST bind dstX1 dstQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2
        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        tp(m/s), trunk (S1 :: S2 :: R1)    tp(m/s), patched (S1 :: S2 :: R1)
        86150 :: 84146 :: 87665            90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469            87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455            89790 :: 88573 :: 104119

        Still TBD:

        • fix message group errors
        • verify/fix cluster

        This addresses bug qpid-3890.
        https://issues.apache.org/jira/browse/qpid-3890

        Diffs (updated)


        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1302689
        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1302689
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1302689
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1302689

        Diff: https://reviews.apache.org/r/4232/diff

        Testing
        -------

        unit test (non-cluster)
        performance tests as described above.

        Thanks,

        Kenneth

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-08 14:52:43, Alan Conway wrote:
        > /trunk/qpid/cpp/src/qpid/broker/Queue.h, line 148
        > <https://reviews.apache.org/r/4232/diff/1/?file=88779#file88779line148>
        >
        > I prefer the ScopedLock& convention over the LH convention but I guess that's cosmetic.

        Kenneth Giusti wrote:

        What is the "ScopedLock convention"? Is that when we pass the held lock as an argument? I like that approach better, since the compiler can enforce it.

        Alan Conway wrote:

        Yes. According to the diff above, your patch changed a bunch of observer functions from having a ScopedLock& argument to having LH suffixes. Did you pick up some unintended changes?

        Oops! At one point I was trying to limit the lock within the observe* call, which introduced ordering issues. I'll revert back to the original - thanks.

        • Kenneth
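
        (Illustrative aside: the two conventions discussed above can be contrasted with a generic sketch. The names are made up, and std::mutex/std::lock_guard stand in for the broker's own Mutex/ScopedLock types:)

        #include <mutex>

        class Example {
            std::mutex messageLock;

            // "LH" suffix convention: the name documents that messageLock is already
            // held, but nothing stops a caller from invoking it without the lock.
            void updateStateLH() { /* touch state guarded by messageLock */ }

            // ScopedLock& convention: the caller must pass a held lock object, so the
            // compiler enforces that a lock is in scope (though not which mutex it guards).
            void updateState(const std::lock_guard<std::mutex>&) { /* same state changes */ }

        public:
            void doWork() {
                std::lock_guard<std::mutex> l(messageLock);
                updateStateLH();   // relies on naming convention only
                updateState(l);    // enforced at compile time
            }
        };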

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5723
        -----------------------------------------------------------


        jiraposter@reviews.apache.org added a comment -

        On 2012-03-08 14:52:43, Alan Conway wrote:
        > /trunk/qpid/cpp/src/qpid/broker/Queue.h, line 148
        > <https://reviews.apache.org/r/4232/diff/1/?file=88779#file88779line148>
        >
        > I prefer the ScopedLock& convention over the LH convention but I guess that's cosmetic.

        Kenneth Giusti wrote:

        What is the "ScopedLock convention"? Is that when we pass the held lock as an argument? I like that approach better, since the compiler can enforce it.

        Yes. According to the diff above, your patch changed a bunch of observer functions from having a ScopedLock& argument to having LH suffixes. Did you pick up some unintended changes?

        • Alan

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5723
        -----------------------------------------------------------


        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:
        > This is great work.
        >
        > The most important thing on my mind, though, is how do we ensure that work like this maintains correctness?
        >
        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code to satisfy its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad it's difficult for me to inspect the code and reconstruct what rules now apply (if any).
        >
        > On another point - I see you've added the use of a RWLock - it's always good to benchmark with a plain mutex before using one of these, as they are generally higher overhead, and unless the use is "read lock mostly, write lock seldom" plain mutexes can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point, Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.

        The RWLock - hmm, that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        D'oh - sorry, replied with way too low a caffeine/blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        Gordon Sim wrote:

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues; enqueues would still be ordered correctly w.r.t. each other).

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that is incremented after pushing the message but before populating the notification set, checking that count before trying to acquire the message on the consume side, and then rechecking it before adding the consumer to the listeners if there was no message. If the number changes between the last two checks you would retry the acquisition.
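
        (Illustrative aside: a hedged sketch of the counter-based recheck described above; all class and member names are invented and do not reflect the actual broker code:)

        #include <atomic>
        #include <cstdint>
        #include <deque>
        #include <mutex>
        #include <optional>

        struct Message {};
        struct Consumer {};

        class SketchQueue {
            std::atomic<uint64_t> pushCount{0};   // bumped after each push, before notification
            std::mutex messageLock;               // protects 'messages' only
            std::mutex listenerLock;              // protects 'listeners' only
            std::deque<Message> messages;
            std::deque<Consumer*> listeners;

        public:
            void push(const Message& m) {
                { std::lock_guard<std::mutex> g(messageLock); messages.push_back(m); }
                pushCount.fetch_add(1, std::memory_order_release);
                notifyListeners();                // populate/notify outside the message lock
            }

            // Returns a message, or registers the consumer as a listener if none is available.
            std::optional<Message> consume(Consumer* c) {
                for (;;) {
                    uint64_t seen = pushCount.load(std::memory_order_acquire);
                    if (std::optional<Message> m = tryAcquire()) return m;
                    std::lock_guard<std::mutex> g(listenerLock);
                    // Recheck before blocking: if a push happened since 'seen', retry the
                    // acquisition instead of adding the consumer (avoids a lost wakeup).
                    if (pushCount.load(std::memory_order_acquire) == seen) {
                        listeners.push_back(c);
                        return std::nullopt;      // caller will be notified later
                    }
                }
            }

        private:
            std::optional<Message> tryAcquire() {
                std::lock_guard<std::mutex> g(messageLock);
                if (messages.empty()) return std::nullopt;
                Message m = messages.front();
                messages.pop_front();
                return m;
            }
            void notifyListeners() { /* wake any consumers registered in 'listeners' */ }
        };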

        Kenneth Giusti wrote:

        Re: listeners - makes sense. I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).

        Re: ordering of events as seen by observers. I want to guarantee that the order of events is preserved w.r.t. the observers - otherwise stuff will break as it stands (like flow control, which relies on correct ordering). Gordon - can you explain the acquire/dequeue reordering case? Does this patch introduce that reordering, or has that been possible all along?

        Gordon Sim wrote:

        My mistake, I misread the patch, sorry! I think you are right that the ordering is preserved.

        Gordon Sim wrote:

        re "I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with)." I would have thought that this was one of the more significant change, no? (Along with moving the mgmt stats outside the lock).

        Kenneth Giusti wrote:

        I've moved the listeners.populate() calls back under the lock. There is a performance degradation, but there's still an overall improvement:

        Funnel Test (3 worker threads)
        tp(m/s), S1 :: S2 :: R1
        trunk:                       patched:                     patch-2 (locked listeners):
        73144 :: 72524 :: 112914     91120 :: 83278 :: 133730     91857 :: 87240 :: 133878
        70299 :: 68878 :: 110905     91079 :: 87418 :: 133649     84184 :: 82794 :: 126980
        70002 :: 69882 :: 110241     83957 :: 83600 :: 131617     81846 :: 79640 :: 124192
        70523 :: 68681 :: 108372     83911 :: 82845 :: 128513     82942 :: 80433 :: 123311
        68135 :: 68022 :: 107949     85999 :: 81863 :: 125575     80533 :: 79706 :: 120317

        Quad Shared Test (1 queue, 2 senders x 2 Receivers, 4 worker threads)
        tp(m/s), S1 :: S2 :: R1 :: R2
        trunk:                                patched:                              patch-2 (locked listeners):
        52826 :: 52364 :: 52386 :: 52275      64826 :: 64780 :: 64811 :: 64638      64184 :: 64170 :: 64247 :: 63992
        51457 :: 51269 :: 51399 :: 51213      64630 :: 64157 :: 64562 :: 64157      63589 :: 63524 :: 63628 :: 63428
        50557 :: 50432 :: 50468 :: 50366      64023 :: 63832 :: 64092 :: 63748      63718 :: 63685 :: 63772 :: 63555

        Federation Funnel (worker threads = 3 upstream, 2 down)
        tp(m/s), S1 :: S2 :: R1
        trunk:                       patched:                     patch-2 (locked listeners):
        86150 :: 84146 :: 87665      90877 :: 88534 :: 108418     90953 :: 90269 :: 105194
        89188 :: 88319 :: 82469      87014 :: 86753 :: 107386     86468 :: 85530 :: 100796
        88221 :: 85298 :: 82455      89790 :: 88573 :: 104119     89169 :: 88839 :: 92910

        That's very interesting. There are only two changes I can see on the critical path for these tests:

        (i) moving the update of mgmt stats outside the lock (very sensible, as we already know that adds some overhead and the locking is not required)

        (ii) Consumer::filter() and Consumer::accept() are also moved outside the lock on the consumer side; as filter() currently always returns true, I think this boils down to the credit check in accept()

        Would be interesting to know the relative impact of these.
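        For reference, the general shape of change (i) - again just an illustrative sketch in standard C++ with invented names, not the broker's code - is to snapshot whatever the statistics need while the lock is held, then do the actual stats update after releasing it:

        #include <atomic>
        #include <cstddef>
        #include <deque>
        #include <mutex>

        struct StatsSketch {
            std::mutex messageLock;
            std::deque<int> messages;
            std::atomic<unsigned long> msgTotalEnqueues{0};   // stand-in for the mgmt counters

            void push(int msg) {
                std::size_t depth;
                {
                    std::lock_guard<std::mutex> guard(messageLock);
                    messages.push_back(msg);
                    depth = messages.size();    // snapshot taken while still protected
                }
                // the statistics update no longer serializes producers and consumers
                msgTotalEnqueues.fetch_add(1, std::memory_order_relaxed);
                (void)depth;   // a real broker would feed this to its management object
            }
        };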

        • Gordon

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        73144 :: 72524 :: 112914 91120 :: 83278 :: 133730

        70299 :: 68878 :: 110905 91079 :: 87418 :: 133649

        70002 :: 69882 :: 110241 83957 :: 83600 :: 131617

        70523 :: 68681 :: 108372 83911 :: 82845 :: 128513

        68135 :: 68022 :: 107949 85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2

        52826 :: 52364 :: 52386 :: 52275 64826 :: 64780 :: 64811 :: 64638

        51457 :: 51269 :: 51399 :: 51213 64630 :: 64157 :: 64562 :: 64157

        50557 :: 50432 :: 50468 :: 50366 64023 :: 63832 :: 64092 :: 63748

        Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        86150 :: 84146 :: 87665 90877 :: 88534 :: 108418

        89188 :: 88319 :: 82469 87014 :: 86753 :: 107386

        88221 :: 85298 :: 82455 89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind though is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code to satisfy its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of a RWLock - it's always good to benchmark with a plain mutex before using one of these as they are generally higher overhead and unless the use is "read lock mostly write lock seldom" plain mutexes can win.
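        A small illustration of the trade-off Andrew describes, using standard C++ primitives and invented names rather than the qpid::sys lock types: a reader/writer lock admits concurrent readers, but each acquisition is typically more expensive than a plain mutex, so it only pays off when reads heavily outnumber writes - hence the advice to benchmark both.

        #include <cstddef>
        #include <mutex>
        #include <shared_mutex>
        #include <vector>

        struct ObserverListSketch {
            std::shared_mutex rwLock;    // variant A: reader/writer lock
            std::mutex plainLock;        // variant B: plain mutex (benchmark both)
            std::vector<int> observers;

            std::size_t sizeViaRwLock() {
                std::shared_lock<std::shared_mutex> guard(rwLock);   // concurrent readers allowed
                return observers.size();
            }

            std::size_t sizeViaPlainLock() {
                std::lock_guard<std::mutex> guard(plainLock);        // serialized, but cheaper to take
                return observers.size();
            }
        };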

        Kenneth Giusti wrote:

        Re: locking rules - good point Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior, and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.

        The RWLock - hmm, that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        Doh - sorry, replied with way too low a caffeine/blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        Gordon Sim wrote:

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues, enqueues would still be ordered correctly wrt each other).

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that was incremented after pushing the message but before populating notification set, checking that count before trying to acquire the message on the consume side and then rechecking it before adding the consumer to listeners if there was no message. If the number changes between the last two checks you would retry the acquisition.

        Kenneth Giusti wrote:

        Re: listeners - makes sense. I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).

        Re: ordering of events as seen by observers. I want to guarantee that the order of events is preserved WRT the observers - otherwise stuff will break as it stands (like flow control, which relies on correct ordering). Gordon - can you explain the acquire/dequeue reordering case? Does this patch introduce that reordering, or has that been possible all along?

        Gordon Sim wrote:

        My mistake, I misread the patch, sorry! I think you are right that the ordering is preserved.

        Gordon Sim wrote:

        re "I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with)." I would have thought that this was one of the more significant change, no? (Along with moving the mgmt stats outside the lock).

        I've moved the listeners.populate() calls back under the lock. There is a performance degradation, but there's still an overall improvement:

        Funnel Test: (3 worker threads)

        trunk: patched: patch-2 (locked listeners):
        tp(m/s) tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1 S1 :: S2 :: R1
        73144 :: 72524 :: 112914 91120 :: 83278 :: 133730 91857 :: 87240 :: 133878
        70299 :: 68878 :: 110905 91079 :: 87418 :: 133649 84184 :: 82794 :: 126980
        70002 :: 69882 :: 110241 83957 :: 83600 :: 131617 81846 :: 79640 :: 124192
        70523 :: 68681 :: 108372 83911 :: 82845 :: 128513 82942 :: 80433 :: 123311
        68135 :: 68022 :: 107949 85999 :: 81863 :: 125575 80533 :: 79706 :: 120317

        Quad Shared Test (1 queue, 2 senders x 2 Receivers, 4 worker threads)

        trunk: patched: patch-2 (locked listeners):
        tp(m/s) tp(m/s) tp(m/s)

        S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2
        52826 :: 52364 :: 52386 :: 52275 64826 :: 64780 :: 64811 :: 64638 64184 :: 64170 :: 64247 :: 63992
        51457 :: 51269 :: 51399 :: 51213 64630 :: 64157 :: 64562 :: 64157 63589 :: 63524 :: 63628 :: 63428
        50557 :: 50432 :: 50468 :: 50366 64023 :: 63832 :: 64092 :: 63748 63718 :: 63685 :: 63772 :: 63555

        Federation Funnel (worker threads = 3 upstream, 2 down)

        trunk: patched: patch-2 (locked listeners):
        tp(m/s) tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1 S1 :: S2 :: R1
        86150 :: 84146 :: 87665 90877 :: 88534 :: 108418 90953 :: 90269 :: 105194
        89188 :: 88319 :: 82469 87014 :: 86753 :: 107386 86468 :: 85530 :: 100796
        88221 :: 85298 :: 82455 89790 :: 88573 :: 104119 89169 :: 88839 :: 92910

        • Kenneth

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        73144 :: 72524 :: 112914 91120 :: 83278 :: 133730

        70299 :: 68878 :: 110905 91079 :: 87418 :: 133649

        70002 :: 69882 :: 110241 83957 :: 83600 :: 131617

        70523 :: 68681 :: 108372 83911 :: 82845 :: 128513

        68135 :: 68022 :: 107949 85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2

        52826 :: 52364 :: 52386 :: 52275 64826 :: 64780 :: 64811 :: 64638

        51457 :: 51269 :: 51399 :: 51213 64630 :: 64157 :: 64562 :: 64157

        50557 :: 50432 :: 50468 :: 50366 64023 :: 63832 :: 64092 :: 63748

        Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        86150 :: 84146 :: 87665 90877 :: 88534 :: 108418

        89188 :: 88319 :: 82469 87014 :: 86753 :: 107386

        88221 :: 85298 :: 82455 89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind though is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code to satisfy its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of a RWLock - it's always good to benchmark with a plain mutex before using one of these as they are generally higher overhead and unless the use is "read lock mostly write lock seldom" plain mutexes can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior, and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.

        The RWLock - hmm, that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        Doh - sorry, replied with way too low a caffeine/blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        Gordon Sim wrote:

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues, enqueues would still be ordered correctly wrt each other).

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that was incremented after pushing the message but before populating notification set, checking that count before trying to acquire the message on the consume side and then rechecking it before adding the consumer to listeners if there was no message. If the number changes between the last two checks you would retry the acquisition.

        Kenneth Giusti wrote:

        Re: listeners - makes sense. I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).

        Re: ordering of events as seen by observers. I want to guarantee that the order of events is preserved WRT the observers - otherwise stuff will break as it stands (like flow control, which relies on correct ordering). Gordon - can you explain the acquire/dequeue reordering case? Does this patch introduce that reordering, or has that been possible all along?

        Gordon Sim wrote:

        My mistake, I misread the patch, sorry! I think you are right that the ordering is preserved.

        re "I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with)." I would have thought that this was one of the more significant change, no? (Along with moving the mgmt stats outside the lock).

        • Gordon

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        73144 :: 72524 :: 112914 91120 :: 83278 :: 133730

        70299 :: 68878 :: 110905 91079 :: 87418 :: 133649

        70002 :: 69882 :: 110241 83957 :: 83600 :: 131617

        70523 :: 68681 :: 108372 83911 :: 82845 :: 128513

        68135 :: 68022 :: 107949 85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2

        52826 :: 52364 :: 52386 :: 52275 64826 :: 64780 :: 64811 :: 64638

        51457 :: 51269 :: 51399 :: 51213 64630 :: 64157 :: 64562 :: 64157

        50557 :: 50432 :: 50468 :: 50366 64023 :: 63832 :: 64092 :: 63748

        Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        86150 :: 84146 :: 87665 90877 :: 88534 :: 108418

        89188 :: 88319 :: 82469 87014 :: 86753 :: 107386

        88221 :: 85298 :: 82455 89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind, though, is: how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code and satisfy ourselves of its correctness? A simple rule would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad that it's difficult for me to inspect the code and reconstruct which rules now apply (if any).

        >

        > On another point - I see you've added the use of an RWLock. It's always good to benchmark with a plain mutex before using one of these: they are generally higher overhead, and unless the usage is "read-lock mostly, write-lock seldom", a plain mutex can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point, Andrew. I made most (all?) of these changes simply by manually reviewing the existing code's locking behavior and trying to identify what could safely be moved outside the locks. I'll update this patch with comments that explicitly call out the data members & actions that must be protected by the lock.

        The RWLock - hmm, that may be an unintentional change: at first I had a separate lock for the queue observer container, which is rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        Doh - sorry, replied with way too low a caffeine/blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        Gordon Sim wrote:

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues; enqueues would still be ordered correctly with respect to each other).

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that is incremented after pushing the message but before populating the notification set. On the consume side, check that counter before trying to acquire the message, and recheck it before adding the consumer to the listeners if there was no message; if the value changed between the two checks, retry the acquisition.
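
        A minimal sketch of that counter-based retry, using std::mutex and std::atomic with stand-in types rather than the broker's Queue, QueueListeners, or sys::Mutex classes (every name below is invented for the sketch):

        #include <atomic>
        #include <cstdint>
        #include <deque>
        #include <mutex>
        #include <vector>

        struct Message { int id; };                 // stand-in, not the broker's Message
        struct Consumer { int id; };                // stand-in, not the broker's Consumer

        std::mutex messageLock;
        std::deque<Message> messages;
        std::vector<Consumer*> listeners;           // consumers parked for notification
        std::atomic<std::uint64_t> publishEpoch{0}; // the counter described above

        // Producer path: push the message, then bump the counter before the
        // notification set would be populated (notification itself omitted).
        void publish(const Message& m) {
            {
                std::lock_guard<std::mutex> l(messageLock);
                messages.push_back(m);
            }
            publishEpoch.fetch_add(1, std::memory_order_release);
        }

        // Consumer path: read the counter, try to acquire; if nothing was there,
        // re-read the counter before parking as a listener. A change between the
        // two reads means a message may have arrived, so retry the acquisition.
        bool consumeOrWait(Consumer& c, Message& out) {
            for (;;) {
                const std::uint64_t before = publishEpoch.load(std::memory_order_acquire);
                {
                    std::lock_guard<std::mutex> l(messageLock);
                    if (!messages.empty()) {
                        out = messages.front();
                        messages.pop_front();
                        return true;                // acquired a message
                    }
                }
                if (publishEpoch.load(std::memory_order_acquire) == before) {
                    std::lock_guard<std::mutex> l(messageLock);
                    listeners.push_back(&c);        // will be notified on the next publish
                    return false;                   // parked; no message yet
                }
                // counter moved: a publish raced with us, so retry the acquire
            }
        }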

        Kenneth Giusti wrote:

        Re: listeners - makes sense. I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).

        Re: ordering of events as seen by observers. I want to guarantee that the order of events is preserved WRT the observers - otherwise things will break as they stand (like flow control, which relies on correct ordering). Gordon - can you explain the acquire/dequeue reordering case? Does this patch introduce that reordering, or has it been possible all along?

        My mistake, I misread the patch, sorry! I think you are right that the ordering is preserved.

        • Gordon

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-08 11:00:25, Gordon Sim wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 957

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line957>

        >

        > Switching from the consumerLock to the messageLock may simplify the code but obviously won't improve the concurrency. The lock previously used here was dedicated to protecting an entirely distinct set of data. However, I would agree that in reality there is likely to be little gained.

        Kenneth Giusti wrote:

        This may be overkill on my part: the reason for dropping the consumerLock and using the messageLock was simply locking consistency across the observer interface - using the same lock across all observer callouts. Maybe not necessary, but I believe the observers currently assume the queue is providing the locking, and that implies using the same lock.
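
        Just to sketch what that consistency looks like (generic types here, not the broker's actual Queue or QueueObserver code): both the message-related and the consumer-related callouts run under the same lock, so an observer sees a single total order of events.

        #include <memory>
        #include <mutex>
        #include <vector>

        struct QueueObserver {                      // stand-in for the observer interface
            virtual void enqueued() = 0;
            virtual void consumerAdded() = 0;
            virtual ~QueueObserver() {}
        };

        class QueueSketch {
            std::mutex messageLock;                 // the one lock all callouts run under
            std::vector<std::shared_ptr<QueueObserver>> observers;
        public:
            void push(/* const Message& m */) {
                std::lock_guard<std::mutex> l(messageLock);
                // ... enqueue the message ...
                for (const std::shared_ptr<QueueObserver>& o : observers)
                    o->enqueued();                  // message event, under messageLock
            }
            void addConsumer(/* Consumer& c */) {
                std::lock_guard<std::mutex> l(messageLock);   // not a separate consumerLock
                // ... register the consumer ...
                for (const std::shared_ptr<QueueObserver>& o : observers)
                    o->consumerAdded();             // consumer event, same lock and ordering
            }
        };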

        Ah, I get it now. This is part of the addition of two new consumer-related observer events? If so, I think the change of lock is perfectly reasonable, but it may be worth breaking that out as a separate patch since it is new functionality.

        • Gordon

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5717
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-08 14:52:43, Alan Conway wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.h, line 148

        > <https://reviews.apache.org/r/4232/diff/1/?file=88779#file88779line148>

        >

        > I prefer the ScopedLock& convention over the LH convention but I guess that's cosmetic.

        What is the "ScopedLock convention"? Is that when we pass the held lock as an argument? I like that approach better, since the compiler can enforce it.

        On 2012-03-08 14:52:43, Alan Conway wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 537

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line537>

        >

        > why did this change from consumerLock?

        See above re: Gordon's comment.

        On 2012-03-08 14:52:43, Alan Conway wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 1414

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line1414>

        >

        > Does the snapshot have to be atomic with setting deleted = true?

        I'm going to try putting the listener snapshot back under the lock - I doubt moving it out gained us much at all, and it seems racy.
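
        Roughly why keeping the snapshot in the same critical section as the deleted flag closes the race (a generic sketch, not the patch's actual destroy/listener code):

        #include <mutex>
        #include <vector>

        struct Listener { void notifyDeleted() { /* wake the blocked consumer */ } };

        class QueueSketch {
            std::mutex messageLock;
            bool deleted = false;
            std::vector<Listener*> listeners;
        public:
            // Consumers only park themselves while the queue is still live, checked
            // under the same lock that destroy() takes.
            bool addListener(Listener& l) {
                std::lock_guard<std::mutex> g(messageLock);
                if (deleted) return false;
                listeners.push_back(&l);
                return true;
            }

            // Mark deleted and snapshot the listeners in one critical section, then
            // run the callbacks outside the lock. Snapshotting outside the lock would
            // leave a window where a consumer parks itself after the copy is taken but
            // before it can observe deleted == true, and is never woken.
            void destroy() {
                std::vector<Listener*> snapshot;
                {
                    std::lock_guard<std::mutex> g(messageLock);
                    deleted = true;
                    snapshot.swap(listeners);
                }
                for (Listener* l : snapshot) l->notifyDeleted();
            }
        };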

        • Kenneth

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5723
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5723
        -----------------------------------------------------------

        Looks good overall; some questions & comments inline.

        /trunk/qpid/cpp/src/qpid/broker/Queue.h
        <https://reviews.apache.org/r/4232/#comment12458>

        We need to make it clear what state each lock covers. My preference would be to actually make 3 structs, each with a lock and the state that lock covers. At a minimum we should indicate this with comments.
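
        One way to picture that suggestion - a hypothetical layout, not what Queue.h actually does, with each lock and the state it covers bundled into its own struct so the coverage is visible in the declaration:

        #include <cstdint>
        #include <deque>
        #include <mutex>
        #include <vector>

        struct Message {};                          // stand-ins only
        struct Consumer {};

        class QueueSketch {
            struct MessageState {                   // covered by MessageState::lock
                std::mutex lock;
                std::deque<Message> messages;
            } messageState;

            struct ConsumerState {                  // covered by ConsumerState::lock
                std::mutex lock;
                std::vector<Consumer*> consumers;
            } consumerState;

            struct StatsState {                     // covered by StatsState::lock
                std::mutex lock;
                std::uint64_t enqueues = 0;
                std::uint64_t dequeues = 0;
            } statsState;
        public:
            void push(const Message& m) {
                {
                    std::lock_guard<std::mutex> l(messageState.lock);
                    messageState.messages.push_back(m);
                }
                // stats are updated under their own lock, after the message lock is released
                std::lock_guard<std::mutex> l(statsState.lock);
                ++statsState.enqueues;
            }
        };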

        /trunk/qpid/cpp/src/qpid/broker/Queue.h
        <https://reviews.apache.org/r/4232/#comment12457>

        I prefer the ScopedLock& convention over the LH convention but I guess that's cosmetic.

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp
        <https://reviews.apache.org/r/4232/#comment12464>

        why did this change from consumerLock?

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp
        <https://reviews.apache.org/r/4232/#comment12465>

        Does the snapshot have to be atomic with setting deleted = true?

        • Alan

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind though is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code and satisfy ourselves of its correctness? A simple rule would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad that it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of a RWLock. It's always good to benchmark with a plain mutex before using one of these, as they are generally higher overhead; unless the use is "read lock mostly, write lock seldom", a plain mutex can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior, and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.
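
        One way to record the rule set Kenneth mentions is to group each data member under the lock that covers it and say so at the declaration. The class below is purely illustrative (hypothetical members, not the actual Queue.h layout), just to show the shape such annotations could take:

        // Illustrative only -- hypothetical members, not the real Queue.h.
        #include <deque>
        #include <mutex>
        #include <string>
        #include <vector>

        class Consumer;

        class AnnotatedQueue {
            // -- protected by messageLock -----------------------------------
            // Take messageLock before reading or changing any of these.
            std::mutex messageLock;
            std::deque<std::string> messages;     // messageLock
            unsigned messageCount = 0;            // messageLock

            // -- protected by listenerLock ----------------------------------
            std::mutex listenerLock;
            std::vector<Consumer*> listeners;     // listenerLock

            // -- no lock required -------------------------------------------
            const std::string name;               // immutable after construction

        public:
            explicit AnnotatedQueue(std::string n) : name(std::move(n)) {}
        };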

        The RWLock - hmm.. that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        Doh - sorry, replied with way too low a Caffeine/Blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        Gordon Sim wrote:

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues; enqueues would still be ordered correctly wrt each other).

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that is incremented after pushing the message but before populating the notification set, checking that count before trying to acquire the message on the consume side, and then rechecking it before adding the consumer to listeners if there was no message. If the number changes between the last two checks you would retry the acquisition.
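
        A minimal sketch of that counter scheme, using std::atomic and std::mutex rather than the broker's own sys:: wrappers; every name here (SketchQueue, tryAcquire, and so on) is illustrative, not the Queue.cpp code:

        // Sketch only: the publisher bumps pushCount after pushing but before
        // populating its notification set; a consumer that finds no message
        // re-checks the counter before parking itself on the listener list,
        // and retries the acquire if a push slipped in between.
        #include <atomic>
        #include <deque>
        #include <mutex>
        #include <optional>
        #include <vector>

        struct Msg { int id; };
        class Consumer {};

        class SketchQueue {
            std::mutex messageLock;              // protects messages only
            std::deque<Msg> messages;
            std::mutex listenerLock;             // protects listeners only
            std::vector<Consumer*> listeners;
            std::atomic<unsigned> pushCount{0};

            std::optional<Msg> tryAcquire() {
                std::lock_guard<std::mutex> l(messageLock);
                if (messages.empty()) return std::nullopt;
                Msg m = messages.front();
                messages.pop_front();
                return m;
            }
            static void notify(Consumer*) { /* wake the consumer's connection thread */ }

        public:
            void push(const Msg& m) {
                {
                    std::lock_guard<std::mutex> l(messageLock);
                    messages.push_back(m);
                }
                pushCount.fetch_add(1);          // incremented after the push...
                std::vector<Consumer*> toNotify;
                {
                    std::lock_guard<std::mutex> l(listenerLock);
                    toNotify.swap(listeners);    // ...but before populating the set
                }
                for (Consumer* c : toNotify) notify(c);
            }

            // Either returns a message or leaves the consumer registered for
            // notification; it never both misses the message and misses the wakeup.
            std::optional<Msg> consumeNextMessage(Consumer* c) {
                for (;;) {
                    unsigned before = pushCount.load();
                    if (std::optional<Msg> m = tryAcquire()) return m;
                    if (pushCount.load() != before) continue;   // a push raced us: retry
                    std::lock_guard<std::mutex> l(listenerLock);
                    if (pushCount.load() != before) continue;   // recheck before parking
                    listeners.push_back(c);
                    return std::nullopt;
                }
            }
        };

        Whether the extra retry loop pays for itself versus simply keeping the listener bookkeeping under the message lock would of course need the same kind of benchmarking as the rest of the patch.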

        Re: listeners - makes sense. I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).

        Re: ordering of events as seen by observers. I want to guarantee that the order of events is preserved WRT the observers - otherwise stuff will break as it stands (like flow control, which relies on correct ordering). Gordon - can you explain the acquire/dequeue reordering case? Does this patch introduce that reordering, or has that been possible all along?

        • Kenneth

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-08 11:00:25, Gordon Sim wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 957

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line957>

        >

        > Switching from the consumerLock to the messageLock may simplify the code but obviously won't improve the concurrency. The lock previously used here was dedicated to protecting an entirely distinct set of data. However, I would agree that there is in reality likely to be little gained.

        This may be overkill on my part: the reason for dropping the consumerLock and using the messageLock was simply locking consistency across the observer interface - using the same lock across all observer callouts. Maybe not necessary, but I believe the observers currently assume the queue is providing the locking, and that implies using the same lock.

        On 2012-03-08 11:00:25, Gordon Sim wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 411

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line411>

        >

        > As it stands, I believe this is unsafe. If you have an empty queue, with a concurrent consumeNextMessage() from a consumer and a push() from a publisher, then it is possible the consumer does not get a message and does not get notified that there is one.

        >

        > Moving the listeners.addListener() and the listeners.populate() (from push()) outside the lock scope means that the consuming thread can fail to get a message from the allocator; the publishing thread then adds a message and, before the consuming thread adds the consumer to the set of listeners, populates its notification set (which will be empty).

        >

        > Thus the message remains on the queue without being delivered to an interested consumer until something else happens to retrigger a notification.

        +1000 thanks - I see the problem now.
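
        Reduced to a skeleton, the window looks like this (deliberately unsafe, hypothetical names; the counter scheme sketched earlier on this page is one way to close it):

        // Deliberately unsafe skeleton of the race: the listener bookkeeping is
        // done outside the message lock, so a push() can land in the gap.
        #include <deque>
        #include <mutex>
        #include <vector>

        class Consumer {};

        class RacyQueue {
            std::mutex messageLock;
            std::deque<int> messages;
            std::vector<Consumer*> listeners;     // no longer updated under messageLock

        public:
            bool consumeNextMessage(Consumer* c) {
                {
                    std::lock_guard<std::mutex> l(messageLock);
                    if (!messages.empty()) { messages.pop_front(); return true; }
                }                                  // (1) consumer saw an empty queue
                // (2) a publisher can run all of push() right here
                listeners.push_back(c);            // (3) registered too late: no wakeup
                return false;
            }

            void push(int msg) {
                {
                    std::lock_guard<std::mutex> l(messageLock);
                    messages.push_back(msg);
                }
                std::vector<Consumer*> toNotify;
                toNotify.swap(listeners);          // empty if this runs between (1) and (3)
                // notifying toNotify wakes nobody, and msg just sits on the queue
            }
        };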

        On 2012-03-08 11:00:25, Gordon Sim wrote:

        > /trunk/qpid/cpp/src/qpid/broker/Queue.cpp, line 1071

        > <https://reviews.apache.org/r/4232/diff/1/?file=88780#file88780line1071>

        >

        > This looks like a subtle change to the logic that is perhaps unintentional? Previously, if ctxt was non-null the isEnqueued() test would still be executed and, if false, the method would return. That test is now skipped for the transactional case.

        >

        > (If I'm right, then a transactional dequeue attempt on a persistent message that has been removed, say due to a ring queue policy limit being reached, will result in a second dequeue attempt on the store that will most likely fail.)

        What the hell did I do? Thanks for catching that - totally unintended.
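
        Schematically, and with stand-in names taken from the comment above rather than the actual Queue::dequeue() code, the slip looks like this:

        // Stubs so the schematic compiles; the real broker types differ.
        struct TransactionContext {};
        struct Message {};
        static bool isEnqueued(const Message&) { return true; }  // false once e.g. a ring
                                                                 // queue limit removed it
        static void enlistTxnDequeue(TransactionContext&, const Message&) {}
        static void dequeueFromStore(const Message&) {}

        // Before: the isEnqueued() guard covers both paths.
        bool dequeueBefore(TransactionContext* ctxt, const Message& msg) {
            if (!isEnqueued(msg)) return false;       // runs even when ctxt is non-null
            if (ctxt) { enlistTxnDequeue(*ctxt, msg); return true; }
            dequeueFromStore(msg);
            return true;
        }

        // After (unintended): the transactional path returns before the guard, so a
        // message already removed by a ring-queue limit can hit the store a second time.
        bool dequeueAfter(TransactionContext* ctxt, const Message& msg) {
            if (ctxt) { enlistTxnDequeue(*ctxt, msg); return true; }
            if (!isEnqueued(msg)) return false;
            dequeueFromStore(msg);
            return true;
        }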

        • Kenneth

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5717
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind though is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code and satisfy ourselves of its correctness? A simple rule would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad that it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of a RWLock. It's always good to benchmark with a plain mutex before using one of these, as they are generally higher overhead; unless the use is "read lock mostly, write lock seldom", a plain mutex can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior, and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.

        The RWLock - hmm.. that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Kenneth Giusti wrote:

        Doh - sorry, replied with way too low a Caffeine/Blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues; enqueues would still be ordered correctly wrt each other).
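
        A bare-bones illustration of that reordering, with made-up names rather than the broker's observer interface: once the callout is made after the lock is dropped, two threads can report their acquires in the opposite order to the queue operations themselves.

        #include <mutex>

        struct Observer {
            virtual void acquired(int msg) = 0;    // e.g. what flow control hooks into
            virtual ~Observer() {}
        };

        class SketchQueue {
            std::mutex messageLock;
            Observer* observer;
        public:
            explicit SketchQueue(Observer* o) : observer(o) {}

            // Pre-patch shape: the callout happens inside the critical section,
            // so observers see acquires in the order they actually happened.
            void acquireUnderLock(int msg) {
                std::lock_guard<std::mutex> l(messageLock);
                // ... remove msg from the container ...
                observer->acquired(msg);
            }

            // Patched shape: between releasing the lock and making the callout,
            // another thread can acquire a later message *and* run its callout
            // first, so the observer sees the two acquires swapped.
            void acquireThenNotify(int msg) {
                {
                    std::lock_guard<std::mutex> l(messageLock);
                    // ... remove msg from the container ...
                }
                observer->acquired(msg);
            }
        };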

        Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that is incremented after pushing the message but before populating the notification set, checking that count before trying to acquire the message on the consume side, and then rechecking it before adding the consumer to listeners if there was no message. If the number changes between the last two checks you would retry the acquisition.

        • Gordon

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind though is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code and satisfy ourselves of its correctness? A simple rule would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad that it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of a RWLock. It's always good to benchmark with a plain mutex before using one of these, as they are generally higher overhead; unless the use is "read lock mostly, write lock seldom", a plain mutex can win.

        Kenneth Giusti wrote:

        Re: locking rules - good point Andrew. Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior, and trying to identify what can be safely moved outside the locks. I'll update this patch with comments that explicitly call out those data members & actions that must hold the lock.

        The RWLock - hmm.. that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        Doh - sorry, replied with way too low a Caffeine/Blood ratio. You're referring to the lock in the listener - yes. Gordon points out that this may be unsafe, so I need to think about it a bit more.

        • Kenneth

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 S1 :: S2 :: R1

        73144 :: 72524 :: 112914 91120 :: 83278 :: 133730

        70299 :: 68878 :: 110905 91079 :: 87418 :: 133649

        70002 :: 69882 :: 110241 83957 :: 83600 :: 131617

        70523 :: 68681 :: 108372 83911 :: 82845 :: 128513

        68135 :: 68022 :: 107949 85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk: patched:

        tp(m/s) tp(m/s)

        S1 :: S2 :: R1 :: R2 S1 :: S2 :: R1 :: R2

        52826 :: 52364 :: 52386 :: 52275 64826 :: 64780 :: 64811 :: 64638

        51457 :: 51269 :: 51399 :: 51213 64630 :: 64157 :: 64562 :: 64157

        50557 :: 50432 :: 50468 :: 50366 64023 :: 63832 :: 64092 :: 63748

        **Note: pre-patch, qpidd ran steadily at about 270% cpu during this test. With the patch, qpidd's cpu utilization was steady at 300%

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        86150 :: 84146 :: 87665        90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469        87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455        89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-03-07 21:30:28, Andrew Stitcher wrote:

        > This is great work.

        >

        > The most important thing on my mind, though, is how do we ensure that work like this maintains correctness?

        >

        > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code to satisfy its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        >

        > On another point - I see you've added the use of an RWLock. It's always good to benchmark with a plain mutex before using one of these, as they generally have higher overhead; unless the usage is "read-lock mostly, write-lock seldom", plain mutexes can win.

        Re: locking rules - good point, Andrew. I made most (all?) of these changes simply by manually reviewing the existing code's locking behavior and trying to identify what can safely be moved outside the locks. I'll update this patch with comments that explicitly call out the data members and actions that must be accessed with the lock held.

        The RWLock - hmm, that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to. I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.

        • Kenneth
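
        For illustration, the kind of lock-coverage annotation being asked for might look something like the sketch below (hypothetical member names, not the real Queue.h layout):

        // Hypothetical sketch only - documents which members each lock covers.
        #include <deque>
        #include <mutex>

        class QueueSketch {
            // -- protected by messageLock --------------------------------------
            // messages_ and the listener/observer containers may only be read or
            // modified while messageLock is held; observers are notified under
            // the lock so they see events in queue order.
            std::mutex messageLock;
            std::deque<int> messages_;

            // -- not protected by messageLock ----------------------------------
            // statistics/management counters, updated after the lock is released.
            unsigned long enqueueCount_ = 0;
        };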

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        73144 :: 72524 :: 112914       91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905       91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241       83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372       83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949       85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)                     patched tp (msgs/sec)

        S1    :: S2    :: R1    :: R2           S1    :: S2    :: R1    :: R2
        52826 :: 52364 :: 52386 :: 52275        64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213        64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366        64023 :: 63832 :: 64092 :: 63748

        **Note: pre-patch, qpidd ran steadily at about 270% cpu during this test. With the patch, qpidd's cpu utilization was steady at 300%

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        86150 :: 84146 :: 87665        90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469        87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455        89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5717
        -----------------------------------------------------------

        The numbers are encouraging! As mentioned below, however, I think this still needs some work to be safe.

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp
        <https://reviews.apache.org/r/4232/#comment12449>

        As it stands, I believe this is unsafe. If you have an empty queue with a concurrent consumeNextMessage() from a consumer and a push() from a publisher, then it is possible that the consumer does not get a message and does not get notified that one is available.

        Moving listeners.addListener() and listeners.populate() (from push()) outside the lock scope means that the consuming thread can fail to get a message from the allocator; the publishing thread then adds a message, and before the consuming thread adds the consumer to the set of listeners, the publishing thread populates its notification set (which is still empty).

        Thus the message remains on the queue without being delivered to an interested consumer until something else happens to retrigger a notification.
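
        To make the race concrete, here is a minimal self-contained sketch (hypothetical names and standard-library locks, not the broker's actual code) of why the listener registration has to stay inside the same critical section that observed the empty queue, while the notification itself can still happen outside the lock:

        // Lost-wakeup sketch: if the "queue is empty -> register listener" step
        // and the "push -> collect listeners" step are not serialized by the same
        // lock, a push can slip in between them and its notification set will be
        // empty, leaving the message undelivered until something else triggers a
        // new notification.
        #include <deque>
        #include <mutex>
        #include <vector>

        struct Consumer { void notify() { /* wake the consumer */ } };

        class QueueSketch {
            std::mutex messageLock;
            std::deque<int> messages_;
            std::vector<Consumer*> listeners_;
        public:
            bool consumeNextMessage(int& out, Consumer& c) {
                std::lock_guard<std::mutex> guard(messageLock);
                if (messages_.empty()) {
                    listeners_.push_back(&c);   // registered before the lock drops
                    return false;
                }
                out = messages_.front();
                messages_.pop_front();
                return true;
            }
            void push(int m) {
                std::vector<Consumer*> toNotify;
                {
                    std::lock_guard<std::mutex> guard(messageLock);
                    messages_.push_back(m);
                    toNotify.swap(listeners_);  // populate the set under the lock
                }
                for (Consumer* c : toNotify)    // notify outside the lock
                    c->notify();
            }
        };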

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp
        <https://reviews.apache.org/r/4232/#comment12451>

        Switching from the consumerLock to the messageLock may simplify the code but obviously won't improve the concurrency. The lock previously used here was dedicated to protecting an entirely distinct set of data. However, I would agree that in reality there is likely to be little gained.

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp
        <https://reviews.apache.org/r/4232/#comment12450>

        This looks like a subtle change to the logic that is perhaps unintentional? Previously, if ctxt was non-null the isEnqueued() test would still be executed and, if it returned false, the method would return. That test is now skipped for the transactional case.

        (If I'm right, then a transactional dequeue attempt on a persistent message that has already been removed - say, due to a ring queue policy limit being reached - will result in a second dequeue attempt on the store that will most likely fail.)

        • Gordon

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        73144 :: 72524 :: 112914       91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905       91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241       83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372       83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949       85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)                     patched tp (msgs/sec)

        S1    :: S2    :: R1    :: R2           S1    :: S2    :: R1    :: R2
        52826 :: 52364 :: 52386 :: 52275        64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213        64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366        64023 :: 63832 :: 64092 :: 63748

        **Note: pre-patch, qpidd ran steadily at about 270% cpu during this test. With the patch, qpidd's cpu utilization was steady at 300%

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        86150 :: 84146 :: 87665        90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469        87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455        89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/#review5691
        -----------------------------------------------------------

        This is great work.

        The most important thing on my mind, though, is how do we ensure that work like this maintains correctness?

        Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code to satisfy its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad it's difficult for me to inspect the code and reconstruct what rules now apply (if any).

        On another point - I see you've added the use of an RWLock. It's always good to benchmark with a plain mutex before using one of these, as they generally have higher overhead; unless the usage is "read-lock mostly, write-lock seldom", plain mutexes can win.

        • Andrew
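
        A quick way to sanity-check that advice is a micro-benchmark along these lines (standard-library lock types for illustration only - the broker has its own lock wrappers, and contended cases should be measured as well):

        // Times an uncontended lock/unlock loop: plain mutex vs. reader lock.
        #include <chrono>
        #include <cstdio>
        #include <mutex>
        #include <shared_mutex>

        template <typename LockFn>
        static double timeLoop(LockFn fn, int iters = 1000000) {
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < iters; ++i) fn();
            std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            return elapsed.count();
        }

        int main() {
            std::mutex m;
            std::shared_mutex rw;
            double plain  = timeLoop([&] { std::lock_guard<std::mutex> l(m); });
            double reader = timeLoop([&] { std::shared_lock<std::shared_mutex> l(rw); });
            std::printf("plain mutex: %.3fs   shared (read) lock: %.3fs\n",
                        plain, reader);
            return 0;
        }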

        On 2012-03-07 19:23:36, Kenneth Giusti wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4232/

        -----------------------------------------------------------

        (Updated 2012-03-07 19:23:36)

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary

        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned - I've run the following tests on an 8 core Xeon X5550 2.67Ghz box. I used qpid-send and qpid-receive to generate traffic. All traffic generated in parallel.

        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        73144 :: 72524 :: 112914       91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905       91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241       83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372       83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949       85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)                     patched tp (msgs/sec)

        S1    :: S2    :: R1    :: R2           S1    :: S2    :: R1    :: R2
        52826 :: 52364 :: 52386 :: 52275        64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213        64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366        64023 :: 63832 :: 64092 :: 63748

        **Note: pre-patch, qpidd ran steadily at about 270% cpu during this test. With the patch, qpidd's cpu utilization was steady at 300%

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:

        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d

        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d

        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0

        qpid-config -b $DST_HOST add exchange fanout dstX1

        qpid-config -b $DST_HOST bind dstX1 dstQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1

        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2

        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &

        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        86150 :: 84146 :: 87665        90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469        87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455        89790 :: 88573 :: 104119

        Still TBD:

        - fix message group errors

        - verify/fix cluster

        This addresses bug qpid-3890.

        https://issues.apache.org/jira/browse/qpid-3890

        Diffs

        -----

        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185

        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff

        Testing

        -------

        unit test (non-cluster)

        performance tests as described above.

        Thanks,

        Kenneth

        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4232/
        -----------------------------------------------------------

        Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.

        Summary
        -------

        In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.

        I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box. I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.

        As far as performance is concerned, I've run the following tests on an 8-core Xeon X5550 2.67GHz box. I used qpid-send and qpid-receive to generate traffic; all traffic was generated in parallel.
        All numbers are in messages/second - sorted by best overall throughput.

        Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)            patched tp (msgs/sec)

        S1    :: S2    :: R1           S1    :: S2    :: R1
        73144 :: 72524 :: 112914       91120 :: 83278 :: 133730
        70299 :: 68878 :: 110905       91079 :: 87418 :: 133649
        70002 :: 69882 :: 110241       83957 :: 83600 :: 131617
        70523 :: 68681 :: 108372       83911 :: 82845 :: 128513
        68135 :: 68022 :: 107949       85999 :: 81863 :: 125575

        Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=4
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk tp (msgs/sec)                     patched tp (msgs/sec)

        S1    :: S2    :: R1    :: R2           S1    :: S2    :: R1    :: R2
        52826 :: 52364 :: 52386 :: 52275        64826 :: 64780 :: 64811 :: 64638
        51457 :: 51269 :: 51399 :: 51213        64630 :: 64157 :: 64562 :: 64157
        50557 :: 50432 :: 50468 :: 50366        64023 :: 63832 :: 64092 :: 63748

        **Note: pre-patch, qpidd ran steadily at about 270% CPU during this test. With the patch, qpidd's CPU utilization was steady at 300%.

        Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)

        test setup and execute info:
        qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d
        qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d
        qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
        qpid-config -b $DST_HOST add exchange fanout dstX1
        qpid-config -b $DST_HOST bind dstX1 dstQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1
        qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2
        qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
        qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
        qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &

        trunk:                        patched:
        throughput (msgs/sec)         throughput (msgs/sec)

        S1    :: S2    :: R1          S1    :: S2    :: R1
        86150 :: 84146 :: 87665       90877 :: 88534 :: 108418
        89188 :: 88319 :: 82469       87014 :: 86753 :: 107386
        88221 :: 85298 :: 82455       89790 :: 88573 :: 104119
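
        For context on the gains above: the patch narrows the scope over which the Queue's message lock is held, so less code runs inside the critical section. The sketch below illustrates the general idea on a much-simplified queue -- the container is mutated under the mutex, and a statistics counter (used here only as an example of work that does not need the lock) is updated after release. Class and member names are invented and are not taken from Queue.cpp.

        // Illustrative only: a simplified queue with invented names, showing the
        // lock-narrowing idea -- mutate the container while holding the mutex,
        // update statistics after the mutex has been released.
        #include <atomic>
        #include <cstdint>
        #include <deque>
        #include <mutex>
        #include <string>

        class SketchQueue {
            std::mutex messageLock;                      // protects 'messages' only
            std::deque<std::string> messages;            // stand-in for broker messages
            std::atomic<std::uint64_t> enqueueCount{0};  // stats no longer need the lock

        public:
            void deliver(const std::string& msg) {
                {
                    std::lock_guard<std::mutex> guard(messageLock);
                    messages.push_back(msg);             // only the container mutation
                }                                        // lock released here
                enqueueCount.fetch_add(1, std::memory_order_relaxed);  // stats outside the lock
            }
        };

        With the critical section reduced to the container operation itself, producer and consumer threads contend for the lock only briefly, which is consistent with the higher CPU utilization noted for Test 2 above.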

        Still TBD:

        • fix message group errors
        • verify/fix cluster

        This addresses bug qpid-3890.
        https://issues.apache.org/jira/browse/qpid-3890

        Diffs


        /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185
        /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185
        /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185

        Diff: https://reviews.apache.org/r/4232/diff
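
        The diff above also touches QueueListeners.h/.cpp. One common way to keep consumer wake-ups out of a queue's critical section -- shown below purely as a hedged sketch with invented names, not necessarily what this diff does -- is to copy the waiting listeners out while the lock is held and invoke their callbacks only after it has been released:

        // Hedged sketch with invented names: copy the waiting listeners out while
        // the lock is held, invoke the callbacks only after it has been released.
        #include <functional>
        #include <mutex>
        #include <utility>
        #include <vector>

        class SketchListeners {
            std::mutex lock;                             // protects 'waiting' only
            std::vector<std::function<void()>> waiting;  // consumers blocked on the queue

        public:
            void addListener(std::function<void()> cb) {
                std::lock_guard<std::mutex> guard(lock);
                waiting.push_back(std::move(cb));
            }

            // Called by the queue after it has enqueued a message.
            void notifyAll() {
                std::vector<std::function<void()>> toNotify;
                {
                    std::lock_guard<std::mutex> guard(lock);
                    toNotify.swap(waiting);              // grab the list under the lock
                }                                        // lock released here
                for (auto& notify : toNotify)
                    notify();                            // callbacks run without the lock
            }
        };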

        Testing
        -------

        unit test (non-cluster)
        performance tests as described above.

        Thanks,

        Kenneth

        Ken Giusti created issue

          People

          • Assignee:
            Ken Giusti
          • Reporter:
            Ken Giusti