ActiveMQ Apollo
  APLO-244

Apollo does not give priority to outgoing messages under stress

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Using the stomp-benchmark scenario file attached to APLO-241, one can play with the number of producers and the overall message rate via the producer_sleep parameter.

      On good hardware, Apollo could handle 60k msg/sec coming from 30k concurrent clients. This is good!

      However, when the message rate is increased further, Apollo spends most of its time queuing the messages it cannot deliver (since I have slow_consumer_policy=queue) instead of delivering them. The end result is that the topic consumer gets no messages at all, making the situation even worse (a bigger message store).
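      For context, slow_consumer_policy is configured per topic in Apollo's broker XML. A minimal sketch of the setup described here (the virtual host id and topic id are illustrative, not taken from this report) would be:

```xml
<!-- Fragment of an Apollo broker config (illustrative ids):
     queue messages for slow topic subscribers instead of blocking producers -->
<virtual_host id="default">
  <topic id="load.test" slow_consumer_policy="queue"/>
</virtual_host>
```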

      For instance:

      c_c1 samples: [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
      p_p1 samples: [ 275403,208512,241756,238954,241702,269398,362274,390104,437176,408510,459625,454061,447085,454244,419557,413095,235546,76322,40296,12285,2979,1891,1408,1110,866,824,921,909,909,984,988,861,601,332,251,164,165,256,425,495,495,583,660,656,551,783,825,668,680,447,164,165,165,165,165,178,402,329,330,209 ]
      e_p1 samples: [ 0,0,0,4,30,0,0,0,0,0,30,1,0,0,3,1,10,15,0,0,3,1,10,12,3,3,2,2,1,21,2,2,1,0,2,10,11,3,2,3,3,10,15,0,2,2,4,2,20,1,2,6,1,2,18,1,2,3,2,3 ]
      p_p2 samples: [ 204727,164350,193413,196598,193711,217585,269030,278368,319091,296477,325126,323323,267829,118000,35694,24693,15922,7836,4554,3212,2669,1595,1150,1116,994,538,984,1364,1276,1209,1322,1066,992,771,714,496,495,469,535,620,514,581,568,496,273,359,370,281,186,124,4,0,0,0,0,21,124,123,124,124 ]
      e_p2 samples: [ 0,0,0,8,29,1,0,0,0,0,33,2,1,0,1,2,20,11,0,0,0,5,17,9,1,2,3,5,1,21,0,4,5,3,3,10,10,5,3,3,2,14,9,6,6,2,3,3,17,4,4,1,5,1,22,2,4,3,3,9 ]
      p_p3 samples: [ 182945,158541,179534,175291,179664,198728,245723,259168,274541,190171,63161,24280,11886,6765,5802,5324,4299,2962,2183,1743,1273,779,424,262,83,0,363,596,596,429,380,397,368,307,1069,992,837,661,503,333,239,174,102,99,14,174,198,123,124,199,104,99,141,198,198,184,79,0,0,0 ]
      e_p3 samples: [ 0,0,0,10,32,1,0,0,0,1,39,2,1,0,2,1,21,12,0,1,3,2,21,10,1,2,2,3,3,23,4,2,4,3,1,16,10,2,3,4,2,16,12,4,4,4,4,5,18,3,2,3,10,3,21,1,5,5,5,2 ]
      p_p4 samples: [ 195131,164391,192826,184336,187915,206291,212340,103434,19850,7087,4013,3124,2173,2444,2389,1522,1074,1237,1149,820,517,316,197,249,289,176,102,166,84,0,833,892,353,36,332,311,59,0,0,0,0,0,0,0,0,0,0,0,87,332,321,154,0,0,0,12,83,37,0,0 ]
      e_p4 samples: [ 0,0,0,3,38,2,0,0,0,0,34,1,0,0,4,2,15,20,0,0,0,3,11,19,3,3,5,4,3,23,0,2,2,3,4,12,14,5,5,3,4,9,13,4,2,4,3,6,24,5,5,2,2,2,20,4,5,3,5,4 ]

      Would it be possible to give more priority to the outgoing messages?

      1. sbs-20120904-092317.pdf
        244 kB
        Lionel Cons
      2. sbs-20120904-085840.pdf
        253 kB
        Lionel Cons
      3. sbs-20120903-151840.pdf
        311 kB
        Lionel Cons
      4. sbs-20120903-145705.pdf
        302 kB
        Lionel Cons
      5. sbs-20120903-143634.pdf
        287 kB
        Lionel Cons
      6. activemq.xml
        5 kB
        Lionel Cons
      7. sbs-20120822-150021.pdf
        266 kB
        Lionel Cons
      8. sbs-20120822-143646.pdf
        306 kB
        Lionel Cons
      9. APLO-244-89.png
        149 kB
        Lionel Cons
      10. APLO-244-87.png
        101 kB
        Lionel Cons
      11. APLO-244.png
        33 kB
        Lionel Cons
      12. APLO-244.xml
        1 kB
        Lionel Cons

        Activity

        Lionel Cons added a comment -

        Here is the scenario file I've used to reproduce the problem. This probably needs to be adjusted on different hardware.

        Hiram Chirino added a comment -

        BTW it would be better to run the consuming client on another machine so that the consuming client is not contending for resources. By doing that I got better dequeue performance, but the enqueue performance is still dominating.

        Looking to see what can be done.

        Hiram Chirino added a comment -

        I'm also able to keep the consumers up with the producers if you lower the default 'fast_delivery_rate' setting of the queue. Example:

        <topic slow_consumer_policy="queue">
        <subscription fast_delivery_rate="100k"/>
        </topic>

        Lionel Cons added a comment -

        I've tried fast_delivery_rate="100k" and the situation slightly improves. Instead of no messages delivered to the consumer, some go through at the beginning. Then the consumed rate stays at zero and the produced rate also goes to zero; see the attached graph showing what stomp-benchmark reported.

        Hiram Chirino added a comment -

        I need to get better hardware. What kind are you running on?

        Lionel Cons added a comment -

        Each box (the one running stomp-benchmark and the one running Apollo) has 2 x quad-core CPUs (Xeon @ 2.27GHz), with hyperthreading enabled. So 16 cores seen by the OS, along with 24GB of RAM.

        Hiram Chirino added a comment -

        Ok. I've been testing on a pair of EC2 cc2.8xlarge instances, which seem similarly sized but with more memory. With the broker configured with:

        <topic slow_consumer_policy="queue">
        <subscription fast_delivery_rate="100k"/>
        </topic>

        Using the changes in build: https://repository.apache.org/content/repositories/snapshots/org/apache/activemq/apache-apollo/99-trunk-SNAPSHOT/apache-apollo-99-trunk-20120820.172920-89-unix-distro.tar.gz

        and running the consumer on the same box as the broker but the producers on the other box, I've been seeing the dequeue rates keep up with the producer rates, with little or no message swapping occurring.

        Lionel Cons added a comment -

        Unfortunately, this latest snapshot (89) does not produce better results in my environment. It even makes the results of another test (same scenario, different values) much worse; see the attached files:

        • APLO-244-87.png: results with the 87 snapshot
        • APLO-244-89.png: results with the 89 snapshot

        In both cases, this is the exact same hardware and config on both ends (stomp-benchmark and Apollo) and I do use fast_delivery_rate=100k.

        Apollo tuning is tricky; time to address APLO-164?

        Hiram Chirino added a comment -

        Are you running the consumer on a different box, or are you still running it along with the producers?

        Lionel Cons added a comment -

        I simply use stomp-benchmark with the attached scenario so both producers and consumer are indeed on the same box.

        Hiram Chirino added a comment -

        Ok. Yeah I get the same results as you in that case. But that's mostly because the consumer really is not consuming at full speed. This just seems to be a shortcoming of the benchmark test harness.

        You need to change the test harness to give the 1 consumer the same priority as your 10,000 producers in aggregate. So split the benchmark settings up into 2 files. One with the producers and another with the consumer and run them on 2 different machines. You should see different results.

        Lionel Cons added a comment -

        I'm afraid the problem is not on the consuming side.

        Please look at the two attached reports. In sbs-20120822-143646.pdf (ActiveMQ), we see that messages get delivered. In sbs-20120822-150021.pdf (Apollo), the messages get delivered at the beginning, at a rate lower than the producing rate. Then this halts: the broker does not send any more messages at all and gets into a weird state, using more threads.

        These tests have been executed on the very same hardware, with the same stomp-benchmark file. The only difference is the broker being benchmarked.

        Hiram Chirino added a comment -

        Hi Lionel,

        Could you attach the ActiveMQ config that you're using for the test run? I'd like to also test it for comparison as you have done.

        Lionel Cons added a comment -

        Here it is...

        Hiram Chirino added a comment -

        Ok. So the configuration of ActiveMQ and Apollo is not really apples to apples.

        For example, if you make the consumer slow (have it sleep 1 second between received messages), Apollo should allow producers to continue to produce, and it will swap messages to disk without stopping the producers. The ActiveMQ configuration, on the other hand, will buffer all the messages in memory until you run out of memory, at which point the producer rate should drop to zero.

        Hiram Chirino added a comment -

        I think the closest way to configure Apollo to have similar behavior would be to use something like:

        <topic slow_consumer_policy="queue">
        <subscription persistent="false" quota="500M"/>
        </topic>

        This would keep the messages from ever taking a persistence hit and allow up to 500M of messages to buffer up in memory before the producers are slowed down.

        If you want to avoid blocking producers even if consumers are being slow and don't mind taking the persistence penalty, you could try something like:

        <topic slow_consumer_policy="queue">
        <subscription fast_delivery_rate="100k" tail_buffer="500M"/>
        </topic>

        This should help keep most messages in memory to avoid the hit of swapping messages out if the consumer falls behind.

        Lionel Cons added a comment -

        Thanks for these configuration tips. They should probably end up in APLO-164.

        Lionel Cons added a comment -

        Here are some benchmark results, with the same use case (many slow producers, one fast consumer), with the exact same setup (hardware, network...).

        20120903-145705: ActiveMQ, handles the load nicely, 13670k messages went through.

        20120904-085840: Apollo (default, i.e. slow_consumer_policy="block"), cannot stand the load, deliveries stop half-way, 3530k messages went through.

        20120903-143634: Apollo (slow_consumer_policy="queue"), cannot stand the load, deliveries stop half-way, 2281k messages went through, very similar to 20120904-085840.

        20120904-092317: Apollo (slow_consumer_policy="queue", persistent="false", quota="2000M"), cannot stand the load, deliveries stop 1/3 of the way, only 1063k messages went through.

        20120903-151840: Apollo (slow_consumer_policy="queue", fast_delivery_rate="100k", tail_buffer="2000M"), stands the load but slow delivery and huge backlog in the end, 8143k messages went through (~ 60% of the messages received by the broker).

        See the attached reports for more information.

        Hiram Chirino added a comment -

        Hi Lionel,

        Don't want you to think I've forgotten about this issue. I'm still working on it. The slow_consumer_policy="block" scenario should be much better as of tonight's snapshot. I'm hopeful the other scenarios improve as well, but I've not had the time to re-test those yet. I'll update the issue once I know more.


          People

          • Assignee:
            Unassigned
            Reporter:
            Lionel Cons
          • Votes:
            0
            Watchers:
            2
