James Server / JAMES-3749

Better metrics for RabbitMQ

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.7.0
    • Fix Version/s: 3.8.0
    • Component/s: Metrics, rabbitmq
    • Labels: None

    Description

      To my surprise, IMAP performance tests were highly limited by RabbitMQ.

      We lacked decent metrics on RabbitMQ / the event bus to clearly audit this.

      I added a few more metrics. Here are the results:

      name=rabbit-acquire, count=52615, min=0.010816, max=2197.815295, mean=14.692384926275777, stddev=84.45677147245601, p50=0.075775, p75=0.203775, p95=63.176703, p98=199.229439, p99=375.390207, p999=1216.348159, m1_rate=203.7148778352276, m5_rate=119.5444112071225, m15_rate=51.27213833578196, mean_rate=83.30809197633765, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-dispatch, count=27858, min=0.333824, max=2365.587455, mean=54.42362489080336, stddev=132.51032578954067, p50=15.466495, p75=43.253759, p95=229.638143, p98=432.013311, p99=633.339903, p999=1753.219071, m1_rate=109.2818061840995, m5_rate=70.86014994542329, m15_rate=40.57835311284083, mean_rate=104.18175115750351, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-register, count=2976, min=9.633792, max=5603.590143, mean=179.2821071827957, stddev=508.63321381095363, p50=50.331647, p75=103.809023, p95=687.865855, p98=2013.265919, p99=3003.121663, p999=5100.273663, m1_rate=3.7740876538564017, m5_rate=9.38568671432365, m15_rate=9.574694038543646, mean_rate=11.166515283444058, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-release, count=52600, min=6.64E-4, max=131.596287, mean=0.12847917764258554, stddev=1.7017175408795067, p50=0.006111, p75=0.010303, p95=0.035583, p98=1.269759, p99=2.719743, p999=17.432575, m1_rate=204.4922821701763, m5_rate=136.5136219818052, m15_rate=81.73827938617151, mean_rate=197.1284478914709, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-unregister, count=449, min=10.878976, max=2466.250751, mean=190.00389787082403, stddev=380.5671338872364, p50=51.118079, p75=135.266303, p95=1010.827263, p98=1702.887423, p99=1912.602623, p999=2466.250751, m1_rate=9.012783767745082, m5_rate=5.543710748795059, m15_rate=4.918174526687269, mean_rate=19.889486577715797, rate_unit=events/second, duration_unit=milliseconds
      

      Analysis:

      • dispatch takes a really long time and negatively impacts all other operations
      • the channel pool was undersized (contention to get a channel)
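
To illustrate what a metric such as rabbit-acquire captures, here is a minimal sketch of a fixed-size pool that records how long callers wait to borrow an item. It is not the James implementation: `TimedPool` and its methods are hypothetical names, and plain counters stand in for a real metrics library. With an undersized pool, the mean wait grows as callers queue up for a free item.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical fixed-size pool that records how long each caller waits to borrow an item.
final class TimedPool<T> {
    private final BlockingQueue<T> idle;
    private final LongAdder acquireCount = new LongAdder();
    private final LongAdder acquireNanos = new LongAdder();

    TimedPool(List<T> items) {
        // Capacity equals the pool size; returning borrowed items always fits.
        this.idle = new ArrayBlockingQueue<>(items.size(), false, items);
    }

    // Blocks up to the timeout; the time spent waiting is recorded even on failure.
    T acquire(long timeout, TimeUnit unit) {
        long start = System.nanoTime();
        try {
            T item = idle.poll(timeout, unit);
            if (item == null) {
                throw new IllegalStateException("no item available within timeout");
            }
            return item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            acquireCount.increment();
            acquireNanos.add(System.nanoTime() - start);
        }
    }

    // Hands a previously borrowed item back to the pool.
    void release(T item) {
        idle.offer(item);
    }

    long acquireCount() {
        return acquireCount.sum();
    }

    // Mean wait per acquire attempt, in milliseconds.
    double meanAcquireMillis() {
        long count = acquireCount.sum();
        return count == 0 ? 0.0 : acquireNanos.sum() / 1_000_000.0 / count;
    }
}
```

A pool holding a single item makes the effect easy to see: the second concurrent caller waits until the first one calls `release`, and that wait shows up directly in `meanAcquireMillis()`.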

      I tried the following:

      • Better reactive code: not a game changer, to be honest.
      • Shorter routing key (don't include the full FQDN): small performance gains.
      • Disable publish confirms: game changer! Dispatch went from a mean of 50ms+ and a p99 of 500ms+ to a mean of 1ms and a p99 of 8ms. All other metrics (bind / unbind) improved as well, and contention to acquire a channel is effectively gone.
      • Turning off durability on notification channels unlocked further gains.

      name=rabbit-acquire, count=1380387, min=0.005824, max=132.120575, mean=0.120752084973272, stddev=0.47552513438388405, p50=0.056831, p75=0.096767, p95=0.354303, p98=0.692223, p99=1.122303, p999=4.915199, m1_rate=1.6804637345701686E-238, m5_rate=9.96453889901133E-47, m15_rate=9.66533044449863E-15, mean_rate=32.71739356870932, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-dispatch, count=757489, min=0.063232, max=245.366783, mean=0.6006688857527964, stddev=1.0950763726058712, p50=0.456703, p75=0.610303, p95=1.310719, p98=2.064383, p99=2.949119, p999=9.764863, m1_rate=9.8844762264434E-239, m5_rate=5.952316454575153E-47, m15_rate=5.761608528831366E-15, mean_rate=17.954172496560656, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-register, count=18810, min=3.6864, max=1317.011455, mean=21.051024209250397, stddev=41.433421890807836, p50=12.058623, p75=19.529727, p95=66.322431, p98=106.954751, p99=155.189247, p999=507.510783, m1_rate=1.5731326523774949E-260, m5_rate=2.4205436310646514E-54, m15_rate=3.643125827717571E-18, mean_rate=0.4463666568673792, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-release, count=1380385, min=6.6E-4, max=131.596287, mean=0.00816027294848901, stddev=0.12039944630360619, p50=0.006143, p75=0.009087, p95=0.013183, p98=0.021759, p99=0.034047, p999=0.230399, m1_rate=1.6486782760303405E-238, m5_rate=9.925571881590765E-47, m15_rate=9.652806233445519E-15, mean_rate=32.71825157947973, rate_unit=events/second, duration_unit=milliseconds
      
      name=rabbit-unregister, count=18810, min=3.11296, max=1761.607679, mean=28.082487391812865, stddev=64.54031379534226, p50=12.582911, p75=20.971519, p95=100.139007, p98=192.937983, p99=287.309823, p999=805.306367, m1_rate=7.66299030904417E-260, m5_rate=1.8971818595887257E-53, m15_rate=8.637433599041584E-18, mean_rate=0.44693388443373244, rate_unit=events/second, duration_unit=milliseconds
      

      I was able to double the request count in my IMAP benchmarks and still get a threefold latency reduction.
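
For comparing two runs field by field, the reporter lines above can be parsed as comma-separated key=value pairs. The following is a minimal sketch under that assumption; `MetricLine`, `parse` and `ratio` are hypothetical helper names, not part of James or of any metrics library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for comparing two "key=value, key=value" metric lines.
final class MetricLine {
    // Splits a reporter line into its fields, keeping the original field order.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.trim().split(",\\s*")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    // Returns before/after for one numeric field, e.g. "mean" or "p99".
    static double ratio(String before, String after, String field) {
        double b = Double.parseDouble(parse(before).get(field));
        double a = Double.parseDouble(parse(after).get(field));
        return b / a;
    }
}
```

Applied to the two rabbit-dispatch lines, `MetricLine.ratio(before, after, "mean")` gives a factor of about 90 (54.42 ms against 0.60 ms).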

      Proposals

      • Offer an option to disable publish confirms. This new James 3.7.0 behaviour brings cool resiliency semantics but is definitely harmful for scalability. We can imagine some users wanting to turn that off.
      • Offer a way to turn off durability on notifications. Notifications are likely not critical, and their loss is acceptable.
      • Add those cool RabbitMQ metrics.
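
As a sketch of how the first two proposals could surface in configuration, the snippet below reads two boolean options from a properties file. The property names and the `EventBusOptions` class are assumptions made for illustration; this ticket does not specify them. Both options default to true here so that an unconfigured deployment keeps confirms and durability on.

```java
import java.util.Properties;

// Hypothetical holder for the two proposed toggles; names are illustrative only.
final class EventBusOptions {
    // Assumed property keys, not confirmed by this ticket.
    static final String PUBLISH_CONFIRM_KEY = "event.bus.publish.confirm.enabled";
    static final String NOTIFICATION_DURABILITY_KEY = "event.bus.notification.durability.enabled";

    final boolean publishConfirmEnabled;
    final boolean notificationDurabilityEnabled;

    private EventBusOptions(boolean publishConfirmEnabled, boolean notificationDurabilityEnabled) {
        this.publishConfirmEnabled = publishConfirmEnabled;
        this.notificationDurabilityEnabled = notificationDurabilityEnabled;
    }

    // Missing keys fall back to "true", the safe (resilient but slower) setting.
    static EventBusOptions from(Properties props) {
        return new EventBusOptions(
            Boolean.parseBoolean(props.getProperty(PUBLISH_CONFIRM_KEY, "true")),
            Boolean.parseBoolean(props.getProperty(NOTIFICATION_DURABILITY_KEY, "true")));
    }
}
```

Opting out is then an explicit choice: a deployment that prefers throughput over delivery guarantees sets one or both keys to false.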

      And of course, invest in an alternative to RabbitMQ that does not force us to choose between throughput and safety. Thoughts: Pulsar.

          People

            Assignee: Unassigned
            Reporter: Benoit Tellier (btellier)