Description
To my surprise, IMAP performance tests were highly limited by RabbitMQ.
We lacked decent metrics on RabbitMQ / the event bus to clearly audit this.
I added a few additional metrics. Here are the results:
name=rabbit-acquire, count=52615, min=0.010816, max=2197.815295, mean=14.692384926275777, stddev=84.45677147245601, p50=0.075775, p75=0.203775, p95=63.176703, p98=199.229439, p99=375.390207, p999=1216.348159, m1_rate=203.7148778352276, m5_rate=119.5444112071225, m15_rate=51.27213833578196, mean_rate=83.30809197633765, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-dispatch, count=27858, min=0.333824, max=2365.587455, mean=54.42362489080336, stddev=132.51032578954067, p50=15.466495, p75=43.253759, p95=229.638143, p98=432.013311, p99=633.339903, p999=1753.219071, m1_rate=109.2818061840995, m5_rate=70.86014994542329, m15_rate=40.57835311284083, mean_rate=104.18175115750351, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-register, count=2976, min=9.633792, max=5603.590143, mean=179.2821071827957, stddev=508.63321381095363, p50=50.331647, p75=103.809023, p95=687.865855, p98=2013.265919, p99=3003.121663, p999=5100.273663, m1_rate=3.7740876538564017, m5_rate=9.38568671432365, m15_rate=9.574694038543646, mean_rate=11.166515283444058, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-release, count=52600, min=6.64E-4, max=131.596287, mean=0.12847917764258554, stddev=1.7017175408795067, p50=0.006111, p75=0.010303, p95=0.035583, p98=1.269759, p99=2.719743, p999=17.432575, m1_rate=204.4922821701763, m5_rate=136.5136219818052, m15_rate=81.73827938617151, mean_rate=197.1284478914709, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-unregister, count=449, min=10.878976, max=2466.250751, mean=190.00389787082403, stddev=380.5671338872364, p50=51.118079, p75=135.266303, p95=1010.827263, p98=1702.887423, p99=1912.602623, p999=2466.250751, m1_rate=9.012783767745082, m5_rate=5.543710748795059, m15_rate=4.918174526687269, mean_rate=19.889486577715797, rate_unit=events/second, duration_unit=milliseconds
Analysis:
- dispatch takes a really long time and negatively impacts all other operations
- the channel pool was undersized (contention to get a channel)
I tried the following:
- https://issues.apache.org/jira/browse/JAMES-3747 reactive implementation for the RabbitMQ channel pool.
Better reactive code but not a game changer, to be honest.
- Shorter routing key (don't include the full FQDN) -> small performance gains...
- Disable publish confirms: game changer! Dispatch went from a mean of 50ms+ and a p99 of 500ms+ to a mean of 1ms and a p99 of 8ms. All other metrics (bind / unbind) improved as well, and contention to acquire a channel is effectively gone.
- Turning off durability on notification channels unlocked further gains.
name=rabbit-acquire, count=1380387, min=0.005824, max=132.120575, mean=0.120752084973272, stddev=0.47552513438388405, p50=0.056831, p75=0.096767, p95=0.354303, p98=0.692223, p99=1.122303, p999=4.915199, m1_rate=1.6804637345701686E-238, m5_rate=9.96453889901133E-47, m15_rate=9.66533044449863E-15, mean_rate=32.71739356870932, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-dispatch, count=757489, min=0.063232, max=245.366783, mean=0.6006688857527964, stddev=1.0950763726058712, p50=0.456703, p75=0.610303, p95=1.310719, p98=2.064383, p99=2.949119, p999=9.764863, m1_rate=9.8844762264434E-239, m5_rate=5.952316454575153E-47, m15_rate=5.761608528831366E-15, mean_rate=17.954172496560656, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-register, count=18810, min=3.6864, max=1317.011455, mean=21.051024209250397, stddev=41.433421890807836, p50=12.058623, p75=19.529727, p95=66.322431, p98=106.954751, p99=155.189247, p999=507.510783, m1_rate=1.5731326523774949E-260, m5_rate=2.4205436310646514E-54, m15_rate=3.643125827717571E-18, mean_rate=0.4463666568673792, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-release, count=1380385, min=6.6E-4, max=131.596287, mean=0.00816027294848901, stddev=0.12039944630360619, p50=0.006143, p75=0.009087, p95=0.013183, p98=0.021759, p99=0.034047, p999=0.230399, m1_rate=1.6486782760303405E-238, m5_rate=9.925571881590765E-47, m15_rate=9.652806233445519E-15, mean_rate=32.71825157947973, rate_unit=events/second, duration_unit=milliseconds
name=rabbit-unregister, count=18810, min=3.11296, max=1761.607679, mean=28.082487391812865, stddev=64.54031379534226, p50=12.582911, p75=20.971519, p95=100.139007, p98=192.937983, p99=287.309823, p999=805.306367, m1_rate=7.66299030904417E-260, m5_rate=1.8971818595887257E-53, m15_rate=8.637433599041584E-18, mean_rate=0.44693388443373244, rate_unit=events/second, duration_unit=milliseconds
I was able to double the request count in my IMAP benches and still get a 3-fold latency reduction.
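A back-of-the-envelope check of why channel contention goes away, using Little's law (average channels busy = event rate x mean hold time) on the rabbit-dispatch figures above. This is only a rough sketch: it assumes the dispatch timer roughly matches the time a channel is held, and it reuses the first run's m1_rate for both cases so that only the latency changes. The class and method names are made up for the illustration.

```java
public class ChannelOccupancy {
    // Little's law: average number of busy channels = arrival rate * mean hold time.
    static double busyChannels(double eventsPerSecond, double meanHoldMillis) {
        return eventsPerSecond * meanHoldMillis / 1000.0;
    }

    public static void main(String[] args) {
        double dispatchRate = 109.28;        // rabbit-dispatch m1_rate, first run (events/second)
        double meanWithConfirms = 54.42;     // rabbit-dispatch mean, first run (ms)
        double meanWithoutConfirms = 0.60;   // rabbit-dispatch mean, second run (ms)

        // Assumption: same event rate in both cases, only the hold time differs.
        System.out.printf("with confirms: %.2f channels busy on average%n",
            busyChannels(dispatchRate, meanWithConfirms));
        System.out.printf("without confirms: %.2f channels busy on average%n",
            busyChannels(dispatchRate, meanWithoutConfirms));
    }
}
```

With these numbers, dispatch alone keeps several channels busy on average when confirms are on, and far less than one when they are off. Tail latencies (p99 and above) make the real peak demand much higher than the average, which would fit with the contention seen on rabbit-acquire in the first run.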
Proposals
- Offer an option to disable publish confirms. This new James 3.7.0 behaviour brings cool resiliency semantics but is definitely harmful for scalability. We can imagine some users wanting to turn that off.
- Offer a way to turn off durability on notifications. Notifications are likely not critical, and losing some is acceptable.
- Add those cool RabbitMQ metrics.
And of course, invest in an alternative to RabbitMQ that does not force us to choose between throughput and safety. Thoughts: Pulsar.
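The first two proposals could be exposed as two boolean settings. Here is a minimal sketch of reading them from a properties file, assuming both default to true so that an unchanged configuration keeps today's behaviour (publish confirms on, durable notifications). The property names and the EventBusOptions class are illustrative choices for this sketch, not something this ticket defines.

```java
import java.util.Properties;

public class EventBusOptions {
    // Illustrative property names, chosen for this sketch only.
    static final String PUBLISH_CONFIRM = "event.bus.publish.confirm.enabled";
    static final String NOTIFICATION_DURABILITY = "event.bus.notification.durability.enabled";

    final boolean publishConfirmEnabled;
    final boolean notificationDurabilityEnabled;

    EventBusOptions(boolean publishConfirmEnabled, boolean notificationDurabilityEnabled) {
        this.publishConfirmEnabled = publishConfirmEnabled;
        this.notificationDurabilityEnabled = notificationDurabilityEnabled;
    }

    // Missing keys fall back to "true", i.e. the safe behaviour stays the default.
    static EventBusOptions from(Properties properties) {
        return new EventBusOptions(
            Boolean.parseBoolean(properties.getProperty(PUBLISH_CONFIRM, "true")),
            Boolean.parseBoolean(properties.getProperty(NOTIFICATION_DURABILITY, "true")));
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        // A user trading resiliency for throughput opts out of publish confirms only.
        properties.setProperty(PUBLISH_CONFIRM, "false");

        EventBusOptions options = from(properties);
        System.out.println("publish confirms: " + options.publishConfirmEnabled);
        System.out.println("durable notifications: " + options.notificationDurabilityEnabled);
    }
}
```

Keeping the defaults on the safe side means only users who explicitly opt out accept the risk of lost events.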