Description
Today, when the quorum option is enabled, only some queues are declared as quorum queues. Others, e.g. the event bus notification queues and the Task Manager's termination queues, remain classic queues.
On a James deployment that uses quorum queues on a 3-node RabbitMQ cluster, James does not tolerate the outage of a single RabbitMQ node.
I tried to reproduce what happens and here is my theory:
The RabbitMQ node that hosts the notification queues goes down
-> James cannot publish messages to RabbitMQ, which causes e.g. the IMAP SELECT, STORE, APPEND, UNSELECT ... commands to fail
-> James keeps retrying the failed publishes (the retry for group registration seems to rely on a classic queue too) and queues up other IMAP requests.
-> The IMAP server queue fills up and the exception `The IMAP server has reached its maximum capacity` is thrown.
-> The James IMAP server becomes a zombie, which leads to cascading failures.
James needs to be more fault-tolerant in this case.
I propose we use quorum queues for all queues when `quorum.queues.enable=true`, so that the queues remain available even when a RabbitMQ node is down and James keeps functioning well.
We did a POC here, and using quorum queues for all queues made James more fault tolerant, as expected.
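
To illustrate the proposal, here is a minimal sketch of how the declaration arguments for every queue could be derived from the quorum option. The class `QuorumQueueArguments`, the method `forQueue` and its parameters are hypothetical names used for illustration only, not existing James code. `x-queue-type` and `x-quorum-initial-group-size` are the queue arguments RabbitMQ uses for quorum queues.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not James code): builds the arguments passed when declaring a queue.
public class QuorumQueueArguments {

    // quorumQueuesEnabled is assumed to mirror quorum.queues.enable;
    // initialGroupSize is an assumed parameter for the number of replicas.
    public static Map<String, Object> forQueue(boolean quorumQueuesEnabled, int initialGroupSize) {
        Map<String, Object> arguments = new HashMap<>();
        if (quorumQueuesEnabled) {
            // Ask RabbitMQ for a replicated quorum queue instead of a classic queue.
            arguments.put("x-queue-type", "quorum");
            arguments.put("x-quorum-initial-group-size", initialGroupSize);
        }
        // An empty map leaves the queue as a classic queue.
        return arguments;
    }

    public static void main(String[] args) {
        System.out.println(forQueue(true, 3));
        System.out.println(forQueue(false, 3));
    }
}
```

With the RabbitMQ Java client, the resulting map would be passed as the last parameter of `channel.queueDeclare(queueName, true, false, false, arguments)`. Quorum queues have to be durable and non-exclusive, so any queue currently declared as exclusive or non-durable would also need those flags changed.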