James Server / JAMES-4027

Make all RabbitMQ queues quorum queues when the quorum option is enabled


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: eventbus, Queue, rabbitmq
    • Labels: None

    Description

      Currently, enabling the quorum option turns only some queues into quorum queues. Others, such as the event bus notification queues and the Task Manager's termination queues, remain classic queues.

      On a James deployment that uses quorum queues on a 3-node RabbitMQ cluster, James does not tolerate the outage of a single RabbitMQ node.

      I tried to reproduce the problem; here is my theory:

      The RabbitMQ node that hosts the notification queues goes down.
      -> James cannot publish messages to RabbitMQ, which causes IMAP commands such as SELECT, STORE, APPEND and UNSELECT to fail.
      -> James keeps retrying the failed publishes (the retry mechanism for Group registrations seems to rely on classic queues too) while other IMAP requests pile up in the queue.

      -> The IMAP server's request queue fills up and the exception `The IMAP server has reached its maximum capacity` is thrown.

      -> The James IMAP server becomes a zombie, causing cascading failures.

      James needs to be more fault-tolerant in this case.

      I propose making all queues quorum queues when `quorum.queues.enable=true`, so that they remain available even when a RabbitMQ node is down and James keeps functioning.

      We did a POC here, and switching every queue to a quorum queue made James more fault tolerant, as expected.
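      As a hedged illustration of the proposal, the Java sketch below shows the declaration rule it implies: when the option is on, every queue is declared as a quorum queue, regardless of how its classic declaration looked. The class `QueueDeclarationSketch`, its `QueueSpec` holder and the `declare` method are hypothetical names, not James's actual code. The only RabbitMQ facts it relies on are that a quorum queue is requested with the `x-queue-type=quorum` argument and that quorum queues must be durable, non-exclusive and not auto-delete.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of James: decides how a queue is declared
// depending on the quorum.queues.enable option.
final class QueueDeclarationSketch {

    // Holds the values that would be passed to Channel.queueDeclare(...).
    static final class QueueSpec {
        final String name;
        final boolean durable;
        final boolean exclusive;
        final boolean autoDelete;
        final Map<String, Object> arguments;

        QueueSpec(String name, boolean durable, boolean exclusive,
                  boolean autoDelete, Map<String, Object> arguments) {
            this.name = name;
            this.durable = durable;
            this.exclusive = exclusive;
            this.autoDelete = autoDelete;
            this.arguments = arguments;
        }
    }

    private QueueDeclarationSketch() {
    }

    // When quorum is enabled, every queue becomes a quorum queue, whatever
    // flags its classic declaration used. Quorum queues must be durable,
    // non-exclusive and not auto-delete, so those flags are overridden.
    static QueueSpec declare(String name, boolean quorumEnabled,
                             boolean durable, boolean exclusive, boolean autoDelete) {
        if (quorumEnabled) {
            Map<String, Object> args = new HashMap<>();
            args.put("x-queue-type", "quorum");
            return new QueueSpec(name, true, false, false,
                    Collections.unmodifiableMap(args));
        }
        // Quorum disabled: keep the classic declaration unchanged.
        return new QueueSpec(name, durable, exclusive, autoDelete,
                Collections.emptyMap());
    }
}
```

      With the RabbitMQ Java client, these fields map directly onto `Channel.queueDeclare(name, durable, exclusive, autoDelete, arguments)`. One consequence of the design: queues that relied on being exclusive or auto-delete under the classic layout lose that behaviour and would need some other cleanup mechanism. Also, RabbitMQ does not change a queue's type in place; redeclaring an existing classic queue with a different `x-queue-type` is rejected, so such queues have to be deleted or renamed first.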


            People

              Assignee: Unassigned
              Reporter: Tran Hong Quan (QuanTH)
              Votes: 0
              Watchers: 1


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h