  1. James Server
  2. JAMES-3955

James sometimes stops consuming RabbitMQ queues

Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: rabbitmq

    Description

      We have sometimes had trouble with RabbitMQ in some production environments: James would stop consuming some queues (such as the mail queue), we never really understood why, and we would simply restart James.

      Recently I hit a similar issue, this time with the TaskManagerWorkQueue, and we managed to reproduce it manually. We have a task that runs at night and can take a long time to complete. With some other tasks scheduled after it, we could observe the following pattern:

      While James executes the heavy task, the other tasks pile up in the TaskManagerWorkQueue. They are delivered to James but stay unacked, meaning James has told RabbitMQ that it will process them later (James executes one task at a time). However, 30 minutes after the first item became unacked, we could see James stop consuming the queue, and all items went back to the ready state.

      After looking into the RabbitMQ configuration, I found this: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

      RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel exception when it detects that a delivered item (here, the first unacked one) has not been acknowledged within 30 minutes. This matches what we observed.
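
      To make the detection concrete, here is a minimal sketch of how a close reason could be classified as an acknowledgement timeout. `AckTimeoutDetector` is a hypothetical helper, not existing James code. It assumes the reply code and reply text of the channel close are already at hand (with the RabbitMQ Java client they would typically come from the `AMQP.Channel.Close` reason of a `ShutdownSignalException`). It also assumes the reply text mentions "timed out"; the exact wording differs between RabbitMQ versions.

```java
import java.util.Locale;

/** Hypothetical helper: not part of James or of the RabbitMQ Java client. */
final class AckTimeoutDetector {
    // AMQP 0-9-1 reply code for PRECONDITION_FAILED.
    static final int PRECONDITION_FAILED = 406;

    private AckTimeoutDetector() {
    }

    /**
     * Returns true when a channel close looks like a delivery acknowledgement timeout.
     * Assumption: the broker's reply text contains "timed out"; other
     * PRECONDITION_FAILED causes (e.g. inequivalent queue arguments) do not.
     */
    static boolean isAckTimeout(int replyCode, String replyText) {
        if (replyCode != PRECONDITION_FAILED || replyText == null) {
            return false;
        }
        return replyText.toLowerCase(Locale.ROOT).contains("timed out");
    }
}
```

      A shutdown listener on the channel could call this helper to decide whether to log the closure at error level.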

      From this, I guess that when we had a similar issue with the mail queue, James may have failed to process a message properly, or failed to acknowledge it for some reason, and RabbitMQ closed the channel. I assume this timeout exists to keep messages from getting stuck when a consumer fails to ack them.

      There are some actions we can take to prevent this:

      • add error logs when the channel gets closed by such an exception
      • try to re-open the channel and resume consuming when such an exception occurs
      • apply both at least to the important queues: task manager queue, mail queue, event bus
      • potentially audit whether there are cases where we neither ack nor nack a message
      • make it possible to increase the consumer timeout of the above queues with the `x-consumer-timeout` queue argument (requires RabbitMQ 3.12 or later)
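
      As a rough illustration of the reconnect and `x-consumer-timeout` items above, the sketch below uses only the JDK. `ConsumerRecoverySketch`, `withConsumerTimeout` and `resubscribeWithBackoff` are hypothetical names, and the `resubscribe` callback stands in for whatever James would do to re-open the channel and register the consumer again. The timeout value is expressed in milliseconds. Note that an already declared queue cannot simply be redeclared with different arguments, so existing queues would likely need a migration or a policy instead.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: not existing James code. */
final class ConsumerRecoverySketch {
    private ConsumerRecoverySketch() {
    }

    /** Copies the base queue arguments and adds a per-queue consumer timeout in milliseconds. */
    static Map<String, Object> withConsumerTimeout(Map<String, Object> baseArguments, Duration timeout) {
        Map<String, Object> arguments = new HashMap<>(baseArguments);
        arguments.put("x-consumer-timeout", timeout.toMillis());
        return arguments;
    }

    /**
     * Calls resubscribe until it reports success or maxAttempts is reached,
     * doubling the delay between attempts.
     * Returns the number of attempts used, or -1 if none succeeded.
     */
    static int resubscribeWithBackoff(BooleanSupplier resubscribe, int maxAttempts, Duration initialDelay) {
        long delayMillis = initialDelay.toMillis();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (resubscribe.getAsBoolean()) {
                return attempt;
            }
            System.err.printf("Consumer re-subscription attempt %d/%d failed%n", attempt, maxAttempts);
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return -1;
                }
                delayMillis *= 2;
            }
        }
        return -1;
    }
}
```

      The resulting map would be passed as the arguments of the queue declaration, and the retry loop would be triggered from the channel shutdown handling.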

      For now, we can also increase that timeout in rabbitmq.conf (the `consumer_timeout` setting, in milliseconds) to reduce the problem.

          People

            Assignee: Unassigned
            Reporter: René Cordier (rcordier)

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 5h 10m