  James Server / JAMES-603

Outgoing spooling gets stuck on old mails when more than 1000 old mails are present in the outgoing repository.

    Details

      Description

      Scenario:
      The third delay time for remote delivery is 6 hours.
      Insert 1000 messages into the outgoing spool with last_updated set to 5 hours ago and error_message set to 3.
      Start James.
      Send a new message.

      The first remote delivery thread gets stuck in the main accept method, because getNextPendingMessage always returns a pending message, yet none of them is ready to be processed. Worse, after it finishes the 1000 messages from pendingMessages it simply restarts loadPendingMessages and tries them again, without waiting.

      So the CPU sits at 100% until we are able to spool the 1000 "old" messages, and only then does James return to normal.
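The failing loop can be sketched as follows. This is an illustrative model, not the actual James code: the method name `nextDelay` and its arguments are hypothetical. The point is that a fixed accept loop should compute how long to sleep before the earliest retry becomes due, instead of reloading pendingMessages and spinning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the spooler's decision (names are
// illustrative, not the real James API).
public class SpoolDelaySketch {

    // Returns 0 if some message is already deliverable, the millis until
    // the earliest retry time otherwise, or -1 if the spool is empty
    // (meaning: block until notified of a new message).
    static long nextDelay(List<Long> retryTimes, long now) {
        long earliest = Long.MAX_VALUE;
        for (long retryAt : retryTimes) {
            if (retryAt <= now) {
                return 0;                        // ready: process it right away
            }
            earliest = Math.min(earliest, retryAt);
        }
        return earliest == Long.MAX_VALUE ? -1 : earliest - now;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // Messages whose retry time is still in the future: the buggy loop
        // spins over them; the fixed loop sleeps for the gap instead.
        List<Long> notReady = Arrays.asList(now + 3_600_000L, now + 7_200_000L);
        System.out.println(nextDelay(notReady, now));                 // 3600000
        System.out.println(nextDelay(Arrays.asList(now - 1L), now));  // 0
        System.out.println(nextDelay(Arrays.<Long>asList(), now));    // -1
    }
}
```

With the 1000 "old" messages of the scenario above, this delay would be roughly the hour remaining until the 6-hour delay expires, and the delivery thread could do a timed wait for that long instead of burning 100% CPU.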

      1. spool-fix.patch
        4 kB
        Noel J. Bergman

        Activity

        bago Stefano Bagnara added a comment -

        The same bug also creates this scenario:

        You have a single temporarily undeliverable message in outgoing. Let's say the next attempt is to be made in 5 hours.
        The first remote delivery thread will loop over getNextPendingMessage/loadPendingMessages and will never stop to wait for the message to become available (100% CPU).
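The single-message case reduces to a classic busy-wait. A minimal sketch, assuming hypothetical names: the broken loop polls without pausing, while the fixed loop computes a wait budget to pass to a timed wait, so one undeliverable message costs nothing until its retry time arrives.

```java
public class BusyWaitSketch {
    // Broken pattern from the report (pseudocode): re-check immediately,
    // pinning a CPU core for hours:
    //   while (nextReady() == null) { loadPendingMessages(); }
    //
    // Fixed pattern: compute how long to wait before the next attempt and
    // pass that to a timed wait (e.g. Object.wait(millis)).
    static long waitBudget(long retryAt, long now) {
        return Math.max(0L, retryAt - now);   // millis until the retry is due
    }

    public static void main(String[] args) {
        long fiveHours = 5L * 60 * 60 * 1000;      // next attempt in 5 hours
        System.out.println(waitBudget(fiveHours, 0L)); // 18000000
        System.out.println(waitBudget(0L, 1L));        // 0: already due
    }
}
```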

        Hide
        bago Stefano Bagnara added a comment -

        The worst scenario: everything stuck and not accepting new mail.

        I already described how the outgoing spool gets locked with every outgoing thread waiting to obtain the lock.
        Now I experienced something worse, and I think I understand why:
        I have 10 spool threads and 10 SMTP workers.
        I have 9 emails in the spool to be remotely delivered.
        9 of the 10 spool threads lock the 9 emails from the spool and start waiting to lock the outgoing repository.
        The 10th spool thread starts an infinite loop over the accept of the main spool: it finds 9 mails but cannot lock them because they are being processed by the other threads, so it keeps an infinite lock over the main spool. (This happens because loadPendingMessages takes more than 1 second, perhaps because the server is already stressing the db with the outgoing thread looping in accept.)
        The first 10 incoming SMTP connections get stuck trying to acquire the lock on the main spool to store their messages, and you are under DOS.

        I clearly remember user reports on the mailing list in past months/years describing a similar scenario, and maybe we have finally found the problem.

        So this bug also affects the main spool, even if that case is rarer: mails in the main spool are always acceptable unless they are locked, so this only happens when all the available messages are locked and the accept query takes more than 1 second. But it does happen, because I saw it, and I have the thread dump if anyone wants to look at it.
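One way to avoid the main-spool starvation described above is to bound the time the accept loop may hold its locks. A minimal sketch with illustrative names (spoolLock, tryAcceptOne are not the real James API): the accept path uses a timed tryLock on each candidate message and always releases the spool lock, so SMTP workers can still store incoming mail even when every candidate is held by a delivery thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the starvation fix, not the real James code.
public class AcceptSketch {
    final ReentrantLock spoolLock = new ReentrantLock();

    // Bounded attempt: if the candidate message is locked by a delivery
    // thread, fail fast instead of spinning with the spool lock held.
    boolean tryAcceptOne(ReentrantLock messageLock, long budgetMillis)
            throws InterruptedException {
        spoolLock.lock();
        try {
            boolean got = messageLock.tryLock(budgetMillis, TimeUnit.MILLISECONDS);
            if (got) {
                messageLock.unlock(); // a real caller would process, then unlock
            }
            return got;
        } finally {
            spoolLock.unlock();       // always release so store() can proceed
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AcceptSketch s = new AcceptSketch();
        ReentrantLock busy = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        Thread delivery = new Thread(() -> {
            busy.lock();                   // a delivery thread owns this message
            held.countDown();
            try {
                Thread.sleep(300);         // ...and is busy delivering it
            } catch (InterruptedException ignored) {
            } finally {
                busy.unlock();
            }
        });
        delivery.start();
        held.await();
        System.out.println(s.tryAcceptOne(busy, 50)); // false: give up, don't spin
        delivery.join();
        System.out.println(s.tryAcceptOne(busy, 50)); // true: the lock is free now
    }
}
```

The design choice here is the time budget: a short tryLock turns the "infinite lock over the main spool" into a bounded pause, after which the spool lock is released and incoming SMTP traffic can make progress.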

        Hide
        noel Noel J. Bergman added a comment -

        Attached is a first pass at an attempt to fix. I have reviewed it, and provided it directly to Stefano to review and try.

        I have NOT tried it on my system yet, and won't put it on my production server for a while because I am trying to keep the current process going to verify the lack of memory leaks. But I want to get this out for review ASAP, even before I have time to do some bench testing.

        Hide
        norman Norman Maurer added a comment -

        Today I had the same problem here. We had about 1300 messages in the queue. After removing the errors, all went fine again.

        Hide
        bago Stefano Bagnara added a comment -

        I just tested this in production and everything seems ok.

        Hide
        danny@apache.org Danny Angus added a comment -

        Closing issue; fixed in released version.


          People

          • Assignee:
            noel Noel J. Bergman
            Reporter:
            bago Stefano Bagnara
