The worst-case scenario: everything is stuck and no new mail is accepted.
I already described how the outgoing spool can end up locked, with every outgoing thread waiting to obtain the lock. Now I have experienced something worse, and I think I understand why.

I have 10 spool threads and 10 SMTP workers, and 9 emails in the spool waiting to be remotely delivered. 9 of the 10 spool threads lock the 9 emails in the spool and start waiting to lock the outgoing repository. The 10th spool thread enters an infinite loop in the accept of the main spool: it finds 9 mails but cannot lock them because they are being processed by the other threads, so it holds the lock on the main spool indefinitely. (This happens because loadPendingMessages takes more than 1 second, maybe because the server is already stressing the db with the outgoing thread looping in the accept.) The first 10 incoming SMTP connections then get stuck trying to acquire the lock on the main spool to store their messages, and you are under DoS.

I clearly remember users on the mailing list reporting similar scenarios over the past months/years, and maybe we have finally found the problem. So this bug also affects the main spool, although more rarely: mails in the main spool are always acceptable unless they are locked, so it happens only when all the available messages are locked and the accept query takes more than 1 second. But it does happen: I saw it, and I have the thread dump if anyone wants to look at it.

Attached is a first pass at an attempt to fix. I have reviewed it, and provided it directly to Stefano to review and try. I have NOT tried it on my system yet, and won't put it on my production server for a while because I am trying to keep the current process going to verify the lack of memory leaks. But I want to get this out for review ASAP, before I have time to do some bench testing.

Today I had the same problem here. We had about 1300 messages in the queue. After removing the errors, everything went fine again.
I just tested this in production and everything seems ok.
Closing issue fixed in released version.
You have a single temporarily undeliverable message in outgoing. Let's say that the next attempt is to be made in 5 hours.
The first remote delivery thread will loop over getNextPendingMessage/loadPendingMessages and will never pause to wait for the message to become available (100% CPU).
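
To illustrate the alternative to the busy loop described above, here is a minimal Java sketch of a delivery thread that works out how long it can sleep before the earliest retry is due and then waits on a monitor for that long. It is not the attached patch and not the actual spool code: PendingWait, millisUntilNextAttempt, awaitNextAttempt, wakeUp and the maxWaitMillis cap are all hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from the real spool repository.
final class PendingWait {

    private final Object monitor = new Object();

    /**
     * How long a delivery thread may sleep before re-polling the spool.
     * Returns 0 if some message is already due, otherwise the time until the
     * earliest next attempt, capped at maxWaitMillis. An empty spool yields
     * maxWaitMillis. Times are assumed to be non-negative epoch milliseconds.
     */
    static long millisUntilNextAttempt(List<Long> nextAttemptTimes, long now, long maxWaitMillis) {
        if (nextAttemptTimes.isEmpty()) {
            return maxWaitMillis;
        }
        long earliest = Long.MAX_VALUE;
        for (long t : nextAttemptTimes) {
            earliest = Math.min(earliest, t);
        }
        if (earliest <= now) {
            return 0L;
        }
        return Math.min(earliest - now, maxWaitMillis);
    }

    /**
     * Blocks instead of spinning. Returns when the delay elapses, when
     * wakeUp() is called, or on a spurious wakeup; the caller is expected
     * to re-poll the spool afterwards in every case.
     */
    void awaitNextAttempt(List<Long> nextAttemptTimes, long maxWaitMillis) throws InterruptedException {
        long delay = millisUntilNextAttempt(nextAttemptTimes, System.currentTimeMillis(), maxWaitMillis);
        if (delay > 0) { // wait(0) would block forever, so skip it
            synchronized (monitor) {
                monitor.wait(delay);
            }
        }
    }

    /** Called by whoever stores a new message, so waiting threads re-check early. */
    void wakeUp() {
        synchronized (monitor) {
            monitor.notifyAll();
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long fiveHours = 5L * 60 * 60 * 1000;
        // One message whose next attempt is in 5 hours, with a 60 s polling cap.
        System.out.println(millisUntilNextAttempt(Arrays.asList(now + fiveHours), now, 60_000L));
    }
}
```

The cap is a design choice in this sketch: with it, a thread that misses a wakeUp() call still re-polls the spool after at most maxWaitMillis instead of sleeping for the full 5 hours.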