Key: JAMES-603
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Noel J. Bergman
Reporter: Stefano Bagnara
Votes: 0
Watchers: 0
JAMES Server

Outgoing spooling stuck over old mails when more than 1000 old mails are present in outgoing.

Created: 01/Sep/06 05:21 PM   Updated: 21/Nov/07 08:31 AM
Component/s: Remote Delivery, SpoolManager & Processors
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0

Time Tracking:
Not Specified

File Attachments:
spool-fix.patch (text file, 4 kB), attached 2006-09-04 07:07 PM by Noel J. Bergman, licensed for inclusion in ASF works

Resolution Date: 08/Sep/06 05:12 PM


Description
Scenario:
Remote delivery has 6 hours for the third delaytime.
Insert into the outgoing spool 1000 messages with last_updated set to 5 hours ago and error_message 3.
Start James.
Send a new message.

The first remote delivery thread is stuck in the main accept method because getNextPendingMessage ALWAYS returns a new pending message, but none of them is ready to be processed. The bad news is that after it finishes the 1000 messages from pendingMessages, it simply calls loadPendingMessages again and retries them, without waiting.

The result is 100% CPU usage until we are able to spool the 1000 "old" messages; only then does James return to normal.
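The scenario can be modeled in a few lines. This is only a sketch: `RetryReadiness`, `PendingMail`, `isReady`, `firstReadyIndex`, and `scenario` are hypothetical names, not James classes or methods, and the readiness rule (last update plus retry delay has elapsed) is an assumption based on the description above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class RetryReadiness {

    /** Hypothetical stand-in for a spooled mail; not a James class. */
    public static final class PendingMail {
        final String name;
        final Instant lastUpdated;
        final Duration retryDelay;

        public PendingMail(String name, Instant lastUpdated, Duration retryDelay) {
            this.name = name;
            this.lastUpdated = lastUpdated;
            this.retryDelay = retryDelay;
        }
    }

    /** Assumed rule: a mail is ready once its retry delay has elapsed since the last update. */
    public static boolean isReady(PendingMail mail, Instant now) {
        return !mail.lastUpdated.plus(mail.retryDelay).isAfter(now);
    }

    /** Returns the position of the first ready mail, or -1 if none is ready. */
    public static int firstReadyIndex(List<PendingMail> mails, Instant now) {
        for (int i = 0; i < mails.size(); i++) {
            if (isReady(mails.get(i), now)) {
                return i;
            }
        }
        return -1;
    }

    /** Builds the reported scenario: 1000 old mails, then one freshly sent mail. */
    public static List<PendingMail> scenario(Instant now) {
        List<PendingMail> outgoing = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // Updated 5 hours ago, but the third delay is 6 hours: not ready yet.
            outgoing.add(new PendingMail("old-" + i,
                    now.minus(Duration.ofHours(5)), Duration.ofHours(6)));
        }
        // The new message has no retry delay and is ready immediately.
        outgoing.add(new PendingMail("new", now, Duration.ZERO));
        return outgoing;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Prints "first ready index: 1000".
        System.out.println("first ready index: " + firstReadyIndex(scenario(now), now));
    }
}
```

In this model the only deliverable mail sits behind 1000 entries that cannot be processed yet, so a loader that only ever looks at the first 1000 entries never reaches it.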


Comments, Change History, and Subversion Commits (ascending order)
Stefano Bagnara made changes - 01/Sep/06 05:22 PM
Field Original Value New Value
Priority Major [ 3 ] Blocker [ 1 ]
Stefano Bagnara added a comment - 01/Sep/06 06:15 PM
The same bug also creates this scenario:

You have a single temporarily undeliverable message in outgoing. Let's say that the next attempt is to be made in 5 hours.
The first remote delivery thread will loop over getNextPendingMessage/loadPendingMessages and will never pause to wait for the message to become available (100% CPU).
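One way to avoid spinning in this case is to compute how long the thread can sleep before any message becomes deliverable. The sketch below is illustrative only and is not taken from the attached patch; `NextAttemptWait`, `millisToWait`, and the `idleWaitMillis` fallback are hypothetical names.

```java
public class NextAttemptWait {

    /**
     * Returns how long a delivery thread could wait before the earliest
     * pending message becomes deliverable.
     *
     * @param readyAtMillis  time (epoch millis) at which each pending message is due
     * @param nowMillis      current time in epoch millis
     * @param idleWaitMillis fallback wait when nothing is pending (an assumption)
     */
    public static long millisToWait(long[] readyAtMillis, long nowMillis, long idleWaitMillis) {
        if (readyAtMillis.length == 0) {
            return idleWaitMillis;
        }
        long earliest = Long.MAX_VALUE;
        for (long readyAt : readyAtMillis) {
            earliest = Math.min(earliest, readyAt);
        }
        // Never return a negative wait: an overdue message means "do not wait".
        return Math.max(0L, earliest - nowMillis);
    }

    public static void main(String[] args) {
        long now = 0L;
        long fiveHours = 5L * 60 * 60 * 1000;
        // A single message due in 5 hours: prints 18000000.
        System.out.println(millisToWait(new long[] {now + fiveHours}, now, 1000L));
    }
}
```

A thread would pass the result to a timed wait on the repository's monitor so that a newly stored message can still wake it early; that wiring is omitted here.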


Stefano Bagnara added a comment - 03/Sep/06 03:39 PM
The worst scenario: everything is stuck and no new mail is accepted.

I have already described how the outgoing spool can end up locked with every outgoing thread waiting to obtain the lock.
Now I have experienced something worse, and I think I understand why:
I have 10 spool threads and 10 SMTP workers.
I have 9 emails in the spool to be remotely delivered.
9 of the 10 spool threads lock the 9 emails from the spool and start waiting to lock the outgoing repository.
The 10th spool thread starts an infinite loop over the accept of the main spool because it finds 9 mails but cannot lock them, since they are being processed by the other threads, so it keeps an infinite lock on the main spool. (This happens because loadPendingMessages takes more than 1 second, maybe because the server is already stressing the db with the outgoing thread looping in the accept.)
The first 10 incoming SMTP connections will get stuck trying to acquire the lock on the main spool to store their messages, and you are under DoS.

I clearly remember users reporting similar scenarios on the mailing list over the past months/years, and maybe we have finally found the problem.

So this bug also affects the main spool, although more rarely, because mails in the main spool are always acceptable if they are not locked. It happens only when all the available messages are locked and the accept query takes more than 1 second. But it does happen: I saw it, and I have the thread dump if anyone wants to look at it.

Noel J. Bergman added a comment - 04/Sep/06 07:07 PM
Attached is a first pass at an attempt to fix. I have reviewed it, and provided it directly to Stefano to review and try.

I have NOT tried it on my system, yet, and won't put it on my production server for a while because I am trying to keep the current process going to verify the lack of memory leaks. But I want to get this out for review ASAP, before I have time to do some bench testing.

Noel J. Bergman made changes - 04/Sep/06 07:07 PM
Attachment spool-fix.patch [ 12340159 ]
Noel J. Bergman made changes - 04/Sep/06 07:27 PM
Assignee Noel J. Bergman [ noel ]
Noel J. Bergman made changes - 04/Sep/06 07:27 PM
Status Open [ 1 ] In Progress [ 3 ]
Subversion commit:
Repository: ASF
Revision: #440612
Date: Wed Sep 06 04:38:23 UTC 2006
User: noel
Message: JAMES-603. The salient change is that we push the filter all the way down to the code that processes the ResultSet, and we don't load messages into the cache that aren't accepted by the filter. Unfortunately, we can no longer naively call setMaxRows, since we don't know how many rows we might have to process in order to get to even ANY valid messages, so we'll have to trust the JDBC driver to use cursors properly, rather than buffer a potentially huge ResultSet in memory.
Files Changed:
ADD /james/server/branches/v2.3/src/java/org/apache/james/mailrepository/JDBCSpoolRepository.java (from /james/server/trunk/src/java/org/apache/james/mailrepository/JDBCSpoolRepository.java)
MODIFY /james/server/trunk/src/java/org/apache/james/mailrepository/JDBCSpoolRepository.java
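The difference between limiting rows before filtering and filtering while reading rows can be sketched as follows. This is not the code from revision 440612: `FilteredLoad`, `capThenFilter`, and `filterThenCap` are hypothetical names, a plain `Iterator` stands in for the JDBC `ResultSet`, and capping the cache after filtering is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class FilteredLoad {

    /** Old shape: read at most maxRows rows (as setMaxRows would), then filter them. */
    public static <T> List<T> capThenFilter(Iterator<T> rows, Predicate<T> accept, int maxRows) {
        List<T> cache = new ArrayList<>();
        int read = 0;
        while (rows.hasNext() && read < maxRows) {
            T row = rows.next();
            read++;
            if (accept.test(row)) {
                cache.add(row);
            }
        }
        return cache;
    }

    /** New shape: filter each row as it is read; only accepted rows count toward the cap. */
    public static <T> List<T> filterThenCap(Iterator<T> rows, Predicate<T> accept, int maxCached) {
        List<T> cache = new ArrayList<>();
        while (rows.hasNext() && cache.size() < maxCached) {
            T row = rows.next();
            if (accept.test(row)) {
                cache.add(row);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Rows 0..999 model mails that are not ready; row 1000 models the new mail.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i <= 1000; i++) {
            rows.add(i);
        }
        Predicate<Integer> ready = id -> id >= 1000;
        System.out.println(capThenFilter(rows.iterator(), ready, 1000)); // prints []
        System.out.println(filterThenCap(rows.iterator(), ready, 1000)); // prints [1000]
    }
}
```

The second form may have to scan an unbounded number of rows before it finds an accepted one, which is why the commit message relies on the JDBC driver streaming rows through a cursor rather than buffering the whole result.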

Norman Maurer added a comment - 06/Sep/06 03:54 PM
Today I had the same problem here. We had about 1300 messages in the queue. After removing the errors, everything went fine again.

Stefano Bagnara added a comment - 08/Sep/06 05:12 PM
I just tested this in production and everything seems ok.

Stefano Bagnara made changes - 08/Sep/06 05:12 PM
Status In Progress [ 3 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Danny Angus added a comment - 21/Nov/07 08:31 AM
Closing issue: fixed in released version.

Danny Angus made changes - 21/Nov/07 08:31 AM
Status Resolved [ 5 ] Closed [ 6 ]