SA Bugzilla – Bug 5313
spamd error: 'prefork: select returned -1! recovering: Bad file descriptor'
Last modified: 2008-06-26 02:56:39 UTC
Under average load, Spamassasin displays this in the log file and stops processing spam. prefork: select returned -1! recovering: Bad file descriptor Restarting spamassassin fixes the problem.
We also see the following if it helps. spamd[92408]: prefork: ordered child 94838 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 457. spamd[92408]: prefork: killing failed child 94838 fd=60 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137. spamd[92408]: prefork: killed child 94838 spamd[92408]: prefork: child states: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBI spamd[92408]: prefork: child states: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBI
We have similar issue with spamasssin version 3.1.8 on PC centos 4 linux server when load is grater than about 10: Mon Feb 19 03:05:16 2007 [15480] warn: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647. Mon Feb 19 03:05:16 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on try 2 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Mon Feb 19 03:05:17 2007 [15480] warn: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647. Mon Feb 19 03:05:17 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on try 3 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on try 4 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: giving up at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 626. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: write of ping failed to 28793 fd=6: at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 343. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: killing failed child 28793 fd=6 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137. Mon Feb 19 03:05:18 2007 [15480] warn: prefork: killed child 28793 Mon Feb 19 03:05:18 2007 [15480] warn: prefork: select returned -1! recovering: Bad file descriptor Mon Feb 19 03:05:18 2007 [15480] info: spamd: handled cleanup of child pid 28793 due to SIGCHLD Mon Feb 19 03:05:18 2007 [15480] warn: prefork: select returned -1! recovering: Bad file descriptor Restart fix the problem.
Same problem here. Not an option to have spamassassin crashing every couple of days on a production server. Something when the load gets high triggers this bug. Had to go back to the old "--round-robin".
Can any Developer Comment on this problem? This happens on a daily basis. max children is 120. When all processes are busy and it gets more connections this problem occur. SA Proceses are still running but they do nothing.
I too have having the same problem as the op. While I do not have a solution to the referenced problem, I'm going to try hard coding $self->{overloaded} to have a value of 0 in : SpamdForkScaling.pm For me, it seems this bug pops up when all 120 of my max children were at one point concurrently processing data, but then possibly during the cleanup code is when something gets snagged. In other words, I see the bug hit *after* hitting concurrency and not during nor before. Whether this has relavance to $self->{overloaded} above I not sure, nor sure of it's implications, but going to try it. ;P ~Paul
Just a note for those waiting to get this bug resolved. It was mentioned by someone else in this thread, and I didn't even realize what he was getting at. setting --round-robin fixes the problem apparently from the other bug reports I've read. Give that a try.
Well, it doesn't "fix" the problem but it eliminates the newer child scaling method which is apparently the root of the bug. Unfortunately this means that you will see increased memory usage too.... -Joe (In reply to comment #6) > Just a note for those waiting to get this bug resolved. It was mentioned by > someone else in this thread, and I didn't even realize what he was getting at. > > setting --round-robin fixes the problem apparently from the other bug reports > I've read. Give that a try.
(In reply to comment #4) > Can any Developer Comment on this problem? This happens on a daily basis. > > max children is 120. When all processes are busy and it gets more connections > this problem occur. SA Proceses are still running but they do nothing. My only comment at this time is: do you really have about 4.2+ GB of RAM dedicated to spamd? 120 max children seems insane to me, unless you've got the RAM and probably at least 4 - 8 processor cores.
*** Bug 5149 has been marked as a duplicate of this bug. ***
It is a quad core system with 4GB of ram.
4 GB of ram for 120 children is usually pretty tight. Unless you can confirm that you've got zero swap usage, I'd back it off to 100, or even 80, children. What's your definition of "average load" (from comment #0).
I have not seen this server swap yet. This is from the logfile just before the error. prefork: child states: KKBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBIIIIIIIIIIIIIIIIIIK killing rogue child 65771, failed to write on fd 66: Broken pipe killing failed child 65771 fd=66 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137 prefork: killed child 65772 spamd: handled cleanup of child pid 65772 due to SIGCHLD prefork: select returned -1! recovering: Bad file descriptor last message repeated 119 times
Could this be related to bug 5070 ?
This is not a memory resource problem as Daryl C. W. O'Shea keeps on trying to suggest. I personally have 5 boxes currently setup as an smtp gateway. 1 x FBSD 4.11 machine, 4Gigs of ram, dual proc..... last pid: 65442; load averages: 3.10, 2.44, 3.37 210 processes: 2 running, 206 sleeping, 1 stopped, 1 zombie CPU states: 44.5% user, 0.0% nice, 8.6% system, 0.2% interrupt, 46.8% idle Mem: 537M Active, 848M Inact, 307M Wired, 79M Cache, 199M Buf, 238M Free Swap: 1024M Total, 7164K Used, 1017M Free This machine never ever runs into any problems. It's the other machines that have the problems. They are: 4 x FBSD 6.2, 4Gigs of ram, dual proc..... last pid: 27331; load averages: 1.33, 1.22, 0.96 125 processes: 2 running, 123 sleeping CPU states: 8.8% user, 0.0% nice, 2.6% system, 0.1% interrupt, 88.5% idle Mem: 231M Active, 1304M Inact, 243M Wired, 7160K Cache, 112M Buf, 1731M Free Swap: 8192M Total, 8192M Free Memory usage is light, and the problem happens regardless or load. Even with 60 concurrent processes, my setup actually happens to reach a disk limitation with high amounts of I/O, rather than consuming too much memory and swapping. This *problem* presents itself on any of the FBSD 6.2 machines, while the 4.11 machine runs without spamassassin needing to be restarted. In fact, I caught the problem in action while writing this...FBSD 6.2....here is the problem without me restarting anything...yet: last pid: 30559; load averages: 0.14, 0.79, 0.97 227 processes: 1 running, 224 sleeping, 2 zombie CPU states: 0.1% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle Mem: 509M Active, 1174M Inact, 262M Wired, 92M Cache, 112M Buf, 1479M Free Swap: 8192M Total, 20K Used, 8192M Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 487 dnslog 1 -8 0 1208K 576K piperd 2 425:26 0.00% multilog 24560 dnscache 1 76 0 4508K 3952K select 2 348:18 0.00% dnscache 479 qmaill 1 -8 0 1208K 576K piperd 0 52:10 0.00% multilog 357 root 1 76 0 1300K 840K select 0 45:02 0.00% syslogd 551 mysql 8 20 0 56852K 25300K kserel 0 43:35 0.00% mysqld 58235 qmails 1 76 0 1696K 1100K select 0 37:14 0.00% qmail-send 480 qmaill 1 -8 0 1208K 576K piperd 0 19:49 0.00% multilog 58237 qmailr 1 76 0 1328K 768K select 0 10:19 0.00% qmail-rspawn 575 easyagnt 1 4 0 6216K 5444K accept 0 6:49 0.00% perl 467 root 1 8 0 1228K 632K nanslp 0 3:35 0.00% svscan 58238 qmailq 1 -8 0 1196K 564K piperd 0 3:03 0.00% qmail-clean 27670 qmails 1 4 0 1212K 712K accept 0 2:32 0.00% tcpserver 14275 spamd 1 78 0 59936K 55356K select 2 1:23 0.00% perl5.8.8 15536 spamd 1 80 0 58488K 53764K select 2 1:21 0.00% perl5.8.8 58236 root 1 76 0 1200K 576K select 3 1:17 0.00% qmail-lspawn 552 root 1 4 0 3356K 1684K select 0 0:47 0.00% sshd 559 root 1 8 0 1328K 904K nanslp 0 0:28 0.00% cron 92332 root 1 8 0 46776K 43240K nanslp 0 0:22 0.00% perl5.8.8 2152 spamd 1 8 0 52908K 48440K exitho 3 0:16 0.00% perl5.8.8 28179 spamd 1 78 0 51600K 47704K select 2 0:13 0.00% perl5.8.8 2246 spamd 1 8 0 59632K 54208K exitho 3 0:12 0.00% perl5.8.8 45012 admin 1 76 0 6080K 2592K select 0 0:12 0.00% sshd 28125 spamd 1 80 0 51584K 47612K select 0 0:10 0.00% perl5.8.8 92361 root 1 76 0 2744K 1996K CPU0 0 0:09 0.00% top 457 root 1 76 0 1212K 696K select 0 0:08 0.00% usbd 28923 spamd 1 79 0 52096K 47984K select 2 0:06 0.00% perl5.8.8 29049 spamd 1 79 0 52344K 48316K select 2 0:06 0.00% perl5.8.8 2052 spamd 1 8 0 50372K 46548K exitho 2 0:04 0.00% perl5.8.8 29175 spamd 1 77 0 49700K 46052K select 2 0:03 0.00% perl5.8.8 1830 spamd 1 8 0 50200K 46116K exitho 0 0:02 0.00% perl5.8.8 29548 spamd 1 82 0 49136K 45512K select 3 0:02 0.00% perl5.8.8 2268 spamd 1 8 0 48920K 45204K exitho 2 0:01 0.00% perl5.8.8 1511 spamd 1 8 0 47624K 43928K exitho 1 0:01 0.00% perl5.8.8 381 spamd 1 8 0 48472K 44840K exitho 1 0:00 0.00% perl5.8.8 1674 spamd 1 8 0 50024K 45904K exitho 1 0:00 0.00% perl5.8.8 385 spamd 1 8 0 47372K 43688K exitho 1 0:00 0.00% perl5.8.8 1841 spamd 1 8 0 48468K 44780K exitho 3 0:00 0.00% perl5.8.8 1641 spamd 1 8 0 47328K 43648K exitho 1 0:00 0.00% perl5.8.8 and the current smtp connections: smtp-proxy03# fgrep status /var/log/smtpd/current | tail -1 @400000004602147f2c00658c tcpserver: status: 18/120 Message from /var/log/messages: Mar 22 00:26:42 smtp-proxy03 spamd[92332]: prefork: select returned -1! recovering: Bad file descriptor Mar 22 00:27:13 smtp-proxy03 last message repeated 31 times Mar 22 00:29:14 smtp-proxy03 last message repeated 121 times After a restart of spamd, everything is back to normal. Again, take note that this happens on FBSD 6.2 only as far as I can tell and doesn't affect FBSD 4.11. I haven't tested FBSD 5.x, cause FBSD 5.x is garbage. I can provide any additional information as this happens up to something like 10 times a day. Just let me know whatever information I can provide so that this can get fixed. This is a show stopper (almost) ;) Thanks
(In reply to comment #14) > This is not a memory resource problem as Daryl C. W. O'Shea keeps on trying to > suggest. I was actually looking for evidence of a timing issue. I'm pretty sure there are still race conditions in the preforking code. Of the dozens of times people have reported problems with running a large number of children they're almost always using a GB of two of swap so it's sensible to rule out memory issues first. One question about how much RAM, and another to confirm if it's enough (because on systems I've seen with 4GB of RAM and lots of rules and plugins and user configs it is tight) isn't as if I'm hung up on it being swap thrashing or OOM problems. > Memory usage is light, and the problem happens regardless or load. Even with 60 > concurrent processes, my setup actually happens to reach a disk limitation with > high amounts of I/O, rather than consuming too much memory and swapping. This > *problem* presents itself on any of the FBSD 6.2 machines, while the 4.11 > machine runs without spamassassin needing to be restarted. That's useful info. Not surprising at all either. Now that we've got info about affected versions of FreeBSD Justin or myself can actually attempt to track this down.
(In reply to comment #15) > (In reply to comment #14) If I came off wrong, I apologize for that ;) I'm continuing to do some digging and I've filled out the following report with the BSD team: http://www.freebsd.org/cgi/query-pr.cgi?pr=110657 Obviously the changes between 4.11 and 6.2 are significant, and thus far the only real thing I can see in the cleanup code for BSD 6.2 is: Lines 173-197:sys/kern/kern_exit.c while (p->p_lock > 0) msleep(&p->p_lock, &p->p_mtx, PWAIT, "exithold", 0); Notice the "exithold"? That's the state of the processes when this happens. Not sure if the above has any relevance, but it may....just wanted to point it out... Thanks Man
Paul, does the --round-robin switch fix the issue on your machines? If so, it's definitely something in the preforking code; otherwise, I'd suspect somewhere else (like Bayes, since that's very I/O heavy).
(related: bug 5385 is a separate proposal to re-exec spamd in case of this kind of tail-spin bug.)
Wasn't there something about potential problems on multiprocessor system in Perl or SA? His problems are only showing up on the MP systems, I think.
(In reply to comment #19) > Wasn't there something about potential problems on multiprocessor system in > Perl or SA? His problems are only showing up on the MP systems, I think. SA deals with concurrency only at the kernel level, since it (and therefore its perl interpreter) are not natively using threading. so it's still up to the kernel there.
actually, it's not an MP issue. comment 14 states: '1 x FBSD 4.11 machine, 4Gigs of ram, dual proc [...] This machine never ever runs into any problems. It's the other machines that have the problems. They are: 4 x FBSD 6.2, 4Gigs of ram, dual proc [...]' so that's not it; both good and bad machines are MP.
(In reply to comment #21) Justin, I have currently changed the params on one of the machines to use --round-robin. When the machine hangs (if the machine hangs) I'll let you know. I'll let you know tonight either way. >> Wasn't there something about potential problems on multiprocessor system in >> Perl or SA? His problems are only showing up on the MP systems, I think. All machines that I have described thus far are all SMP systems. Again, the only real difference between them being the version of FBSD.
Oh Justin, and since you mentioned bayes because of heavy disk I/O, let me share my config, as it certainly isn't bayes. This config is on every machine (4.11 included): ---------------------------------------- ---------------------------------------- required_score 4.0 add_header spam Flag _YESNOCAPS_ add_header all Status _YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTS_ autolearn=_AUTOLEARN_ version=_VERSION_ report_safe 1 skip_rbl_checks 1 # Use Bayesian classifier (default: 1) # use_bayes 0 use_auto_whitelist 0 # Always score uribl really high score URIBL_BLACK 5.0 # blackberrys suck for spamassassin score LW_STOCK_SPAM4 1.0 score TVD_FW_GRAPHIC_ID3 0.5 score TVD_FW_GRAPHIC_ID1 0.5 score SARE_GIF_STOX 0.5 score PART_CID_STOCK_LESS 0.5 score FB_PENIS 0.5 score SARE_RAND_2 0.5 score TVD_FW_MESG1 0.5 score PART_CID_STOCK 1.7 score SUBJ_LIFE_INSURANCE 1.0 score SARE_SUB_TERM_LIFE 1.0 ### Once spamassassin 3.2 gets installed, remove these next 3 lines. score RATWARE_OUTLOOK_NONAME 0.5 score RATWARE_MS_HASH 0.5 score MSGID_DOLLARS 1.0 ### Seeing this get caught up a lot. Reducing score because clients are complaining. score EXTRA_MPART_TYPE 0.1
ok, so there should be little or no disk I/O from that config. wierd that you're seeing that happening!
In regards to the disk I/O, which isn't necessarily the problem ;P.....I have truss'd the children proccesses and the only things that I can see that seem "stupid" is the excessive stat'ing that it's doing. Each children currently attempt to stat the files more than once. I guess, the children don't cache the initial stat they had done. My comments within '|'. stat("servers.discovery.lst",0x28171900) ERR#2 'No such file or directory' open("/servers.discovery.lst",0x0,0666) ERR#2 'No such file or directory' stat("servers.nomination.lst",0x28171900) ERR#2 'No such file or directory' open("/servers.nomination.lst",0x0,0666) ERR#2 'No such file or directory' stat("servers.catalogue.lst",0x28171900) ERR#2 'No such file or directory' open("/servers.catalogue.lst",0x0,0666) ERR#2 'No such file or directory' | What exactly are we stat'ing here? | stat("",0xbfbfea90) ERR#2 'No such file or directory' | We just did these. Why are we stat'ing them again? | stat("servers.catalogue.lst",0x28171900) ERR#2 'No such file or directory' stat("servers.catalogue.lst",0x28171900) ERR#2 'No such file or directory' stat("servers.discovery.lst",0x28171900) ERR#2 'No such file or directory' stat("servers.discovery.lst",0x28171900) ERR#2 'No such file or directory' stat("/sbin",0xbfbfe930) = 0 (0x0) stat("/bin",0xbfbfe930) = 0 (0x0) stat("/usr/sbin",0xbfbfe930) = 0 (0x0) stat("/usr/bin",0xbfbfe930) = 0 (0x0) | Do we really need to stat /usr/games? | stat("/usr/games",0xbfbfe930) = 0 (0x0) stat("/usr/local/sbin",0xbfbfe930) = 0 (0x0) stat("/usr/local/bin",0xbfbfe930) = 0 (0x0) stat("/root/bin",0xbfbfe930) = 0 (0x0) Anyways, this is irrelevant, but did want the mention the children do repeat the above mutiple times before the child process gets destroyed. Again, I really don't want to turn this problem report into a high I/O spamassassin scenerio as that is less critical than spamassassin just stopping completely. If the above is useful, then great, otherwise disregard it altogether. Thanks Justin
I think all those stat()s should be cached anyway; but the first set is Razor, which you could try disabling to see if it makes a difference? and the second is just perl checking each dir in the $PATH.
Created attachment 3886 [details] patch BTW, please try applying this patch (should apply ok to 3.1.x). I have a feeling it may fix the bug! it also adds some additional lightweight instrumentation that may at least produce useful info in the logs to aid in debugging.
(In reply to comment #27) > Created an attachment (id=3886) [edit] > patch > > BTW, please try applying this patch (should apply ok to 3.1.x). I have a > feeling it may fix the bug! it also adds some additional lightweight > instrumentation that may at least produce useful info in the logs to aid in > debugging. Of the 4 machines experiencing this problem: 1 - Has had the patch applied 1 - Has been set to run with "--round-robin" 2 - Have been untouched. "--round-robin" is a complete memory hog (which I'm sure you know). Other than that, it's stable thus far. I will let you know the results here within the next 24-48 hours. Thanks Justin!
(In reply to comment #28) > (In reply to comment #27) > > Created an attachment (id=3886) [edit] [edit] > > patch > > > > BTW, please try applying this patch (should apply ok to 3.1.x). I have a > > feeling it may fix the bug! it also adds some additional lightweight > > instrumentation that may at least produce useful info in the logs to aid in > > debugging. > > Of the 4 machines experiencing this problem: > > 1 - Has had the patch applied > 1 - Has been set to run with "--round-robin" > 2 - Have been untouched. > > "--round-robin" is a complete memory hog (which I'm sure you know). Other than > that, it's stable thus far. > > I will let you know the results here within the next 24-48 hours. > > Thanks Justin! Summary of patch: Firstly, a snippet from the logs with your patch applied: [...snip...] Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 61509 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 113. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 221. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 IO::Socket=GLOB(0x95f9548)/18 removed from selector at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 381. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 221. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 61508 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 113. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 221. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 IO::Socket=GLOB(0x95f732c)/17 removed from selector at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 381. Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 221. Mar 23 01:08:17 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 221. [...snip...] Good news! The bug has yet to crop up and it's been well over the time period it normally takes for this bug to crop up. This particular bug, seems to be fixed with your patch applied. Certainly if I experience anything different I will append to this bug report with any further findings. In regards to round-robin, (since you had told me to try it) it too seems to fix the problem as suggested, however be aware that even with 4 gigs of memory I still at one point ate up all memory and additionally over 1g of swap. Not a good solution for high volume filters like I have setup. This just an informative message and will be changing from round-robin back to 3.1.8 w/ your patch applied in the morning. ;) The other two remaining machines that I have yet to touch will remain as such until our busiest days see the light of day with your patch applied (Monday/Tuesday). Again, if I happen to run into any problems on these given days, you'll hear from me, otherwise, it seems we are good to go with thus far. Thanks for the help.
great news so far! your plan sounds good. retitling bug to reflect the symptoms. if this patch works, we can apply to 3.2.0, probably for an rc2 tarball.
I may have spoke too soon. Mar 23 14:27:18 smtp-proxy02 spamc[60677]: connect(AF_UNIX) to spamd /tmp/.spamd.sock failed: Connection refused High loads have brought spamd down completely with your patch, which has never happened before. I have restarted the spamd process to see if this is a fluke or something else. At this point I'm not sure and will provide additional information once received.
Paul -- btw, can you post the log output leading up to the spamd crash?
well, some good news from Frank Urban, posted at bug 4590: 'I applied the patch from bug 5313 yesterday morning. Since this time I didn't see this problem [bug 4590] again. Until yesterday if have seen it 3 times a day and was only able to stay alive with the "spamd restart hack ". Looks good now...'
Justin, I think everything is fine at this point. I haven't run into the problem with spamd crapping out again. However, per your request, here is the top <50 lines prior to the crash: Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock: 00001101111011111111111111111111111111 1111111111111111111111110011010110000000000000000000000000 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam Assassin/SpamdForkScaling.pm line 147. Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock: 00001101111011111111111111111111111111 1111111111111011111111110011010110000000000000000000000000 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam Assassin/SpamdForkScaling.pm line 150. Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: killed child 41301 Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: child states: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBIII Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 41324 at /usr/local/lib/perl5/site_perl/ 5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 113. Mar 23 07:37:19 smtp-proxy02 spamd[62786]: spamd: handled cleanup of child pid 41324 due to SIGCHLD Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: oops! no socket for child 41301, killing at /usr/local/l ib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 422. Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: killing failed child 41301 fd=undefined at /usr/local/li b/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 138. Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock: 00001101111011111111111111111111111111 1111111111111011111111110011010110000000000000000000000000 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam Assassin/SpamdForkScaling.pm line 147. Mar 23 07:37:20 smtp-proxy02 spamd[62786]: Can't use an undefined value as a symbol reference at /usr/local/ lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SubProcBackChannel.pm line 109. Mar 23 07:37:23 smtp-proxy02 spamd[41271]: spamd: clean message (0.0/4.0) for qmails:58 in 17.3 seconds, 263 3 bytes. Mar 23 07:37:23 smtp-proxy02 spamd[41271]: spamd: result: . 0 - scantime=17.3,size=2633,user=qmails,uid=58,r equired_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=/tmp/.spamd.sock,mid=<10703230836.AA13165890@mail16. datapipe.com>,autolearn=disabled Mar 23 07:37:23 smtp-proxy02 spamd[41276]: spamd: identified spam (18.5/4.0) for qmails:58 in 17.7 seconds, 9518 bytes. Mar 23 07:37:23 smtp-proxy02 spamd[41276]: spamd: result: Y 18 - HTML_10_20,HTML_FONT_BIG,HTML_MESSAGE,MIME_ HTML_MOSTLY,MPART_ALT_DIFF,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,UNPARSEABLE_RELAY,U RIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL scantime=17.7,size=9518,user=qmails,uid =58,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=/tmp/.spamd.sock,mid=<2600601211.20070323123540 @greatd.com>,autolearn=disabled Mar 23 07:37:26 smtp-proxy02 spamc[41836]: connect(AF_UNIX) to spamd /tmp/.spamd.sock failed: Connection ref used
BTW, If you need me to by chance post the entire log for the day somewhere, I certainly can make those arrangements.
Created attachment 3890 [details] patch rev 2 I was able to reproduce that crash with some additional test instrumentation; it was caused by an attempt to call fileno() on an undef filehamdle. This patch is a revision that avoids it; could you try applying this in place of patch 3886?
Patch has been applied. Will keep you updated regarding any failures.
Created attachment 3891 [details] patch rev 3 urgh. When auditing that code, I spotted another bug -- instead of die()ing when writes fail persistently, it was calling write() in a freudian-slip style, which was essentially a no-op, I think. This meant that some required error-recovery code would never happen, and I think it was possible to wind up with child processes seeming to get stuck in an idle state. This patch is a replacement (again) which fixes that issue, too. I think this would be irrelevant to the current bug, but it's definitely an important fix that may be related. so guys -- could you update the test systems with *this* patch? sorry ;)
anyone run into any failures yet? ;)
Nothing yet, but will certainly report something when/if I find something.
I take it from the silence that this code seems to be working pretty well? ;) good news, if so.
OK, given that no news is good news, I've applied this code to trunk: : jm 50...; svn commit -m "bug 5313: fix spamd 'prefork: select returned -1, recovering: Bad file descriptor' error. do this by ensuring FDs are fully cleaned up (and removed from the selector bitvec) if a child is killed, and a few other related cleanups, to fully clarify the code. also, fix a case where write errors from child to parent were not being caught or logged, due to incorrect use of write() instead of die(). finally, add test instrumentation code, allowing simulation of certain reported system errors. Note: this checkin also contains additional debugging log messages" Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm Sending lib/Mail/SpamAssassin/SubProcBackChannel.pm Transmitting file data .. Committed revision 525237. : exit=0 Tue Apr 3 19:34:23 IST 2007; cd /home/jm/ftp/spamassassin : jm 51...; svn commit -m "remove bug 5313 debugging log messages" Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm Transmitting file data . Committed revision 525238. (two separate commits, so that the debugging messages can be reenabled simply by reversing commit 525238.) resolving, optimistically. Please comment if you have applied the patches and see anything odd/worrying...
Hi all [..] > so guys -- could you update the test systems with *this* patch? sorry ;) One of our spamds with this patch applied crashed last night. The log says: Apr 3 23:28:45 mxin001 spamd[75510]: spamd: result: Y 12 - DCC_CHECK,DIGEST_MULTIPLE,FROM_LOCAL_NOVOWEL,HTML_MESSAGE,MIME_HTML_ONLY,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_IN_XBL scantime=4.6,size=16618,user=info@eg***.ch,uid=0,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=57784,mid=<B09A8404.722DE83@fastmail.fm>,autolearn=disabled Apr 3 23:28:45 mxin001 spamd[75890]: spamd: identified spam (28.9/5.0) for waelchli@la***.ch:0 in 5.1 seconds, 3187 bytes. Apr 3 23:28:45 mxin001 spamd[75890]: spamd: result: Y 28 - BODY_ENHANCEMENT2,DCC_CHECK,DIGEST_MULTIPLE,DRUGS_DIET,DRUGS_ERECTILE,DRUGS_MANYKINDS,DRUGS_MUSCLE,DRUGS_PAIN,DRUG_ED_GENERIC,HG_HORMONE,HTML_MESSAGE,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,UNPARSEABLE_RELAY,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL scantime=5.1,size=3187,user=waelchli@la***.ch,uid=0,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=57786,mid=<0982154613.20070403212835@gotname.com>,autolearn=disabled Apr 3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 527. Apr 3 23:28:46 mxin001 spamd[75445]: spamd: identified spam (13.0/2.0) for katharina@be***.ch:0 in 5.5 seconds, 2425 bytes. Apr 3 23:28:46 mxin001 spamd[75445]: spamd: result: Y 12 - HTML_FONT_BIG,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,SPF_HELO_SOFTFAIL scantime=5.5,size=2425,user=katharina@be***.ch,uid=0,required_score=2.0,rhost=localhost,raddr=127.0.0.1,rport=57777,mid=<546e663w.3248955@rider.edu>,autolearn=disabled Apr 3 23:28:46 mxin001 spamd[75445]: syswrite() to parent failed: Broken pipe at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 527. If I look for such calls i find: argus mail $ bzgrep mxin001 spamd.0.bz2 | grep 'syswrite()' Apr 3 17:08:41 mxin001 spamd[30071]: syswrite() on closed filehandle GEN7 at /usr/local/lib/perl5/5.8.8/mach/IO/Handle.pm line 453. Apr 3 21:16:53 mxin001 spamd[58574]: syswrite() on closed filehandle GEN34885 at /usr/local/lib/perl5/5.8.8/mach/IO/Handle.pm line 453. Apr 3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 527. Apr 3 23:28:46 mxin001 spamd[75445]: syswrite() to parent failed: Broken pipe at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 527. The system is a FreeBSD 6.2-RELEASE with 8 Intel cores and 4 gigs of RAM. SpamAssassin is SpamAssassin version 3.1.8 running on Perl version 5.8.8 Michael
(In reply to comment #43) > Apr 3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe > at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm > line 527. Did this interfere with scanning, or did the spamd recover from the error condition OK? what should have happened is that the child spamd died, and the parent forked a new one to replace it.
(oh btw, if you could attach (as attachment) a full log leading up to those errors, that'd be cool too. I'm curious as to what could be causing them...)
(In reply to comment #44) [..] > Did this interfere with scanning, or did the spamd recover from the error > condition OK? what should have happened is that the child spamd died, and the > parent forked a new one to replace it. No, it didn't recover. This was the last message from spamd until I restarted it. > (oh btw, if you could attach (as attachment) a full log leading up to those > errors, that'd be cool too. I'm curious as to what could be causing them...) You don't really want that, it's a 1 gig log from yesterday... Besides I cannot provide it to you due to data protection regulations... If you have specific parts of interests I will sift the messages out for you. Michael
> No, it didn't recover. This was the last message from spamd until I restarted it. not so good :( Here's the thing -- I would suspect that a "broken pipe" error is indicative of the parent spamd process *already* being dead, and hence breaking the pipes it had open to the children. So it's a side-effect, not a symptom. > > (oh btw, if you could attach (as attachment) a full log leading up to those > > errors, that'd be cool too. I'm curious as to what could be causing them...) > > You don't really want that, it's a 1 gig log from yesterday... Besides I cannot > provide it to you due to data protection regulations... If you have specific > parts of interests I will sift the messages out for you. can you replace all these lines: Apr 3 23:28:46 mxin001 spamd[75445]: spamd: identified spam (13.0/2.0) for katharina@be***.ch:0 in 5.5 seconds, 2425 bytes. Apr 3 23:28:46 mxin001 spamd[75445]: spamd: result: Y 12 - HTML_FONT_BIG,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,SPF_HELO_SOFTFAIL scantime=5.5,size=2425,user=katharina@be***.ch,uid=0,required_score=2.0,rhost=localhost,raddr=127.0.0.1,rport=57777,mid=<546e663w.3248955@rider.edu>,autolearn=disabled with non-identifying obfuscation, e.g. "[identified spam line]", "[result line]", "[processing message]", and therefore be able to pass it on? basically I don't need to see the normal lines of spamd operation, just the errors plus any progress report info from the prefork subsystem, connection opens/closes, forking, that kind of thing. Alternatively, stated as a "what to sift": here's the kind of thing I want: [19766] info: spamd: connection from localhost [127.0.0.1] at port 59785 [19762] info: prefork: child states: BB [19762] info: spamd: server successfully spawned child process, pid 19784 [19762] info: prefork: server reached --max-children setting, consider raising it [19762] info: spamd: server killed by SIGTERM, shutting down [19762] info: spamd: handled cleanup of child pid 20046 due to SIGCHLD plus any errors. also, I don't need the full log, maybe the most recent 1000 lines I think... thanks ;)
(In reply to comment #47) [..] > Here's the thing -- I would suspect that a "broken pipe" error is indicative of > the parent spamd process *already* being dead, and hence breaking the pipes it > had open to the children. Thats what I think, too. [..] > also, I don't need the full log, maybe the most recent 1000 lines I think... I replaced all email-like parts. You can get the last lines from ftp://ftp.linu.gs/tmp/spamd.out.gz Michael
thanks for that log. very odd! it seems to go like this: Apr 3 23:28:32 mxin001 spamd[58574]: spamd: server successfully spawned child process, pid 75698 Apr 3 23:28:32 mxin001 spamd[75698]: spamd: connection from localhost [127.0.0.1] at port 57715 Apr 3 23:28:33 mxin001 spamd[75698]: spamd: processing message <[...snip...]> for [...snip...]:0 Apr 3 23:28:41 mxin001 spamd[75698]: spamd: identified spam (8.5/4.0) for [...snip...]:0 in 9.3 seconds, 22879 bytes. Apr 3 23:28:41 mxin001 spamd[75698]: spamd: result: Y 8 - DCC_CHECK,DIGEST_MULTIPLE,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_IN_XBL,TRACKER_ID scantime=9.3,size=22879,user=[...snip...],uid=0,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=57715,mid=<[...snip...]>,autolearn=disabled So far so good. Then the error condition from bug 4590 occurs: Apr 3 23:28:41 mxin001 spamd[58574]: prefork: ordered child 75698 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. And recovery takes place: Apr 3 23:28:41 mxin001 spamd[58574]: prefork: killing failed child 75698 fd=12 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 144. Apr 3 23:28:41 mxin001 spamd[58574]: JMD bug5313 deleting sock: 0000011011111111111111110101111110000100000000000000000000000000000000000000000000000000 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 153. Apr 3 23:28:41 mxin001 spamd[58574]: JMD bug5313 deleting sock: 0000011011110111111111110101111110000100000000000000000000000000000000000000000000000000 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 160. Apr 3 23:28:41 mxin001 spamd[58574]: prefork: killed child 75698 Apr 3 23:28:41 mxin001 spamd[58574]: JMD bug5313 child_exited 75698 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 119. Apr 3 23:28:41 mxin001 spamd[58574]: spamd: handled cleanup of child pid 75698 due to SIGCHLD That's all correct. then this: Apr 3 23:28:41 mxin001 spamd[58574]: Argument "CHLD" isn't numeric in numeric eq (==) at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 713. A total mystery! as far as I can see there's no way that this string could get into the {kids} hash that that line examines! Even if it does, it should produce a "?" in this line: Apr 3 23:28:41 mxin001 spamd[58574]: prefork: child states: BBKBBBBBBBBBBBBBBBBSBB BBB As you can see, no "?". Regardless, it should be harmless; the spamd carries on: Apr 3 23:28:41 mxin001 spamd[58574]: spamd: server successfully spawned child process, pid 75913 But that is the last line from the parent spamd (pid 58574) until the "syswrite() to parent failed: Broken pipe" errors at EOF. In other words, the spamd definitely died, silently, some time after that "successful spawn" line. Odd. I'd have to guess something's really wrong in the perl interpreter. :( I do notice that that log contains *lots* of 'ordered child 68857 to accept, but they reported state '1', killing rogue' messages. I was under the impression this was relatively rare, but you seem to get this happening every 15-20 seconds! Is it normally this frequent? If so, the next time it starts happening that frequently, any chance I could get an strace log of the spamd parent and child processes, along with a simultaneous copy of the syslog similar to this one? It'd be *very* useful to get to the bottom of this "state '1'" mystery, which I've been unable to reproduce here so far, and that might help avoid the whole situation...
> I do notice that that log contains *lots* of 'ordered child 68857 to accept, > but they reported state '1', killing rogue' messages. I was under the > impression this was relatively rare, but you seem to get this happening every > 15-20 seconds! Is it normally this frequent? If so, the next time it starts > happening that frequently, any chance I could get an strace log of the spamd > parent and child processes, along with a simultaneous copy of the syslog > similar to this one? The 'killing rogue' messages kept appearing, alhough in decreasing frequency. At the moment they have gone for all. I will keep monitoring (have to :) and supply the your rquested information as soon as it starts happening again. Spamd hasn't crashed yet until the last failure. Interestingly the other servers (same job, different type of hardware) don't show these symptoms. Thanks a lot for your work and have a great easter, Michael
patch rev 3 applied clean and all compiled fine But now, i get a different warning: 2007-04-06 04:49:12 [99625] warn: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 219. 2007-04-06 04:49:14 [99625] warn: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 219. 2007-04-06 04:49:16 [99625] warn: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 219. I'm getting the warning every two seconds, but spamd does seem to be working.
Jeremy, if you read this bug ;) you'll notice that I mentioned 'it also adds some additional lightweight instrumentation that may at least produce useful info in the logs to aid in debugging.' That's what the "JMD" noise is -- totally harmless.
thanks for the reply justin, I realized it was debugging info. I didn't realize I had children exiting every two seconds under no load. Thanks again,
> I realized it was debugging info. I didn't realize I had children exiting > every two seconds under no load. they're not. read the log again. ;) -- 2007-04-06 04:49:16 [99625] warn: JMD bug5313 child_just_exited = 0 at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 219. So every 2 seconds a log message appears noting that the boolean var "child_just_exited" is set to 0, ie. false. In other words, no child has exited. seriously, don't worry about it.
any luck, anyone? comments either way ("works for me" or not) would be welcome, just to get an idea of how well this fix is working in the field....
(In reply to comment #55) > any luck, anyone? comments either way ("works for me" or not) would be welcome, > just to get an idea of how well this fix is working in the field.... We had no more crashes for about 6 days now. It seems to work quite well. Also the messages "prefork: killing rogue child 3490, failed to write on fd" have allmost fully dispeared (Today were three of them).
I put this patch in place about a day after you submitted the final version and took the --round-robin parameter off spamd. It's been working perfectly ever since. Before we were seeing this issue about once a day, so from our standpoint the patch seems to be working well.
what does this mean? Only informational? Apr 11 19:08:14 mail1 spamd[10411]: JMD bug5313 child_just_exited = 0 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 219.
purely debugging; it'll be removed for the "finished" version of the fix. it means no child has exited, iirc.
Does anyone have an estimate when the next verision (3.2.0) will be released with this patch included? Thanks.
(In reply to comment #60) > Does anyone have an estimate when the next verision (3.2.0) will be released > with this patch included? I'm hoping a week or two... this is about the last major bug that is holding up 3.2.0 release candidates. The more "this is good" comments I get, the more likely it is to go quicker ;)
lowering priority and resolving... I think this is looking good, now, since nobody has commented that they are still running into problems with the patch applied.
I was finaly able to capture truss (strace) output while spamd ecountered 'killing rogue's again. There were several periods in the last days when this error kept appearing more frequently. Apr 18 09:12:53 mxin001 spamd[52389]: prefork: ordered child 8598 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:13:53 mxin001 spamd[52389]: prefork: ordered child 10788 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:14:05 mxin001 spamd[52389]: prefork: ordered child 11129 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:14:17 mxin001 spamd[52389]: prefork: ordered child 11507 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:14:34 mxin001 spamd[52389]: prefork: ordered child 11960 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:15:06 mxin001 spamd[52389]: prefork: ordered child 12333 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. Apr 18 09:15:19 mxin001 spamd[52389]: prefork: ordered child 13274 to accept, but they reported state '1', killing rogue at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481. I put the 3mb truss output of the parent process on ftp://ftp.linu.gs/tmp/spamd.truss.out.gz Michael
thanks Michael -- that was very useful! I have a better idea what may be happening now. Could you take a look at bug 5422: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5422 and try the patch there?
Created attachment 3929 [details] for reference; patch rev 3, redone without debugging statements and against 3.1.x HEAD