SA Bugzilla – Bug 4590
prefork: syswrite(6) failed
Last modified: 2007-04-03 11:40:15 UTC
Getting this error in maillog: prefork: syswrite(6) failed, retrying... at / usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 554. Then all incoming mail doesn't get processed or delivered.
I'm seeing the same issue. Things work for a while then it'll start failing and we have to restart spamd.
Same thing just happened on my Debian etch system here: Oct 28 01:26:06 server spamd[15848]: prefork: syswrite(6) failed, retrying... at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 554. Oct 28 01:26:16 server last message repeated 2 times
Same thing here on RHEL3: Nov 4 11:19:45 griffin spamd[28185]: prefork: syswrite(18) failed, retrying... at +/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 554. Nov 4 11:20:20 griffin last message repeated 7 time Cannot connect to spamd once this message has appeared. Restarting gets rid of the messages for a while. The options we're passing to spamd are: SPAMDOPTIONS="-d -i 0.0.0.0 -H /tmp -A 193.128.242.5 -A 193.128.240.22 -A 193.128.240.33 -A 193.128.240.7 -A 193.128.242.4 -A 193.128.240.9 -A 127.0.0.1 - -max-children=25 --max-conn-per-child=200 --min-spare=10"
This is not happening on the RHEL4 box here that is also running SA.
Same here - using the port on FreeBSD 4.7, SA 3.1.0 Forking frustrating! error: usage: $io->syswrite(BUF, LEN [, OFFSET]) at /usr/local/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 544
Same here - Red Hat 9 / SpamAssassin 3.1.0. Problem began immediately after upgrade to 3.1.0. Nov 29 09:59:07 warfield spamd[25460]: prefork: syswrite(8) failed, retrying... at /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 554. spamd options are: /usr/bin/spamd -d -u exim --min-children 1 --max-children 10 --min-spare 1 --max-spare 3 -H --socketpath=/var/run/spamd.sock
*** Bug 4666 has been marked as a duplicate of this bug. ***
1. use the "--round-robin" switch to avoid this; it appears to be load-related, and at that level of load, round-robin is marginally more efficient anyway. 2. if anyone can provide strace or truss output from *both* the parent and child process when this occurs, I'd be very grateful. haven't been able to repro it here.
Here's a prefork log and a strace of both parent and children: http://cyan.dostech.net/sa-bug4590/prefork.log.syswrite7 http://cyan.dostech.net/sa-bug4590/spamd-strace.syswrite7.trimmed The busy child was a child that timed out an alarm on an open (WHOIS, "whois $domain|"); Let me know if you need more of the strace log or the maillog.
Daryl -- thanks! more info would def be welcome. what was the PID of the busy child? 30625? can you post the entire strace somewhere? where did the whois stuff come from? another plugin?
OK, I've been looking at http://cyan.dostech.net/sa-bug4590/spamd-strace.syswrite7.trimmed . for what it's worth, here's what I can tell so far. It just doesn't make much sense. PIDS: 15882: the spamd parent 19783: a child PID that continues working OK 30625: a child PID that stops processing mail and gets hung FDS: 15882 fd 4 => syslog 15882 fd 7 => control channel for 30625, attached to fd 8 in that process 15882 fd 8 => control channel for 19783, fd 9 ANNOTATED STRACE: child, waiting for events: 30625 00:02:31.541351 select(16, [8], NULL, NULL, {300, 0} <unfinished ...> everything goes well until (event: linenumber): parent, last good write: 12070 15882 00:05:03.547505 write(7, "P....\n", 6 <unfinished ...> 15882 00:05:03.547564 <... write resumed> ) = 6 The "P....\n" data is a ping, to check to ensure that the child can process mails. The child doesn't need to respond to this, it's a one-way operation. child, last successful read: 12070 - 12094 After that write from the parent, the child wakes up from its select(), reads the ping, and goes back to another select(): 30625 00:05:03.547950 <... select resumed> ) = 1 (in [8], left {148, 0}) 15882 00:05:03.548020 rt_sigprocmask(SIG_BLOCK, ~[], <unfinished ...> 30625 00:05:03.548132 read(8, <unfinished ...> 15882 00:05:03.548168 <... rt_sigprocmask resumed> [33], 8) = 0 30625 00:05:03.548265 <... read resumed> "P....\n", 6) = 6 15882 00:05:03.548334 rt_sigprocmask(SIG_BLOCK, [ALRM], <unfinished ...> 30625 00:05:03.548440 read(8, <unfinished ...> 15882 00:05:03.548461 <... rt_sigprocmask resumed> ~[KILL STOP], 8) = 0 30625 00:05:03.548505 <... read resumed> 0x9f81dc0, 6) = -1 EAGAIN (Resource temporarily unavailable) 15882 00:05:03.548529 rt_sigaction(SIGALRM, {0x80a7be8, [], SA_RESTORER, 0x420276f8}, <unfinished ...> 30625 00:05:03.548591 time( <unfinished ...> 15882 00:05:03.548609 <... rt_sigaction resumed> {0x80a6dcc, [], SA_RESTORER, 0x420276f8}, 8) = 0 30625 00:05:03.548700 <... time resumed> NULL) = 1132463103 15882 00:05:03.548720 rt_sigprocmask(SIG_SETMASK, ~[KILL STOP], <unfinished ...> 30625 00:05:03.548792 select(16, [8], NULL, NULL, {300, 0} <unfinished ...> and this is where the problem happens, somewhere between line 12094 and 12854. parent, first failed write: 12854 15882 00:07:35.737337 write(7, "P....\n", 6) = -1 EAGAIN (Resource temporarily unavailable) parent, second failed write: 12874 15882 00:07:41.834433 write(7, "P....\n", 6) = -1 EAGAIN (Resource temporarily unavailable) (many more attempts 5 seconds apart, all fail) child, failed read: 13338 - 13351 30625 00:10:03.561013 <... select resumed> ) = 0 (Timeout) 30625 00:10:03.606427 read(8, 0x9f81dc0, 6) = -1 EAGAIN (Resource temporarily unavailable) 30625 00:10:03.614481 time(NULL) = 1132463403 30625 00:10:03.635766 select(16, [8], NULL, NULL, {1, 0}) = 0 (Timeout) 30625 00:10:04.631238 read(8, 0x9f81dc0, 6) = -1 EAGAIN (Resource temporarily unavailable) 30625 00:10:04.631361 time(NULL) = 1132463404 30625 00:10:04.818517 rt_sigprocmask(SIG_BLOCK, [PIPE], [33], 8) = 0 30625 00:10:04.822736 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0 30625 00:10:04.822854 rt_sigprocmask(SIG_SETMASK, [33], NULL, 8) = 0 30625 00:10:04.822906 rt_sigprocmask(SIG_BLOCK, [PIPE], [33], 8) = 0 30625 00:10:04.822962 rt_sigaction(SIGPIPE, {0x80a7be8, [], SA_RESTORER, 0x420276f8}, {SIG_DFL}, 8) = 0 30625 00:10:04.823016 rt_sigprocmask(SIG_SETMASK, [33], NULL, 8) = 0 30625 00:10:04.945869 select(8, [4], NULL, [4], {0, 0}) = 0 (Timeout) 30625 00:10:04.953285 write(4, "<20>spamd[30625]: prefork: sysre"..., 149) = 149 and it carries on with both sides retrying and failing, until end of strace. So the problem happens somewhere between line 12094 and 12854. However, there's absolutely nothing of note in those lines -- just the usual idle signal-twiddling and 2-second sleeping. If anyone can spot anything, I'd be much obliged! I'm stumped ;)
Created attachment 3284 [details] workaround patch This patch adds some workaround code; instead of retrying indefinitely, it'll retry 3 times then kill the child and allow it to respawn. It's in trunk as r350101. I think the bug will still occur -- it should be a lot less serious with this. If anyone has any ideas, I'm all ears. ;)
Yeah, the whois thing is in a plugin... my $oldalarm = 0; eval { local $SIG{ALRM} = sub { die "__alarm__\n" }; $oldalarm = alarm 5; open (WHOIS, "whois $domain|"); while (<WHOIS>) { $lines++; ... match stuff ... } close (WHOIS); alarm $oldalarm; }; ... which will leave a whois zombie process if it times out. Not sure how I missed that when I wrote it. The busy child was 12301. After the alarm on the whois call, it cotinues and returns a scanned message. The next message it accepts it doesn't scan... "processing message" is the last line maillog for the 12301 child: Nov 19 11:42:51 cyan spamd[15882]: prefork: child states: II Nov 19 11:44:42 cyan spamd[12301]: spamd: connection from smtp.film-tech.net [66.98.221.156] at port 2285 Nov 19 11:44:42 cyan spamd[12301]: spamd: processing message <0563657271.7475497925@mulespub.com> for dcaldwell@domain.com:501 Nov 19 11:44:43 cyan dccproc[17048]: open(/var/dcc/map): Permission denied Nov 19 11:44:49 cyan spamd[12301]: __alarm__ Nov 19 11:44:49 cyan spamd[12301]: __alarm__ Nov 19 11:44:49 cyan spamd[12301]: spamd: identified spam (9.7/5.0) for dcaldwell@domain.com:501 in 7.2 seconds, 1716 bytes. Nov 19 11:44:49 cyan spamd[12301]: spamd: result: Y 9 - BAYES_99,RATWARE_RCVD_PF,URIBL_BLACK scantime=7.2,size=1716,user=dcaldwell@domain.com,uid=501,requ ired_score=5.0,rhost=smtp.film-tech.net,raddr=66.98.221.156,rport=2285,mid=<0563657271.7475497925@mulespub.com>,bayes=0.993899328941404,autolearn=no Nov 19 11:44:49 cyan spamd[15882]: prefork: child states: II Nov 19 11:45:20 cyan spamd[12301]: spamd: connection from smtp.film-tech.net [66.98.221.156] at port 2298 Nov 19 11:45:20 cyan spamd[12301]: spamd: processing message <BFA4B2BD.24969%bart@videofest.org> for jspracklen@domain.com:501 The "dccproc[17048]: open(/var/dcc/map): Permission denied" started happening when I started strace'ing spamd... haven't got around to finding out why. In the case of 30625 (going from memory, I meant to post this the next morning and forgot) I believe it became a zombie process (probably after killing the parent -- I can't remember). There's nothing abnormal in the maillog about the last messages it processed. It successfully returned its last scanned message, but way before the pings started failing, at Nov 19 16:44:12. The prefork log idetifies it (30625) as IDLE the entire time (from 16:44:12 until the last prefork child state log message at 23:58:03).
Complete strace (all 647M of fun, compressed to 76M): http://cyan.dostech.net/sa-bug4590/spamd-strace.syswrite7.gz Would you prefer I disable my whois plugin and run as is, or additionally apply the patch to 3.1.0?
I'm not sure what's up with that strace, then -- child 12301 never appears... I wouldn't worry about it -- I don't think there'll be any useful data in the 640MB strace; I can't see any myself. feel free to rm it.
I'm confused... I see thousands of lines in the full strace for child 12301.
hmm, yes, I see 12301 now. Must have screwed up my searching :(
interesting. However, child 12301 has nothing to do with the "prefork: syswrite(N) failed, retrying..." bug; that child is *definitely* 30625, according to that strace.
Yeah, it's weird. 12301 is hung up at "processing message" and left marked as busy, while 30625 successfully returns its scanned message, and is marked idle, almost 7.5 hours before it stops responding to pings. BTW, do you think this and the "prefork: sysread(8) failed after 300 secs at /usr/local/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling .pm line 501." I'm seeing in bug 4476 is being caused by the same thing?
Hi, I'm a little bit confused what I have to do now to avoid this bug? Should I install the patch? Should I use --round-robin? Or whatelse can I do? Frank
Frank -- nobody has reported any results with the patch, so I have no idea :( has anyone tried the patch?
I've been running with the patch since you made it, but I've only ever seen this problem the one time. It didn't even occur when the server hit a load average of 900+ earlier in the week. Frank -- I'd try it out and, please, let us know how it goes. It looks like it should fix the problem and in my experience it doesn't cause any problems of it's own, so it should be safe to use in production.
OK I installed the patch now. Let us the what will happen. I had the problem 5 times last week. But there was now high load on the server at this time...
Created attachment 3333 [details] Logfile I still have problems, but now with a different look (see attached Logfile) In the night there where some minor problems with seemed to be catched by the patch. But now at 08:15 the filter crashed again (see second part of the log.txt) Frank
the "spamd[4339]: prefork: select returned error on server filehandle: Bad file descriptor" message I can see in my logfile from 03:48 until 08:15 until I restarted the spamd. All mailss seemed to be queued until this time. So the patch does not work correct? Frank
Created attachment 3334 [details] Logfile2 one of my other servers had also problems since this time. There was only one mail activ at this time. So this was shuer not a problem of the load of the server. Frank
I'm still running in this problem every day. So I now installed a cron job witch restart the spamd every houre. I hope this will help in a first step. Frank
(In reply to comment #12) > This patch adds some workaround code; instead of retrying indefinitely, it'll > retry 3 times then kill the child and allow it to respawn. > > It's in trunk as r350101. > > I think the bug will still occur -- it should be a lot less serious with this. > If anyone has any ideas, I'm all ears. ;) I installed this patch a couple of days ago after encountering the exact problem described in this bug. The patch seems to have fully resolved the problems that I was having, where mail was very slow to get processed. The patch seems reasonable to me, in that it averts a possible endless loop in the code.
Created attachment 3347 [details] log with patch #3284 Justin -- It looks like the patch prevented the loop but failed to "forget" the child (I haven't really looked at the code). I've got a 2GB strace log that I can extract a section out of if you want it, along with complete mail logs and strace if you're interested.
If your redo or add to the patch, you might want to check the following comment: # give it 1 second to recover. we retry indefinitely. I expect that the 'retry indefinitely' part is wrong.
Tom, thanks -- corrected now in my checkout, I'll commit it RSN.
There are some major bugfixes for the apache-style preforking subsystem in bug 4696. If you are seeing this issue, please try applying the patch in that bug, and see if it helps! (fwiw that minor comment thing is also fixed in the patch in that bug.)
OK, the fix for bug 4696 now includes this fix, too; if you're seeing this issue, go to that bug and apply the patch immediately ;)
Now that bug 4696 is fixed in the 3.1 branch, I think this can be closed as fixed.
I don't think this is fixed. Reopening and targetting at 3.1.2.
Hello, is there still anyone working on this? I got the same error again at the weekend now in Version 3.1.1. Mar 18 05:04:16 mail1 spamd[15430]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:16 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 2 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. .... Mar 18 05:04:17 mail1 spamd[15430]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:17 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 3 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:18 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 4 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. Mar 18 05:04:18 mail1 spamd[15430]: prefork: giving up at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 595. Mar 18 05:04:18 mail1 spamd[15430]: prefork: write of ping failed to 19194 fd=21: at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 330. Mar 18 05:04:18 mail1 spamd[15430]: prefork: killing failed child 19194 fd=21 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 127. Mar 18 05:04:18 mail1 spamd[15430]: prefork: killed child 19194 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 141. Mar 18 05:04:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:18 mail1 spamd[15430]: spamd: handled cleanup of child pid 19194 due to SIGCHLD Mar 18 05:04:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:18 mail1 postfix/smtpd[13451]: lost connection after DATA from unknown[60.50.129.32] Mar 18 05:04:18 mail1 postfix/smtpd[13451]: disconnect from unknown [60.50.129.32] Mar 18 05:04:19 mail1 spamd[15430]: prefork: select returned - 1! recovering: Bad file descriptor So I will install againe a cronjob witch restart the spamd every houre... Frank
Frank, sorry for the delay in looking at this, but it isn't clear from the above snippet that this actually causes a failure of the spamd parent. It only shows a screwed up child that gets killed off. Are you still seeing this? Can you provide more info?
Hi Daryl, great that you will try to take a look on this... I think the problem is still there. My cronjob restart the spamd every houre and this helps. Whrn I remeber it right, the spamd has done nothing anymore after the last entry in my logfile from Mar 18 05:04:18 and I have taken this "snip" at arround 11:00. So I switched of now again the cronjob. What should I do when the spamd died? Frank
(In reply to comment #38) > So I switched of now again the cronjob. What should I do when the spamd died? For now I'm just looking to confirm that the following is indeed the last thing logged by spamd: [60.50.129.32] Mar 18 05:04:19 mail1 spamd[15430]: prefork: select returned - 1! recovering: Bad file descriptor
Hello, that should be all I can see in the log. After Mar 18 05:04:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor I can see only this kind of line expect that what you can see below. Since I switched of my cronjob yesterday nothing happened until now. Maybe there was a change in Version 3.1.3 witch solve the problem? Before I got this problem several times a day. Frank Mar 18 05:04:02 mail1 spamd[1550]: spamd: result: Y 24 - BAYES_00,DCC_CHECK,DIGEST_MULTIPLE,ONLINE_PHARMACY,RAZOR2_CF_RANGE_51_100,RAZOR 2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_N JABL_PROXY,RCVD_IN_XBL,SPF_NEUTRAL,URIBL_JP_SURBL,URIBL_SBL,URIBL_SC_SURBL scantime=2.1,size=1078,user=filter,uid=1002,required_score=6.0,rhost=localhost, raddr=127.0.0.1,rport=37447,mid=<000301c64a39 $35ea32ea$64225a40@hmccgw>,bayes=0,autolearn=no Mar 18 05:04:02 mail1 spamd[15430]: prefork: child states: IBBBI Mar 18 05:04:16 mail1 spamd[15430]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:16 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 2 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. Mar 18 05:04:17 mail1 spamd[15430]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:17 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 3 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. Mar 18 05:04:18 mail1 spamd[15430]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 616. Mar 18 05:04:18 mail1 spamd[15430]: prefork: syswrite(21) to 19194 failed on try 4 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 593. Mar 18 05:04:18 mail1 spamd[15430]: prefork: giving up at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 595. Mar 18 05:04:18 mail1 spamd[15430]: prefork: write of ping failed to 19194 fd=21: at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 330. Mar 18 05:04:18 mail1 spamd[15430]: prefork: killing failed child 19194 fd=21 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 127. Mar 18 05:04:18 mail1 spamd[15430]: prefork: killed child 19194 at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 141. Mar 18 05:04:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:18 mail1 spamd[15430]: spamd: handled cleanup of child pid 19194 due to SIGCHLD Mar 18 05:04:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:19 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:20 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:21 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:22 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:23 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:04:24 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor ... ar 18 05:09:17 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:18 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:19 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:19 mail1 spamd[1550]: prefork: sysread(10) failed after 300 secs at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 544. Mar 18 05:09:19 mail1 spamd[20338]: prefork: sysread(7) failed after 300 secs at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 544. Mar 18 05:09:19 mail1 spamd[15430]: spamd: handled cleanup of child pid 20338 due to SIGCHLD Mar 18 05:09:19 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:19 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:20 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 05:09:21 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor ... Mar 18 06:07:29 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 06:07:30 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 06:07:30 mail1 spamc[29254]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection timed out Mar 18 06:07:30 mail1 spamc[29256]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection timed out Mar 18 06:07:30 mail1 spamc[29258]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection timed out Mar 18 06:07:31 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 06:07:32 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 06:07:33 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor Mar 18 06:07:34 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor
Created attachment 3593 [details] Part of maillog now it happened again. Here the newest extract from my logfile Frank SpamAssassin 3.1.3
Created attachment 3594 [details] DOS readable version DOS readable version. changed with unix2dos
Repeatedly same here, Debian/Linux sid: Aug 29 21:26:24 Yikes spamd[8608]: prefork: retrying syswrite(): Resource temporarily unavailable at usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 623. Aug 29 21:26:24 Yikes spamd[8608]: prefork: syswrite(8) to 1652 failed on try 2 at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 600. Aug 29 21:26:25 Yikes spamd[8608]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 623. Aug 29 21:26:25 Yikes spamd[8608]: prefork: syswrite(8) to 1652 failed on try 3 at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 600. Aug 29 21:26:26 Yikes spamd[8608]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 623. Aug 29 21:26:26 Yikes spamd[8608]: prefork: syswrite(8) to 1652 failed on try 4 at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 600. Aug 29 21:26:26 Yikes spamd[8608]: prefork: giving up at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 602. Aug 29 21:26:26 Yikes spamd[8608]: prefork: write of ping failed to 1652 fd=8: at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 333. Aug 29 21:26:26 Yikes spamd[8608]: prefork: killing failed child 1652 fd=8 at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 127. Aug 29 21:26:26 Yikes spamd[8608]: prefork: killed child 1652 at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 141. Aug 29 21:26:26 Yikes spamd[8608]: spamd: handled cleanup of child pid 1652 due to SIGCHLD Aug 29 21:26:26 Yikes spamd[8608]: prefork: select returned -1! recovering: Bad file descriptor Aug 29 21:26:27 Yikes spamd[8608]: prefork: select returned -1! recovering: Bad file descriptor Aug 29 21:26:28 Yikes spamd[8608]: prefork: select returned -1! recovering: Bad file descriptor Aug 29 21:26:29 Yikes spamd[8608]: prefork: select returned -1! recovering: Bad file descriptor Aug 29 21:26:30 Yikes spamd[8608]: prefork: select returned -1! recovering: Bad file descriptor until spamd restarted. # spamd --version SpamAssassin Server version 3.1.4 running on Perl 5.8.8 with SSL support (IO::Socket::SSL 0.998)
And happens again. Guys, please some kind soul leave a comment whether there is some planned improvement, or some workaround I could patch in. I tried brifly to examine the code, but it is so nicely modular that I would have to examine most of SA's source to come up with a hack, for which I have been lacking the time. If I guess right, and this code is in the spamd server, would it be possible to completely restart the whole mess? (It seems to be dead anyway, so kill everything and restart cleanly, if there is no better idea. How to do that, apart from die'ing and leaving extrnal restarter to take care of the restart?) If this is in a child, would it be possible to suicide and leave the system to go on? [ Some solution would be extremely nice since this is a really major pain in he ass... it just stops the whole spamd, which stalls all mailqueues, which basically stops everything smtp around, which is not nice. I hate morning phonecalls, how about you? :) ]
Re: comment #44: Peter, I've been using "--round-robin" ever since it was reported as a workaround way back in comment #8. Haven't hit this problem since, and it's been almost a year. If this is causing you grief, I say just throw that option in.
(In reply to comment #45) > Peter, I've been using "--round-robin" ever since it was reported as a workaround way back in comment Ermm, the problem is that as far as I remember RR spawns the maximal number of SA, which is pretty memory-consuming in case of high maximum. Imagine a system with average 4-5 spamd and the maximum of around 50 or more, where 50 allocates pretty extensive amount of memory (and it grows substantially over time, doesn't it for you?), which is usually very rarely needed. But you're right, maybe this is a workaround. Other workaround is attached, pretty much an "if you've got a hammer everything looks like a nail" approach. :) Thanks.
Created attachment 3700 [details] spamd restart hack Run this as root, fill in admin email and check spamd restart command at the end. It will restart SA in case of the problem. Hopefully. Not thoroughly tested, use at your own risk.
I think that the problem here is that none of the dev's can reliably reproduce this bug and none of the traces provided allow them to track down the exact cause of the problem. What is needed is for someone who can reproduce this bug to dig in and track down the cause of the bug.
The spamd restart program work great. I had 3 restarts this night. Can someone give me a hint how I can include the hostname into the informationmail? system "echo 'SA restart on" . qx("hostname") . " at " . scalar (localtime) . "' | mail $adminemail -s spamd-sux"; is that correct? is this information helpfull? mail3:/var/log # grep "spamd\[19736\]" mail Sep 28 00:15:59 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:17:12 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:17:22 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:17:49 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:18:49 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:19:03 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:19:18 mail3 spamd[19736]: prefork: child states: IBBB Sep 28 00:19:22 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:20:12 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:20:32 mail3 spamd[19736]: prefork: child states: BBBB Sep 28 00:20:32 mail3 spamd[19736]: spamd: server successfully spawned child process, pid 6176 Sep 28 00:20:32 mail3 spamd[19736]: prefork: child states: BBBIB Sep 28 00:20:35 mail3 spamd[19736]: prefork: child states: IBBIB Sep 28 00:20:37 mail3 spamd[19736]: prefork: child states: IBBIB Sep 28 00:20:38 mail3 spamd[19736]: prefork: child states: IIBIB Sep 28 00:20:38 mail3 spamd[19736]: spamd: handled cleanup of child pid 6176 due to SIGCHLD Sep 28 00:20:38 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:20:42 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:20:56 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:21:10 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:21:15 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:21:41 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:21:43 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:22:00 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:22:09 mail3 spamd[19736]: prefork: child states: IIBB Sep 28 00:22:11 mail3 spamd[19736]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Sep 28 00:22:11 mail3 spamd[19736]: prefork: syswrite(12) to 6120 failed on try 2 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 601. Sep 28 00:22:12 mail3 spamd[19736]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Sep 28 00:22:12 mail3 spamd[19736]: prefork: syswrite(12) to 6120 failed on try 3 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 601. Sep 28 00:22:13 mail3 spamd[19736]: prefork: retrying syswrite(): Resource temporarily unavailable at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 624. Sep 28 00:22:13 mail3 spamd[19736]: prefork: syswrite(12) to 6120 failed on try 4 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 601. Sep 28 00:22:13 mail3 spamd[19736]: prefork: giving up at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 603. Sep 28 00:22:13 mail3 spamd[19736]: prefork: write of ping failed to 6120 fd=12: at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 334. Sep 28 00:22:13 mail3 spamd[19736]: prefork: killing failed child 6120 fd=12 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 128. Sep 28 00:22:13 mail3 spamd[19736]: prefork: killed child 6120 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 142. Sep 28 00:22:13 mail3 spamd[19736]: prefork: select returned -1! recovering: Bad file descriptor Sep 28 00:22:13 mail3 spamd[19736]: spamd: handled cleanup of child pid 6120 due to SIGCHLD Sep 28 00:22:13 mail3 spamd[19736]: prefork: select returned -1! recovering: Bad file descriptor Sep 28 00:22:14 mail3 spamd[19736]: spamd: server killed by SIGTERM, shutting down
--debug output, or strace, would be useful I think...
related to bug 5149?
no, I don't think they're related.
I have been having this same issue on one of my boxen - SA 3.1.3 We also have clamAV installed - I don't recall this issue happening before we installed ClamAV.
Frank, Peter, and anyone else seeing messages like this: Sep 28 00:22:13 mail3 spamd[19736]: prefork: killed child 6120 at /usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 142. Sep 28 00:22:13 mail3 spamd[19736]: prefork: select returned -1! recovering: Bad file descriptor Sep 28 00:22:13 mail3 spamd[19736]: spamd: handled cleanup of child pid 6120 due to SIGCHLD Sep 28 00:22:13 mail3 spamd[19736]: prefork: select returned -1! recovering: Bad file descriptor I think you're seeing bug 5313: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5313 (sorry Kai, I think you were right after all.) Please try the patch in that bug and see if it helps.
Seeing this problem on FreeBSD 4.11, SpamAssassin 3.1.8 with PERL 5.8.5. Seems to happen at completely random times. The only common thing I can tell is that I see some of the child prefork status in "S" just after this happens. When this happens it doesn't seem to effect spamd as a whole, just the child, which I assume gets restarted? Mar 26 13:05:44 scrub1-b spamd[69396]: prefork: child states: IIIBIIIIII Mar 26 13:05:45 scrub1-b spamd[69396]: prefork: child states: BIIIIIIIII Mar 26 13:05:47 scrub1-b spamd[69396]: prefork: child states: IBBBIIIIII Mar 26 13:05:48 scrub1-b spamd[69396]: prefork: child states: IIBBIIIIII Mar 26 13:05:49 scrub1-b spamd[69396]: prefork: child states: BIIBIIIIII Mar 26 13:05:50 scrub1-b spamd[69396]: prefork: child states: BIBBBIIIII Mar 26 13:05:51 scrub1-b spamd[69396]: prefork: child states: IIBBBIIIII Mar 26 13:05:51 scrub1-b spamd[69396]: prefork: child states: IIIBBIIIII Mar 26 13:05:53 scrub1-b spamd[69396]: prefork: child states: BBIBIIIIII Mar 26 13:05:53 scrub1-b spamd[69396]: prefork: child states: BBIIIIIIII Mar 26 13:05:54 scrub1-b spamd[69396]: prefork: child states: BIBBIIIIII Mar 26 13:05:56 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIII Mar 26 13:05:56 scrub1-b spamd[69396]: prefork: child states: IBBIBIIIII Mar 26 13:05:57 scrub1-b spamd[69396]: prefork: child states: BBIIBIIIII Mar 26 13:05:58 scrub1-b spamd[69396]: prefork: child states: IBBIBIIIII Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: IBBIBIIIII Mar 26 13:05:59 scrub1-b spamd[70957]: prefork: sysread(16) failed after 300 secs at /usr/ local/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 575. Mar 26 13:05:59 scrub1-b spamd[70956]: prefork: sysread(16) failed after 300 secs at /usr/ local/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 575. Mar 26 13:05:59 scrub1-b spamd[70952]: prefork: sysread(16) failed after 300 secs at /usr/ local/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 575. Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIII Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIII Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIII Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIISS Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIIS Mar 26 13:05:59 scrub1-b spamd[69396]: prefork: child states: BBBIBIIIII Mar 26 13:06:00 scrub1-b spamd[69396]: prefork: child states: BBBBIIIIII Mar 26 13:06:00 scrub1-b spamd[69396]: prefork: child states: BIBBIIIIII Mar 26 13:06:00 scrub1-b spamd[69396]: prefork: child states: BIIBIIIIII Mar 26 13:06:02 scrub1-b spamd[69396]: prefork: child states: BBIIIIIIII
Robert, that's not this bug.
I applied the patch from bug 5313 yesterday morning. Since this time I didn't see this problem again. Until yesterday if have seen it 3 times a day and was only able to stay alive with the "spamd restart hack ". Looks good now... Frank
I'd appreciate it if anyone else running into prefork-related errors, try out the latest patch at bug 5313 and report success/failure....
OK, I'm marking this resolved, as (as far as I can tell) all the outstanding reports are actually instances of bug 5313, which is now resolved ;) reminder: Mar 18 06:07:34 mail1 spamd[15430]: prefork: select returned -1! recovering: Bad file descriptor if you are seeing that message, please go to bug 5313, not this one. in fact, any further *new* prefork bugs should probably be opened as new bugs rather than comments here, to avoid confusion...