Bug 5313 - spamd error: 'prefork: select returned -1! recovering: Bad file descriptor'
Summary: spamd error: 'prefork: select returned -1! recovering: Bad file descriptor'
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin (show other bugs)
Version: 3.1.7
Hardware: PC FreeBSD
: P5 critical
Target Milestone: 3.2.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
: 5149 (view as bug list)
Depends on:
Blocks: 4590
  Show dependency tree
 
Reported: 2007-01-26 07:39 UTC by Mike
Modified: 2008-06-26 02:56 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
patch patch None Justin Mason [HasCLA]
patch rev 2 patch None Justin Mason [HasCLA]
patch rev 3 patch None Justin Mason [HasCLA]
for reference; patch rev 3, redone without debugging statements and against 3.1.x HEAD patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Mike 2007-01-26 07:39:25 UTC
Under average load, Spamassasin displays this in the log file and stops
processing spam.
prefork: select returned -1! recovering: Bad file descriptor

Restarting spamassassin fixes the problem.
Comment 1 Mike 2007-01-26 09:34:12 UTC
We also see the following if it helps.

spamd[92408]: prefork: ordered child 94838 to accept, but they reported state '1',
killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 457.

spamd[92408]: prefork: killing failed child 94838 fd=60 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137.

spamd[92408]: prefork: killed child 94838
spamd[92408]: prefork: child states:
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBI
spamd[92408]: prefork: child states:
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBI
Comment 2 Jacob 2007-02-20 09:40:08 UTC
We have similar issue with spamasssin version 3.1.8 on PC centos 4 linux server
when load is grater than about 10:

Mon Feb 19 03:05:16 2007 [15480] warn: prefork: retrying syswrite(): Resource
temporarily unavailable at
/usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647.
Mon Feb 19 03:05:16 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on
try 2 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 624.
Mon Feb 19 03:05:17 2007 [15480] warn: prefork: retrying syswrite(): Resource
temporarily unavailable at
/usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647.
Mon Feb 19 03:05:17 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on
try 3 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 624.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: retrying syswrite(): Resource
temporarily unavailable at
/usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 647.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: syswrite(6) to 28793 failed on
try 4 at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 624.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: giving up at
/usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 626.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: write of ping failed to 28793
fd=6: at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 343.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: killing failed child 28793 fd=6
at /usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137.
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: killed child 28793
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: select returned -1! recovering:
Bad file descriptor
Mon Feb 19 03:05:18 2007 [15480] info: spamd: handled cleanup of child pid 28793
due to SIGCHLD
Mon Feb 19 03:05:18 2007 [15480] warn: prefork: select returned -1! recovering:
Bad file descriptor

Restart fix the problem.
Comment 3 Joe Richards 2007-02-23 23:10:21 UTC
Same problem here. Not an option to have spamassassin crashing every couple of
days on a production server. Something when the load gets high triggers this
bug. Had to go back to the old "--round-robin".
Comment 4 Mike 2007-02-24 07:38:02 UTC
Can any Developer Comment on this problem? This happens on a daily basis.

max children is 120. When all processes are busy and it gets more connections
this problem occur. SA Proceses are still running but they do nothing. 
Comment 5 Paul 2007-02-27 23:59:14 UTC
I too have having the same problem as the op.  While I do not have a solution to
the referenced problem, I'm going to try hard coding

$self->{overloaded} to have a value of 0 in :

SpamdForkScaling.pm

For me, it seems this bug pops up when all 120 of my max children were at one
point concurrently processing data, but then possibly during the cleanup code is
when something gets snagged.

In other words, I see the bug hit *after* hitting concurrency and not during nor
before.  Whether this has relavance to $self->{overloaded} above I not sure, nor
sure of it's implications, but going to try it.  ;P

~Paul
Comment 6 Paul 2007-02-28 00:31:47 UTC
Just a note for those waiting to get this bug resolved.  It was mentioned by
someone else in this thread, and I didn't even realize what he was getting at.

setting --round-robin fixes the problem apparently from the other bug reports
I've read.  Give that a try.
Comment 7 Joe Richards 2007-02-28 21:21:51 UTC
Well, it doesn't "fix" the problem but it eliminates the newer child scaling
method which is apparently the root of the bug. Unfortunately this means that
you will see increased memory usage too....

-Joe

(In reply to comment #6)
> Just a note for those waiting to get this bug resolved.  It was mentioned by
> someone else in this thread, and I didn't even realize what he was getting at.
> 
> setting --round-robin fixes the problem apparently from the other bug reports
> I've read.  Give that a try.

Comment 8 Daryl C. W. O'Shea 2007-03-05 12:11:33 UTC
(In reply to comment #4)
> Can any Developer Comment on this problem? This happens on a daily basis.
> 
> max children is 120. When all processes are busy and it gets more connections
> this problem occur. SA Proceses are still running but they do nothing. 

My only comment at this time is: do you really have about 4.2+ GB of RAM
dedicated to spamd?  120 max children seems insane to me, unless you've got the
RAM and probably at least 4 - 8 processor cores.
Comment 9 Daryl C. W. O'Shea 2007-03-05 12:18:41 UTC
*** Bug 5149 has been marked as a duplicate of this bug. ***
Comment 10 Mike 2007-03-19 18:42:17 UTC
It is a quad core system with 4GB of ram. 
Comment 11 Daryl C. W. O'Shea 2007-03-19 18:54:43 UTC
4 GB of ram for 120 children is usually pretty tight.  Unless you can confirm
that you've got zero swap usage, I'd back it off to 100, or even 80, children.

What's your definition of "average load" (from comment #0).
Comment 12 Mike 2007-03-20 11:43:55 UTC
I have not seen this server swap yet. This is from the logfile
just before the error.

prefork: child states: KKBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBIIIIIIIIIIIIIIIIIIK
killing rogue child 65771, failed to write on fd 66: Broken pipe
killing failed child 65771 fd=66 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 137
prefork: killed child 65772
spamd: handled cleanup of child pid 65772 due to SIGCHLD
prefork: select returned -1! recovering: Bad file descriptor
last message repeated 119 times

Comment 13 David W Noon 2007-03-21 17:14:15 UTC
Could this be related to bug 5070 ?
Comment 14 Paul 2007-03-21 22:33:47 UTC
This is not a memory resource problem as Daryl C. W. O'Shea keeps on trying to
suggest.

I personally have 5 boxes currently setup as an smtp gateway.

1 x FBSD 4.11 machine, 4Gigs of ram, dual proc.....

last pid: 65442;  load averages:  3.10,  2.44,  3.37                           
                                                                               
                  
210 processes: 2 running, 206 sleeping, 1 stopped, 1 zombie
CPU states: 44.5% user,  0.0% nice,  8.6% system,  0.2% interrupt, 46.8% idle
Mem: 537M Active, 848M Inact, 307M Wired, 79M Cache, 199M Buf, 238M Free
Swap: 1024M Total, 7164K Used, 1017M Free

This machine never ever runs into any problems.  It's the other machines that
have the problems.  They are:

4 x FBSD 6.2, 4Gigs of ram, dual proc.....

last pid: 27331;  load averages:  1.33,  1.22,  0.96                           
                                                                               
                  
125 processes: 2 running, 123 sleeping
CPU states:  8.8% user,  0.0% nice,  2.6% system,  0.1% interrupt, 88.5% idle
Mem: 231M Active, 1304M Inact, 243M Wired, 7160K Cache, 112M Buf, 1731M Free
Swap: 8192M Total, 8192M Free

Memory usage is light, and the problem happens regardless or load.  Even with 60
concurrent processes, my setup actually happens to reach a disk limitation with
high amounts of I/O, rather than consuming too much memory and swapping.  This
*problem* presents itself on any of the FBSD 6.2 machines, while the 4.11
machine runs without spamassassin needing to be restarted.


In fact, I caught the problem in action while writing this...FBSD 6.2....here is
the problem without me restarting anything...yet:

last pid: 30559;  load averages:  0.14,  0.79,  0.97                           
                                                                               
                 
227 processes: 1 running, 224 sleeping, 2 zombie
CPU states:  0.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 509M Active, 1174M Inact, 262M Wired, 92M Cache, 112M Buf, 1479M Free
Swap: 8192M Total, 20K Used, 8192M Free

  PID USERNAME      THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  487 dnslog          1  -8    0  1208K   576K piperd 2 425:26  0.00% multilog
24560 dnscache        1  76    0  4508K  3952K select 2 348:18  0.00% dnscache
  479 qmaill          1  -8    0  1208K   576K piperd 0  52:10  0.00% multilog
  357 root            1  76    0  1300K   840K select 0  45:02  0.00% syslogd
  551 mysql           8  20    0 56852K 25300K kserel 0  43:35  0.00% mysqld
58235 qmails          1  76    0  1696K  1100K select 0  37:14  0.00% qmail-send
  480 qmaill          1  -8    0  1208K   576K piperd 0  19:49  0.00% multilog
58237 qmailr          1  76    0  1328K   768K select 0  10:19  0.00% qmail-rspawn
  575 easyagnt        1   4    0  6216K  5444K accept 0   6:49  0.00% perl
  467 root            1   8    0  1228K   632K nanslp 0   3:35  0.00% svscan
58238 qmailq          1  -8    0  1196K   564K piperd 0   3:03  0.00% qmail-clean
27670 qmails          1   4    0  1212K   712K accept 0   2:32  0.00% tcpserver
14275 spamd           1  78    0 59936K 55356K select 2   1:23  0.00% perl5.8.8
15536 spamd           1  80    0 58488K 53764K select 2   1:21  0.00% perl5.8.8
58236 root            1  76    0  1200K   576K select 3   1:17  0.00% qmail-lspawn
  552 root            1   4    0  3356K  1684K select 0   0:47  0.00% sshd
  559 root            1   8    0  1328K   904K nanslp 0   0:28  0.00% cron
92332 root            1   8    0 46776K 43240K nanslp 0   0:22  0.00% perl5.8.8
 2152 spamd           1   8    0 52908K 48440K exitho 3   0:16  0.00% perl5.8.8
28179 spamd           1  78    0 51600K 47704K select 2   0:13  0.00% perl5.8.8
 2246 spamd           1   8    0 59632K 54208K exitho 3   0:12  0.00% perl5.8.8
45012 admin           1  76    0  6080K  2592K select 0   0:12  0.00% sshd
28125 spamd           1  80    0 51584K 47612K select 0   0:10  0.00% perl5.8.8
92361 root            1  76    0  2744K  1996K CPU0   0   0:09  0.00% top
  457 root            1  76    0  1212K   696K select 0   0:08  0.00% usbd
28923 spamd           1  79    0 52096K 47984K select 2   0:06  0.00% perl5.8.8
29049 spamd           1  79    0 52344K 48316K select 2   0:06  0.00% perl5.8.8
 2052 spamd           1   8    0 50372K 46548K exitho 2   0:04  0.00% perl5.8.8
29175 spamd           1  77    0 49700K 46052K select 2   0:03  0.00% perl5.8.8
 1830 spamd           1   8    0 50200K 46116K exitho 0   0:02  0.00% perl5.8.8
29548 spamd           1  82    0 49136K 45512K select 3   0:02  0.00% perl5.8.8
 2268 spamd           1   8    0 48920K 45204K exitho 2   0:01  0.00% perl5.8.8
 1511 spamd           1   8    0 47624K 43928K exitho 1   0:01  0.00% perl5.8.8
  381 spamd           1   8    0 48472K 44840K exitho 1   0:00  0.00% perl5.8.8
 1674 spamd           1   8    0 50024K 45904K exitho 1   0:00  0.00% perl5.8.8
  385 spamd           1   8    0 47372K 43688K exitho 1   0:00  0.00% perl5.8.8
 1841 spamd           1   8    0 48468K 44780K exitho 3   0:00  0.00% perl5.8.8
 1641 spamd           1   8    0 47328K 43648K exitho 1   0:00  0.00% perl5.8.8


and the current smtp connections:

smtp-proxy03# fgrep status /var/log/smtpd/current | tail -1
@400000004602147f2c00658c tcpserver: status: 18/120

Message from /var/log/messages:
Mar 22 00:26:42 smtp-proxy03 spamd[92332]: prefork: select returned -1!
recovering: Bad file descriptor 
Mar 22 00:27:13 smtp-proxy03 last message repeated 31 times
Mar 22 00:29:14 smtp-proxy03 last message repeated 121 times


After a restart of spamd, everything is back to normal.

Again, take note that this happens on FBSD 6.2 only as far as I can tell and
doesn't affect FBSD 4.11.  I haven't tested FBSD 5.x, cause FBSD 5.x is garbage.

I can provide any additional information as this happens up to something like 10
times a day.  Just let me know whatever information I can provide so that this
can get fixed.  This is a show stopper (almost) ;)

Thanks
Comment 15 Daryl C. W. O'Shea 2007-03-21 23:21:10 UTC
(In reply to comment #14)
> This is not a memory resource problem as Daryl C. W. O'Shea keeps on trying to
> suggest.

I was actually looking for evidence of a timing issue.  I'm pretty sure there
are still race conditions in the preforking code.  Of the dozens of times people
have reported problems with running a large number of children they're almost
always using a GB of two of swap so it's sensible to rule out memory issues
first.  One question about how much RAM, and another to confirm if it's enough
(because on systems I've seen with 4GB of RAM and lots of rules and plugins and
user configs it is tight) isn't as if I'm hung up on it being swap thrashing or
OOM problems.


> Memory usage is light, and the problem happens regardless or load.  Even with 60
> concurrent processes, my setup actually happens to reach a disk limitation with
> high amounts of I/O, rather than consuming too much memory and swapping.  This
> *problem* presents itself on any of the FBSD 6.2 machines, while the 4.11
> machine runs without spamassassin needing to be restarted.

That's useful info.  Not surprising at all either.

Now that we've got info about affected versions of FreeBSD Justin or myself can
actually attempt to track this down.
Comment 16 Paul 2007-03-21 23:46:33 UTC
(In reply to comment #15)
> (In reply to comment #14)
If I came off wrong, I apologize for that ;)  I'm continuing to do some digging
and I've filled out the following report with the BSD team:

http://www.freebsd.org/cgi/query-pr.cgi?pr=110657

Obviously the changes between 4.11 and 6.2 are significant, and thus far the
only real thing I can see in the cleanup code for BSD 6.2 is:

Lines 173-197:sys/kern/kern_exit.c
while (p->p_lock > 0)
msleep(&p->p_lock, &p->p_mtx, PWAIT, "exithold", 0);

Notice the "exithold"?   That's the state of the processes when this happens. 
Not sure if the above has any relevance, but it may....just wanted to point it
out...

Thanks Man
Comment 17 Justin Mason 2007-03-22 03:14:17 UTC
Paul, does the --round-robin switch fix the issue on your machines?  If so, it's
definitely something in the preforking code; otherwise, I'd suspect somewhere
else (like Bayes, since that's very I/O heavy).
Comment 18 Justin Mason 2007-03-22 03:53:29 UTC
(related: bug 5385 is a separate proposal to re-exec spamd in case of this kind
of tail-spin bug.)
Comment 19 Loren Wilton 2007-03-22 06:21:00 UTC
Wasn't there something about potential problems on multiprocessor system in 
Perl or SA?  His problems are only showing up on the MP systems, I think.
Comment 20 Justin Mason 2007-03-22 06:35:29 UTC
(In reply to comment #19)
> Wasn't there something about potential problems on multiprocessor system in 
> Perl or SA?  His problems are only showing up on the MP systems, I think.

SA deals with concurrency only at the kernel level, since it (and therefore its
perl interpreter) are not natively using threading.  so it's still up to the
kernel there.
Comment 21 Justin Mason 2007-03-22 06:37:08 UTC
actually, it's not an MP issue. comment 14 states:

'1 x FBSD 4.11 machine, 4Gigs of ram, dual proc [...] This machine never ever
runs into any problems.  It's the other machines that have the problems.  They
are: 4 x FBSD 6.2, 4Gigs of ram, dual proc [...]'

so that's not it; both good and bad machines are MP.
Comment 22 Paul 2007-03-22 08:42:44 UTC
(In reply to comment #21)

Justin, I have currently changed the params on one of the machines to use
--round-robin.  When the machine hangs (if the machine hangs) I'll let you know.
 I'll let you know tonight either way.

>> Wasn't there something about potential problems on multiprocessor system in 
>> Perl or SA?  His problems are only showing up on the MP systems, I think.
All machines that I have described thus far are all SMP systems.  Again, the
only real difference between them being the version of FBSD.
Comment 23 Paul 2007-03-22 08:46:16 UTC
Oh Justin, and since you mentioned bayes because of heavy disk I/O, let me share
my config, as it certainly isn't bayes.  This config is on every machine (4.11
included):
----------------------------------------
----------------------------------------
required_score 4.0

add_header spam Flag _YESNOCAPS_
add_header all Status _YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTS_
autolearn=_AUTOLEARN_ version=_VERSION_
report_safe 1
skip_rbl_checks 1


#   Use Bayesian classifier (default: 1)
#
use_bayes 0
use_auto_whitelist 0

# Always score uribl really high
score URIBL_BLACK 5.0

# blackberrys suck for spamassassin
score LW_STOCK_SPAM4            1.0     
score TVD_FW_GRAPHIC_ID3        0.5
score TVD_FW_GRAPHIC_ID1        0.5
score SARE_GIF_STOX             0.5
score PART_CID_STOCK_LESS       0.5
score FB_PENIS                  0.5
score SARE_RAND_2               0.5
score TVD_FW_MESG1              0.5
score PART_CID_STOCK            1.7
score SUBJ_LIFE_INSURANCE       1.0
score SARE_SUB_TERM_LIFE        1.0

### Once spamassassin 3.2 gets installed, remove these next 3 lines.
score RATWARE_OUTLOOK_NONAME    0.5
score RATWARE_MS_HASH           0.5
score MSGID_DOLLARS             1.0 

### Seeing this get caught up a lot. Reducing score because clients are complaining.
score EXTRA_MPART_TYPE 0.1
Comment 24 Justin Mason 2007-03-22 09:04:48 UTC
ok, so there should be little or no disk I/O from that config.  wierd that
you're seeing that happening!
Comment 25 Paul 2007-03-22 10:28:31 UTC
In regards to the disk I/O, which isn't necessarily the problem ;P.....I have
truss'd the children proccesses and the only things that I can see that seem
"stupid" is the excessive stat'ing that it's doing.  Each children currently
attempt to stat the files more than once.  I guess, the children don't cache the
initial stat they had done.  My comments within '|'.

stat("servers.discovery.lst",0x28171900)         ERR#2 'No such file or directory'
open("/servers.discovery.lst",0x0,0666)          ERR#2 'No such file or directory'
stat("servers.nomination.lst",0x28171900)        ERR#2 'No such file or directory'
open("/servers.nomination.lst",0x0,0666)         ERR#2 'No such file or directory'
stat("servers.catalogue.lst",0x28171900)         ERR#2 'No such file or directory'
open("/servers.catalogue.lst",0x0,0666)          ERR#2 'No such file or directory'

| What exactly are we stat'ing here? |
stat("",0xbfbfea90)                              ERR#2 'No such file or directory'

| We just did these.  Why are we stat'ing them again? |
stat("servers.catalogue.lst",0x28171900)         ERR#2 'No such file or directory'
stat("servers.catalogue.lst",0x28171900)         ERR#2 'No such file or directory'
stat("servers.discovery.lst",0x28171900)         ERR#2 'No such file or directory'
stat("servers.discovery.lst",0x28171900)         ERR#2 'No such file or directory'


stat("/sbin",0xbfbfe930)                         = 0 (0x0)
stat("/bin",0xbfbfe930)                          = 0 (0x0)
stat("/usr/sbin",0xbfbfe930)                     = 0 (0x0)
stat("/usr/bin",0xbfbfe930)                      = 0 (0x0)

| Do we really need to stat /usr/games? |
stat("/usr/games",0xbfbfe930)                    = 0 (0x0)

stat("/usr/local/sbin",0xbfbfe930)               = 0 (0x0)
stat("/usr/local/bin",0xbfbfe930)                = 0 (0x0)
stat("/root/bin",0xbfbfe930)                     = 0 (0x0)


Anyways, this is irrelevant, but did want the mention the children do repeat the
above mutiple times before the child process gets destroyed.  Again, I really
don't want to turn this problem report into a high I/O spamassassin scenerio as
that is less critical than spamassassin just stopping completely.  If the above
is useful, then great, otherwise disregard it altogether.

Thanks Justin
Comment 26 Justin Mason 2007-03-22 10:39:51 UTC
I think all those stat()s should be cached anyway; but the first set is Razor,
which you could try disabling to see if it makes a difference? and the second is
just perl checking each dir in the $PATH.
Comment 27 Justin Mason 2007-03-22 11:21:17 UTC
Created attachment 3886 [details]
patch

BTW, please try applying this patch (should apply ok to 3.1.x).  I have a
feeling it may fix the bug!  it also adds some additional lightweight
instrumentation that may at least produce useful info in the logs to aid in
debugging.
Comment 28 Paul 2007-03-22 11:53:26 UTC
(In reply to comment #27)
> Created an attachment (id=3886) [edit]
> patch
> 
> BTW, please try applying this patch (should apply ok to 3.1.x).  I have a
> feeling it may fix the bug!  it also adds some additional lightweight
> instrumentation that may at least produce useful info in the logs to aid in
> debugging.

Of the 4 machines experiencing this problem:

1 - Has had the patch applied
1 - Has been set to run with "--round-robin"
2 - Have been untouched.

"--round-robin" is a complete memory hog (which I'm sure you know).  Other than
that, it's stable thus far.

I will let you know the results here within the next 24-48 hours.

Thanks Justin!
Comment 29 Paul 2007-03-22 23:26:05 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > Created an attachment (id=3886) [edit] [edit]
> > patch
> > 
> > BTW, please try applying this patch (should apply ok to 3.1.x).  I have a
> > feeling it may fix the bug!  it also adds some additional lightweight
> > instrumentation that may at least produce useful info in the logs to aid in
> > debugging.
> 
> Of the 4 machines experiencing this problem:
> 
> 1 - Has had the patch applied
> 1 - Has been set to run with "--round-robin"
> 2 - Have been untouched.
> 
> "--round-robin" is a complete memory hog (which I'm sure you know).  Other than
> that, it's stable thus far.
> 
> I will let you know the results here within the next 24-48 hours.
> 
> Thanks Justin!


Summary of patch:


Firstly, a snippet from the logs with your patch applied:
[...snip...]
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 61509 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
113. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
221. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313
IO::Socket=GLOB(0x95f9548)/18 removed from selector at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
381. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
221. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 61508 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
113. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
221. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313
IO::Socket=GLOB(0x95f732c)/17 removed from selector at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
381. 
Mar 23 01:08:16 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
221. 
Mar 23 01:08:17 smtp-proxy02 spamd[62786]: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
221. 
[...snip...]

Good news!  The bug has yet to crop up and it's been well over the time period
it normally takes for this bug to crop up.

This particular bug, seems to be fixed with your patch applied.  Certainly if I
experience anything different I will append to this bug report with any further
findings.

In regards to round-robin, (since you had told me to try it) it too seems to fix
the problem as suggested, however be aware that even with 4 gigs of memory I
still at one point ate up all memory and additionally over 1g of swap.  Not a
good solution for high volume filters like I have setup.  This just an
informative message and will be changing from round-robin back to 3.1.8 w/ your
patch applied in the morning. ;)

The other two remaining machines that I have yet to touch will remain as such
until our busiest days see the light of day with your patch applied
(Monday/Tuesday).  Again, if I happen to run into any problems on these given
days, you'll hear from me, otherwise, it seems we are good to go with thus far.

Thanks for the help.
Comment 30 Justin Mason 2007-03-23 04:54:32 UTC
great news so far!  your plan sounds good.

retitling bug to reflect the symptoms.  if this patch works, we can apply to
3.2.0, probably for an rc2 tarball.


Comment 31 Paul 2007-03-23 12:27:54 UTC
I may have spoke too soon.

Mar 23 14:27:18 smtp-proxy02 spamc[60677]: connect(AF_UNIX) to spamd
/tmp/.spamd.sock failed: Connection refused

High loads have brought spamd down completely with your patch, which has never
happened before.  I have restarted the spamd process to see if this is a fluke
or something else.  At this point I'm not sure and will provide additional
information once received.
Comment 32 Justin Mason 2007-03-25 06:16:27 UTC
Paul -- btw, can you post the log output leading up to the spamd crash?
Comment 33 Justin Mason 2007-03-27 11:42:27 UTC
well, some good news from Frank Urban, posted at bug 4590:

'I applied the patch from bug 5313 yesterday morning.
Since this time I didn't see this problem [bug 4590] again. Until yesterday if
have seen it 3 times a day and was only able to stay alive with the "spamd restart 
hack ".

Looks good now...'
Comment 34 Paul 2007-03-27 15:44:34 UTC
Justin, I think everything is fine at this point.  I haven't run into the
problem with spamd crapping out again.  However, per your request, here is the
top <50 lines prior to the crash:

Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock:
00001101111011111111111111111111111111
1111111111111111111111110011010110000000000000000000000000 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam
Assassin/SpamdForkScaling.pm line 147. 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock:
00001101111011111111111111111111111111
1111111111111011111111110011010110000000000000000000000000 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam
Assassin/SpamdForkScaling.pm line 150. 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: killed child 41301 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: child states:
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBIII 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 child_exited 41324 at
/usr/local/lib/perl5/site_perl/
5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 113. 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: spamd: handled cleanup of child pid
41324 due to SIGCHLD 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: oops! no socket for child
41301, killing at /usr/local/l
ib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 422. 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: prefork: killing failed child 41301
fd=undefined at /usr/local/li
b/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 138. 
Mar 23 07:37:19 smtp-proxy02 spamd[62786]: JMD bug5313 deleting sock:
00001101111011111111111111111111111111
1111111111111011111111110011010110000000000000000000000000 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/Spam
Assassin/SpamdForkScaling.pm line 147. 
Mar 23 07:37:20 smtp-proxy02 spamd[62786]: Can't use an undefined value as a
symbol reference at /usr/local/
lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SubProcBackChannel.pm line 109. 
Mar 23 07:37:23 smtp-proxy02 spamd[41271]: spamd: clean message (0.0/4.0) for
qmails:58 in 17.3 seconds, 263
3 bytes. 
Mar 23 07:37:23 smtp-proxy02 spamd[41271]: spamd: result: . 0 -
scantime=17.3,size=2633,user=qmails,uid=58,r
equired_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=/tmp/.spamd.sock,mid=<10703230836.AA13165890@mail16.
datapipe.com>,autolearn=disabled 
Mar 23 07:37:23 smtp-proxy02 spamd[41276]: spamd: identified spam (18.5/4.0) for
qmails:58 in 17.7 seconds, 
9518 bytes. 
Mar 23 07:37:23 smtp-proxy02 spamd[41276]: spamd: result: Y 18 -
HTML_10_20,HTML_FONT_BIG,HTML_MESSAGE,MIME_
HTML_MOSTLY,MPART_ALT_DIFF,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,UNPARSEABLE_RELAY,U
RIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL
scantime=17.7,size=9518,user=qmails,uid
=58,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=/tmp/.spamd.sock,mid=<2600601211.20070323123540
@greatd.com>,autolearn=disabled 
Mar 23 07:37:26 smtp-proxy02 spamc[41836]: connect(AF_UNIX) to spamd
/tmp/.spamd.sock failed: Connection ref
used
Comment 35 Paul 2007-03-27 15:59:33 UTC
BTW,

If you need me to by chance post the entire log for the day somewhere, I
certainly can make those arrangements.
Comment 36 Justin Mason 2007-03-27 16:41:06 UTC
Created attachment 3890 [details]
patch rev 2

I was able to reproduce that crash with some additional test instrumentation;
it was caused by an attempt to call fileno() on an undef filehamdle. This patch
is a revision that avoids it; could you try applying this in place of patch
3886?
Comment 37 Paul 2007-03-27 16:47:11 UTC
Patch has been applied.

Will keep you updated regarding any failures.
Comment 38 Justin Mason 2007-03-28 11:10:08 UTC
Created attachment 3891 [details]
patch rev 3

urgh.  When auditing that code, I spotted another bug -- instead of die()ing
when writes fail persistently, it was calling write() in a freudian-slip style,
which was essentially a no-op, I think.

This meant that some required error-recovery code would never happen, and I
think it was possible to wind up with child processes seeming to get stuck in
an idle state.	This patch is a replacement (again) which fixes that issue,
too.

I think this would be irrelevant to the current bug, but it's definitely an
important fix that may be related.

so guys -- could you update the test systems with *this* patch?  sorry ;)
Comment 39 Justin Mason 2007-03-29 12:25:44 UTC
anyone run into any failures yet? ;)
Comment 40 Paul 2007-03-29 15:55:03 UTC
Nothing yet, but will certainly report something when/if I find something.
Comment 41 Justin Mason 2007-04-02 09:58:25 UTC
I take it from the silence that this code seems to be working pretty well? ;)
good news, if so.
Comment 42 Justin Mason 2007-04-03 11:37:50 UTC
OK, given that no news is good news, I've applied this code to trunk:

: jm 50...; svn commit -m "bug 5313: fix spamd 'prefork: select returned -1,
recovering: Bad file descriptor' error.  do this by ensuring FDs are fully
cleaned up (and removed from the selector bitvec) if a child is killed, and a
few other related cleanups, to fully clarify the code.  also, fix a case where
write errors from child to parent were not being caught or logged, due to
incorrect use of write() instead of die().  finally, add test instrumentation
code, allowing simulation of certain reported system errors.  Note: this checkin
also contains additional debugging log messages"
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Sending        lib/Mail/SpamAssassin/SubProcBackChannel.pm
Transmitting file data ..
Committed revision 525237.
: exit=0 Tue Apr  3 19:34:23 IST 2007; cd /home/jm/ftp/spamassassin
: jm 51...; svn commit -m "remove bug 5313 debugging log messages"
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Transmitting file data .
Committed revision 525238.


(two separate commits, so that the debugging messages can be reenabled simply by
reversing commit 525238.)

resolving, optimistically.
Please comment if you have applied the patches and see anything odd/worrying...
Comment 43 Michael Naef 2007-04-04 02:10:22 UTC
Hi all

[..]
> so guys -- could you update the test systems with *this* patch?  sorry ;)

One of our spamds with this patch applied crashed last night. The log says:

Apr  3 23:28:45 mxin001 spamd[75510]: spamd: result: Y 12 -
DCC_CHECK,DIGEST_MULTIPLE,FROM_LOCAL_NOVOWEL,HTML_MESSAGE,MIME_HTML_ONLY,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_IN_XBL
scantime=4.6,size=16618,user=info@eg***.ch,uid=0,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=57784,mid=<B09A8404.722DE83@fastmail.fm>,autolearn=disabled

Apr  3 23:28:45 mxin001 spamd[75890]: spamd: identified spam (28.9/5.0) for
waelchli@la***.ch:0 in 5.1 seconds, 3187 bytes. 
Apr  3 23:28:45 mxin001 spamd[75890]: spamd: result: Y 28 -
BODY_ENHANCEMENT2,DCC_CHECK,DIGEST_MULTIPLE,DRUGS_DIET,DRUGS_ERECTILE,DRUGS_MANYKINDS,DRUGS_MUSCLE,DRUGS_PAIN,DRUG_ED_GENERIC,HG_HORMONE,HTML_MESSAGE,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,UNPARSEABLE_RELAY,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL
scantime=5.1,size=3187,user=waelchli@la***.ch,uid=0,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=57786,mid=<0982154613.20070403212835@gotname.com>,autolearn=disabled

Apr  3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 527. 
Apr  3 23:28:46 mxin001 spamd[75445]: spamd: identified spam (13.0/2.0) for
katharina@be***.ch:0 in 5.5 seconds, 2425 bytes. 
Apr  3 23:28:46 mxin001 spamd[75445]: spamd: result: Y 12 -
HTML_FONT_BIG,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,SPF_HELO_SOFTFAIL
scantime=5.5,size=2425,user=katharina@be***.ch,uid=0,required_score=2.0,rhost=localhost,raddr=127.0.0.1,rport=57777,mid=<546e663w.3248955@rider.edu>,autolearn=disabled

Apr  3 23:28:46 mxin001 spamd[75445]: syswrite() to parent failed: Broken pipe
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 527. 

If I look for such calls i find:

argus mail $ bzgrep mxin001 spamd.0.bz2 | grep 'syswrite()'
Apr  3 17:08:41 mxin001 spamd[30071]: syswrite() on closed filehandle GEN7 at
/usr/local/lib/perl5/5.8.8/mach/IO/Handle.pm line 453. 
Apr  3 21:16:53 mxin001 spamd[58574]: syswrite() on closed filehandle GEN34885
at /usr/local/lib/perl5/5.8.8/mach/IO/Handle.pm line 453. 
Apr  3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 527. 
Apr  3 23:28:46 mxin001 spamd[75445]: syswrite() to parent failed: Broken pipe
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 527. 

The system is a FreeBSD 6.2-RELEASE with 8 Intel cores and 4 gigs of RAM.
SpamAssassin is

SpamAssassin version 3.1.8
running on Perl version 5.8.8

Michael
Comment 44 Justin Mason 2007-04-04 04:00:36 UTC
(In reply to comment #43)

> Apr  3 23:28:45 mxin001 spamd[75890]: syswrite() to parent failed: Broken pipe
> at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
> line 527. 

Did this interfere with scanning, or did the spamd recover from the error
condition OK?  what should have happened is that the child spamd died, and the
parent forked a new one to replace it.
Comment 45 Justin Mason 2007-04-04 04:08:00 UTC
(oh btw, if you could attach (as attachment) a full log leading up to those
errors, that'd be cool too. I'm curious as to what could be causing them...)
Comment 46 Michael Naef 2007-04-04 06:03:48 UTC
(In reply to comment #44)
[..]
> Did this interfere with scanning, or did the spamd recover from the error
> condition OK?  what should have happened is that the child spamd died, and the
> parent forked a new one to replace it.

No, it didn't recover. This was the last message from spamd until I restarted it.

> (oh btw, if you could attach (as attachment) a full log leading up to those
> errors, that'd be cool too. I'm curious as to what could be causing them...)

You don't really want that, it's a 1 gig log from yesterday... Besides I cannot
provide it to you due to data protection regulations... If you have specific
parts of interests I will sift the messages out for you.

Michael
Comment 47 Justin Mason 2007-04-04 07:21:41 UTC
> No, it didn't recover. This was the last message from spamd until I restarted it.

not so good :(

Here's the thing -- I would suspect that a "broken pipe" error is indicative of
the parent spamd process *already* being dead, and hence breaking the pipes it
had open to the children.  So it's a side-effect, not a symptom.

> > (oh btw, if you could attach (as attachment) a full log leading up to those
> > errors, that'd be cool too. I'm curious as to what could be causing them...)
> 
> You don't really want that, it's a 1 gig log from yesterday... Besides I cannot
> provide it to you due to data protection regulations... If you have specific
> parts of interests I will sift the messages out for you.

can you replace all these lines:

Apr  3 23:28:46 mxin001 spamd[75445]: spamd: identified spam (13.0/2.0) for
katharina@be***.ch:0 in 5.5 seconds, 2425 bytes. 
Apr  3 23:28:46 mxin001 spamd[75445]: spamd: result: Y 12 -
HTML_FONT_BIG,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,SPF_HELO_SOFTFAIL
scantime=5.5,size=2425,user=katharina@be***.ch,uid=0,required_score=2.0,rhost=localhost,raddr=127.0.0.1,rport=57777,mid=<546e663w.3248955@rider.edu>,autolearn=disabled

with non-identifying obfuscation, e.g. "[identified spam line]", "[result
line]", "[processing message]", and therefore be able to pass it on?

basically I don't need to see the normal lines of spamd operation, just the
errors plus any progress report info from the prefork subsystem, connection
opens/closes, forking, that kind of thing.  

Alternatively, stated as a "what to sift": here's the kind of thing I want:

[19766] info: spamd: connection from localhost [127.0.0.1] at port 59785
[19762] info: prefork: child states: BB
[19762] info: spamd: server successfully spawned child process, pid 19784
[19762] info: prefork: server reached --max-children setting, consider raising it
[19762] info: spamd: server killed by SIGTERM, shutting down
[19762] info: spamd: handled cleanup of child pid 20046 due to SIGCHLD

plus any errors.

also, I don't need the full log, maybe the most recent 1000 lines I think...

thanks ;)
Comment 48 Michael Naef 2007-04-05 03:05:53 UTC
(In reply to comment #47)
[..]
> Here's the thing -- I would suspect that a "broken pipe" error is indicative of
> the parent spamd process *already* being dead, and hence breaking the pipes it
> had open to the children.

Thats what I think, too.

[..]

> also, I don't need the full log, maybe the most recent 1000 lines I think...

I replaced all email-like parts. You can get the last lines from

ftp://ftp.linu.gs/tmp/spamd.out.gz

Michael
Comment 49 Justin Mason 2007-04-05 04:52:41 UTC
thanks for that log. very odd!

it seems to go like this:

Apr  3 23:28:32 mxin001 spamd[58574]: spamd: server successfully spawned child
process, pid 75698
Apr  3 23:28:32 mxin001 spamd[75698]: spamd: connection from localhost
[127.0.0.1] at port 57715
Apr  3 23:28:33 mxin001 spamd[75698]: spamd: processing message <[...snip...]>
for [...snip...]:0
Apr  3 23:28:41 mxin001 spamd[75698]: spamd: identified spam (8.5/4.0) for
[...snip...]:0 in 9.3 seconds, 22879 bytes.
Apr  3 23:28:41 mxin001 spamd[75698]: spamd: result: Y 8 -
DCC_CHECK,DIGEST_MULTIPLE,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_IN_XBL,TRACKER_ID
scantime=9.3,size=22879,user=[...snip...],uid=0,required_score=4.0,rhost=localhost,raddr=127.0.0.1,rport=57715,mid=<[...snip...]>,autolearn=disabled

So far so good.  Then the error condition from bug 4590 occurs:

Apr  3 23:28:41 mxin001 spamd[58574]: prefork: ordered child 75698 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 481.

And recovery takes place:

Apr  3 23:28:41 mxin001 spamd[58574]: prefork: killing failed child 75698 fd=12
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 144.
Apr  3 23:28:41 mxin001 spamd[58574]: JMD bug5313 deleting sock:
0000011011111111111111110101111110000100000000000000000000000000000000000000000000000000
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 153.
Apr  3 23:28:41 mxin001 spamd[58574]: JMD bug5313 deleting sock:
0000011011110111111111110101111110000100000000000000000000000000000000000000000000000000
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 160.
Apr  3 23:28:41 mxin001 spamd[58574]: prefork: killed child 75698
Apr  3 23:28:41 mxin001 spamd[58574]: JMD bug5313 child_exited 75698 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 119.
Apr  3 23:28:41 mxin001 spamd[58574]: spamd: handled cleanup of child pid 75698
due to SIGCHLD

That's all correct. then this:

Apr  3 23:28:41 mxin001 spamd[58574]: Argument "CHLD" isn't numeric in numeric
eq (==) at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 713.

A total mystery!  as far as I can see there's no way that this string could get
into the {kids} hash that that line examines!  Even if it does, it should
produce a "?" in this line:

Apr  3 23:28:41 mxin001 spamd[58574]: prefork: child states: BBKBBBBBBBBBBBBBBBBSBB
BBB

As you can see, no "?".  Regardless, it should be harmless; the spamd carries on:

Apr  3 23:28:41 mxin001 spamd[58574]: spamd: server successfully spawned child
process, pid 75913

But that is the last line from the parent spamd (pid 58574) until the
"syswrite() to parent failed: Broken pipe" errors at EOF.  In other words, the
spamd definitely died, silently, some time after that "successful spawn" line.
Odd.  I'd have to guess something's really wrong in the perl interpreter. :(


I do notice that that log contains *lots* of 'ordered child 68857 to accept,
but they reported state '1', killing rogue' messages.  I was under the
impression this was relatively rare, but you seem to get this happening every
15-20 seconds!  Is it normally this frequent?   If so, the next time it starts
happening that frequently, any chance I could get an strace log of the spamd
parent and child processes, along with a simultaneous copy of the syslog
similar to this one?  

It'd be *very* useful to get to the bottom of this "state '1'" mystery, which
I've been unable to reproduce here so far, and that might help avoid the whole
situation...
Comment 50 Michael Naef 2007-04-05 08:57:33 UTC
> I do notice that that log contains *lots* of 'ordered child 68857 to accept,
> but they reported state '1', killing rogue' messages.  I was under the
> impression this was relatively rare, but you seem to get this happening every
> 15-20 seconds!  Is it normally this frequent?   If so, the next time it starts
> happening that frequently, any chance I could get an strace log of the spamd
> parent and child processes, along with a simultaneous copy of the syslog
> similar to this one?  

The 'killing rogue' messages kept appearing, alhough in decreasing frequency. At
the moment they have gone for all. I will keep monitoring (have to :) and supply
the your rquested information as soon as it starts happening again. Spamd hasn't
crashed yet until the last failure. Interestingly the other servers (same job,
different type of hardware) don't show these symptoms.

Thanks a lot for your work and have a great easter, Michael
Comment 51 Jeremy Kister 2007-04-06 02:03:51 UTC
patch rev 3 applied clean and all compiled fine

But now, i get a different warning:

2007-04-06 04:49:12 [99625] warn: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 219.
2007-04-06 04:49:14 [99625] warn: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 219.
2007-04-06 04:49:16 [99625] warn: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 219.

I'm getting the warning every two seconds, but spamd
does seem to be working.
Comment 52 Justin Mason 2007-04-06 02:35:05 UTC
Jeremy, if you read this bug ;) you'll notice that I mentioned 'it also adds
some additional lightweight instrumentation that may at least produce useful
info in the logs to aid in debugging.'

That's what the "JMD" noise is -- totally harmless.
Comment 53 Jeremy Kister 2007-04-06 09:26:32 UTC
thanks for the reply justin,


I realized it was debugging info.  I didn't realize I had children exiting 
every two seconds under no load.

Thanks again,
Comment 54 Justin Mason 2007-04-08 13:56:59 UTC
> I realized it was debugging info.  I didn't realize I had children exiting 
> every two seconds under no load.

they're not. read the log again. ;) --

2007-04-06 04:49:16 [99625] warn: JMD bug5313 child_just_exited = 0 at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
line 219.

So every 2 seconds a log message appears noting that the boolean var
"child_just_exited" is set to 0, ie. false.  In other words, no child has
exited. seriously, don't worry about it.
Comment 55 Justin Mason 2007-04-10 04:09:18 UTC
any luck, anyone?  comments either way ("works for me" or not) would be welcome,
just to get an idea of how well this fix is working in the field....
Comment 56 Michael Naef 2007-04-11 08:13:25 UTC
(In reply to comment #55)
> any luck, anyone?  comments either way ("works for me" or not) would be welcome,
> just to get an idea of how well this fix is working in the field....

We had no more crashes for about 6 days now. It seems to work quite well. Also
the  messages "prefork: killing rogue child 3490, failed to write on fd" have
allmost fully dispeared (Today were three of them).
Comment 57 sandys 2007-04-11 09:09:34 UTC
I put this patch in place about a day after you submitted the final version 
and took the --round-robin parameter off spamd.  It's been working perfectly 
ever since.  Before we were seeing this issue about once a day, so from our 
standpoint the patch seems to be working well.
Comment 58 Frank Urban 2007-04-11 10:11:28 UTC
what does this mean? Only informational?

Apr 11 19:08:14 mail1 spamd[10411]: JMD bug5313 child_just_exited = 0 at
/usr/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line 219.
Comment 59 Justin Mason 2007-04-11 10:23:28 UTC
purely debugging; it'll be removed for the "finished" version of the fix.  it
means no child has exited, iirc.
Comment 60 Mike 2007-04-11 10:50:34 UTC
Does anyone have an estimate when the next verision (3.2.0) will be released
with this patch included? 

Thanks.
Comment 61 Justin Mason 2007-04-11 11:19:58 UTC
(In reply to comment #60)
> Does anyone have an estimate when the next verision (3.2.0) will be released
> with this patch included? 

I'm hoping a week or two...
this is about the last major bug that is holding up 3.2.0 release candidates. 
The more "this is good" comments I get, the more likely it is to go quicker ;)

Comment 62 Justin Mason 2007-04-12 03:36:48 UTC
lowering priority and resolving... I think this is looking good, now, since
nobody has commented that they are still running into problems with the patch
applied.
Comment 63 Michael Naef 2007-04-18 00:33:02 UTC
I was finaly able to capture truss (strace) output while spamd ecountered
'killing rogue's again. There were several periods in the last days when this
error kept appearing more frequently.

Apr 18 09:12:53 mxin001 spamd[52389]: prefork: ordered child 8598 to accept, but
they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:13:53 mxin001 spamd[52389]: prefork: ordered child 10788 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:14:05 mxin001 spamd[52389]: prefork: ordered child 11129 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:14:17 mxin001 spamd[52389]: prefork: ordered child 11507 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:14:34 mxin001 spamd[52389]: prefork: ordered child 11960 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:15:06 mxin001 spamd[52389]: prefork: ordered child 12333 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 
Apr 18 09:15:19 mxin001 spamd[52389]: prefork: ordered child 13274 to accept,
but they reported state '1', killing rogue at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line
481. 

I put the 3mb truss output of the parent process on
ftp://ftp.linu.gs/tmp/spamd.truss.out.gz

Michael
Comment 64 Justin Mason 2007-04-18 09:08:54 UTC
thanks Michael -- that was very useful!  I have a better idea what may be
happening now.  Could you take a look at bug 5422:
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5422 and try the patch there?
Comment 65 Justin Mason 2007-04-30 09:03:33 UTC
Created attachment 3929 [details]
for reference; patch rev 3, redone without debugging statements and against 3.1.x HEAD