Bug 4696 - [review] tcp timeout -> SpamdForkScaling.pm line 195
Summary: [review] tcp timeout -> SpamdForkScaling.pm line 195
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.1.0
Hardware: All Linux
: P1 normal
Target Milestone: 3.1.1
Assignee: Justin Mason
URL:
Whiteboard:
Keywords:
: 4563 (view as bug list)
Depends on:
Blocks: 4476 4590
  Show dependency tree
 
Reported: 2005-11-21 16:08 UTC by Greg
Modified: 2006-03-09 11:54 UTC (History)
7 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
mail log traces w/ -Dprefork application/x-zip-compressed None George R. Kasica [NoCLA]
possible fix patch None Justin Mason [HasCLA]
better fix patch None Justin Mason [HasCLA]
further fixing patch None Justin Mason [HasCLA]
silly further fix patch None Justin Mason [HasCLA]
aggregated patch against b3_1_0 svn branch patch None Justin Mason [HasCLA]
aggregated patch against b3_1_0 svn branch take 2 patch None Justin Mason [HasCLA]
aggregated patch, against the 3.1.0 release tarball patch None Justin Mason [HasCLA]
aggregated patch, against the 3.1.0 release tarball, take 2 patch None Justin Mason [HasCLA]
aggregated patch against b3_1_0 svn branch take 3 patch None Justin Mason [HasCLA]
aggregated patch against the 3.1.0 release tarball take 3 patch None Justin Mason [HasCLA]
log text/plain None Gambit [NoCLA]
more logs text/plain None Gambit [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Greg 2005-11-21 16:08:38 UTC
Excuse my bad english, I am French...

I upgrade SA to 3.1.0 and now spamd died...
I search the same bug everywhere and I don't find it.

I write a sh script to restart automaticaly spamd, but it's not a good solution...

This is the seven line of log : 

Nov 21 05:35:30 ns2299 spamd[1153]: prefork: child states: II 
Nov 21 05:35:53 ns2299 spamd[1153]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Nov 21 05:35:55 ns2299 spamd[1153]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Nov 21 05:35:55 ns2299 spamd[1153]: prefork: select returned undef! recovering 
Nov 21 05:37:37 ns2299 spamd[1153]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Nov 21 05:37:41 ns2299 spamd[1153]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Nov 21 05:38:54 ns2299 spamc[19745]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#1 of 3): Connection refused
Comment 1 Greg 2005-11-22 15:38:35 UTC
The same problem again : 
Nov 22 05:54:18 ns2299 spamd[12426]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 

I saw this problem since twelve days always between 5:30 and 5:59 !
Very strange...

My configuration : RedHat server, Kernel 2.4 with grsec

Greg
Comment 2 seb 2005-12-08 13:52:46 UTC
Same problem here, occured only once so far

Dec  5 00:15:04 fedora spamd[29470]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Dec  5 00:15:04 fedora spamd[29470]: tcp timeout at
/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.

Dec  5 00:28:22 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#1 of 3): Connection refused
Dec  5 00:28:23 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#2 of 3): Connection refused
Dec  5 00:28:24 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#3 of 3): Connection refused


CentOS 3.6 kernel 2.4.21-37.EL
Comment 3 Dallas Engelken 2005-12-08 15:09:01 UTC
this may be related to bug 4693 - as they both involve the 'tcp timeout' being 
triggered... just slightly different outcomes.
Comment 4 Greg 2005-12-08 20:13:28 UTC
Hello I don't think it's the same problem. In fact the outcome is different...
I'm very embarrassed with this bug... I must restart spamd every day...(In reply
to comment #3)
> this may be related to bug 4693 - as they both involve the 'tcp timeout' being 
> triggered... just slightly different outcomes.

Comment 5 Sidney Markowitz 2005-12-08 20:20:43 UTC
> I saw this problem since twelve days always between 5:30 and 5:59 !

Do you have a cleaning person who unplugs a router to plug in the floor cleaner
every day? :-)

> Very strange...

Yes it is. Do you get the same results if you run spamd with the --round-robin
option?
Comment 6 seb 2005-12-08 20:32:40 UTC
indeed, I don't see any correlation with bug 4693

I run SA 3.1.0 since it's release date, it had been running for about 45 days
when the crash occured.

I didn't make any change/update to perl or anything else regarding postfix or
spamassassin in a while.. imho it's a spamassassin bug

(In reply to comment #4 and #5)
> Hello I don't think it's the same problem. In fact the outcome is different...
> I'm very embarrassed with this bug... I must restart spamd every day...(In reply
> to comment #3)
> > this may be related to bug 4693 - as they both involve the 'tcp timeout' being 
> > triggered... just slightly different outcomes.
> 
> 
Comment 7 Justin Mason 2005-12-20 02:56:10 UTC
I don't think these are related to the other "tcp timeout" bug.

Most often, when spamd dies at the same time every day or every week, in the
middle of the night, I would tend to suspect syslog cycling.  Spamd needs a
"kill -1" when syslogd is restarted...

If that's not it, I would suggest trying to capture some strace output; strace
-fp of the process that issues the warning.
Comment 8 Tuomas Silen 2005-12-29 14:23:47 UTC
(In reply to comment #7)
> I don't think these are related to the other "tcp timeout" bug.
> 
> Most often, when spamd dies at the same time every day or every week, in the
> middle of the night, I would tend to suspect syslog cycling.  Spamd needs a
> "kill -1" when syslogd is restarted...
> 

Very unlikely. For me this happened at the middle of the day after spamd running
for several weeks.

Here's my logs:
Signal 11 was received for spamd[4481] at 12:06:22

Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBBIBB
Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIIBBKBB
Dec 29 12:06:18 spamd[4481]: spamd: handled cleanup of child pid 8868 due to SIGCHLD
Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBKBB
Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBKIB
Dec 29 12:06:18 spamd[4481]: prefork: child states: BIIIBBKKB
Dec 29 12:06:18 spamd[4481]: spamd: handled cleanup of child pid 27700 due to
SIGCHLD
Dec 29 12:06:18 spamd[4481]: prefork: child states: BIIKBBKB
Dec 29 12:06:22 spamd[4481]: tcp timeout at
/usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Dec 29 12:06:22 spamd[4481]: tcp timeout at
/usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Dec 29 12:06:22 spamd[4481]: spamd: handled cleanup of child pid 6885 due to SIGCHLD
Dec 29 12:06:22 spamd[4481]: spamd: handled cleanup of child pid 22408 due to
SIGCHLD
Dec 29 12:06:22 spamd[4481]: prefork: child states: BIIBBB
Dec 29 12:06:24 spamd[30389]: spamd: clean message (-3.9/4.8) for ... in 7.7
seconds, 5490 bytes.
Dec 29 12:06:24 spamd[19326]: spamd: clean message (-3.9/4.8) for ... in 7.9
seconds, 5487 bytes.
Dec 29 12:06:24 spamd[1513]: spamd: clean message (-3.9/4.8) for ... in 7.5
seconds, 3883 bytes.
Dec 29 12:06:25 spamd[9107]: spamd: clean message (-4.2/4.8) for ... in 7.8
seconds, 3331 bytes.
Dec 29 12:06:25 spamc[9210]: connect(AF_UNIX) to spamd spamd.socket failed:
Connection refused
Comment 9 George R. Kasica 2006-01-10 06:06:01 UTC
Similar problem here with it dying at random during the day. No
recycling of logs or anything here. Linux kernel 2.6.12.2 with SA 3.1
and Exim 3.36

See below log piece I captured from a start at 18:25 until it died on
its own at 1853

Jan  9 18:22:45 eagle spamd[18307]: logger: removing stderr method  
Jan  9 18:25:09 eagle dccifd[18383]: write(MTA socket,89): Broken pipe
Jan  9 18:25:24 eagle spamd[18313]: spamd: server started on port
783/tcp (runni
ng version 3.1.0)   
Jan  9 18:25:24 eagle spamd[18313]: spamd: server pid: 18313   
Jan  9 18:25:24 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18414  
Jan  9 18:25:24 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18415  
Jan  9 18:25:24 eagle spamd[18414]: spamd: connection from localhost
[127.0.0.1]
 at port 59378   
Jan  9 18:25:24 eagle spamd[18414]: spamd: processing message
<43c2fe5f@happypat
tern.com> for mail:561  
Jan  9 18:25:24 eagle spamd[18313]: prefork: child states: BI
Jan  9 18:25:24 eagle spamd[18415]: spamd: connection from localhost
[127.0.0.1]
 at port 59381
Jan  9 18:25:24 eagle spamd[18415]: spamd: processing message
<o4vD3rkVl.v3YxK2d
9Af.nqkiRY1@spray.se> for mail:561
Jan  9 18:25:39 eagle spamd[18414]: spamd: identified spam (122.9/5.0)
for mail:
561 in 15.3 seconds, 5772 bytes.
Jan  9 18:25:40 eagle spamd[18414]: spamd: result: Y 122 -
DCC_CHECK,DIGEST_MULT
IPLE,FRONTPAGE,HTML_MESSAGE,HTML_TAG_BALANCE_BODY,RAZOR2_CF_RANGE_51_100,RAZOR2_
CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_EN_A_9XX_1,SARE_H
TML_HTML_BEFORE,SARE_SXLIFE,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_WS_SURBL,USER_IN
_BLACKLIST,WLS_URI_OPT_1395
scantime=15.3,size=5772,user=mail,uid=561,required_s
core=5.0,rhost=localhost,raddr=127.0.0.1,rport=59378,mid=<43c2fe5f@happypattern.
com>,autolearn=disabled
Jan  9 18:25:50 eagle spamd[18414]: __alarm__
Jan  9 18:25:50 eagle spamd[18414]: __alarm__
Jan  9 18:25:50 eagle spamd[18313]: prefork: child states: BB
Jan  9 18:25:50 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18433
Jan  9 18:25:50 eagle spamd[18313]: prefork: child states: BBB
Jan  9 18:25:50 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18434
Jan  9 18:25:50 eagle spamd[18313]: prefork: child states: BBBI
Jan  9 18:25:50 eagle spamd[18433]: spamd: connection from localhost
[127.0.0.1]
 at port 59385
Jan  9 18:25:50 eagle spamd[18433]: spamd: processing message
<Xzugd31hP.N12uh2t
D.iK9niF@postmaster.co.uk> for mail:561
Jan  9 18:25:50 eagle spamd[18414]: spamd: connection from localhost
[127.0.0.1]
 at port 59384
Jan  9 18:25:50 eagle spamd[18414]: spamd: processing message
<JH4gKuqgma.dPnxot
NlAS.SVXzzQw@mweb.com.cn> for mail:561
Jan  9 18:26:13 eagle spamd[18434]: spamd: connection from localhost
[127.0.0.1]
 at port 46548
Jan  9 18:26:13 eagle spamd[18434]: spamd: processing message
<000001c615a6$a397
6400$0100007f@localhost> for mail:561
Jan  9 18:26:32 eagle spamd[18415]: spamd: identified spam (25.1/5.0)
for mail:5
61 in 67.9 seconds, 772 bytes.
Jan  9 18:26:32 eagle spamd[18415]: spamd: result: Y 25 -
DCC_CHECK,DIGEST_MULTI
PLE,FROM_LOCAL_NOVOWEL,INFO_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100
,RAZOR2_CHECK,SARE_URI_RAW_ONLY,URIBL_JP_SURBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI
_NO_WWW_INFO_CGI,WLS_URI_OPT_1475
scantime=67.9,size=772,user=mail,uid=561,requi
red_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=59381,mid=<o4vD3rkVl.v3YxK2d
9Af.nqkiRY1@spray.se>,autolearn=disabled
Jan  9 18:26:50 eagle spamd[18415]: __alarm__
Jan  9 18:26:50 eagle spamd[18415]: __alarm__
Jan  9 18:26:54 eagle spamd[18313]: prefork: child states: BBBB
Jan  9 18:26:54 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18464
Jan  9 18:26:54 eagle spamd[18313]: prefork: child states: BBBBI
Jan  9 18:26:54 eagle spamd[18415]: spamd: connection from localhost
[127.0.0.1]
 at port 46551
Jan  9 18:26:55 eagle spamd[18415]: spamd: processing message
<001501c6157c$2a66
6740$8200a8c0@David> for mail:561
Jan  9 18:27:12 eagle spamd[18414]: spamd: identified spam (32.5/5.0)
for mail:5
61 in 82.2 seconds, 750 bytes.
Jan  9 18:27:12 eagle spamd[18414]: spamd: result: Y 32 -
INFO_TLD,RAZOR2_CF_RAN
GE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_URI_RAW_ONLY,URIBL_AB_SURB
L,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NO_W
WW_INFO_CGI,WLS_URI_OPT_3272
scantime=82.2,size=750,user=mail,uid=561,required_s
core=5.0,rhost=localhost,raddr=127.0.0.1,rport=59384,mid=<JH4gKuqgma.dPnxotNlAS.
SVXzzQw@mweb.com.cn>,autolearn=disabled
Jan  9 18:27:23 eagle spamd[18414]: __alarm__
Jan  9 18:27:23 eagle spamd[18414]: __alarm__
Jan  9 18:27:29 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:27:29 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:27:29 eagle spamd[18313]: prefork: select returned undef!
recovering
Jan  9 18:27:41 eagle spamd[18433]: spamd: identified spam (26.4/5.0)
for mail:5
61 in 110.7 seconds, 805 bytes.
Jan  9 18:27:41 eagle spamd[18433]: spamd: result: Y 26 -
DCC_CHECK,FROM_LOCAL_N
OVOWEL,INFO_TLD,SARE_URI_RAW_ONLY,URIBL_JP_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_
WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_1475
scantime=110.7,size=805,user=mail,
uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=59385,mid=<Xzug
d31hP.N12uh2tD.iK9niF@postmaster.co.uk>,autolearn=disabled
Jan  9 18:27:47 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:27:47 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:27:47 eagle spamd[18313]: prefork: select returned undef!
recovering
Jan  9 18:27:48 eagle spamd[18313]: prefork: child states: IBBBI
Jan  9 18:27:51 eagle spamd[18414]: spamd: connection from localhost
[127.0.0.1]
 at port 46555
Jan  9 18:27:53 eagle spamd[18414]: spamd: processing message
<191690-2200612100
271598@PSU.EDU> for mail:561
Jan  9 18:27:57 eagle spamd[18433]: __alarm__
Jan  9 18:27:57 eagle spamd[18433]: __alarm__
Jan  9 18:27:57 eagle spamd[18313]: prefork: child states: BBIBI
Jan  9 18:28:25 eagle spamd[18433]: spamd: connection from localhost
[127.0.0.1]
 at port 46556
Jan  9 18:28:26 eagle spamd[18433]: spamd: processing message
<001101c60f1c$66c0
f360$0200a8c0@pc> for mail:561
Jan  9 18:28:42 eagle spamd[18434]: spamd: identified spam (38.4/5.0)
for mail:5
61 in 149.0 seconds, 14763 bytes.
Jan  9 18:28:43 eagle spamd[18434]: spamd: result: Y 38 -
DCC_CHECK,DIGEST_MULTI
PLE,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_
51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RM_bpoem_InstantDL,SARE_OEM_AND_OT
HER,SARE_OEM_A_1,SARE_OEM_NEW_TITLES,SARE_OEM_PRODS_1,SARE_OEM_PRODS_FEW,SARE_OE
M_S_PRICE,SARE_OEM_UPPER_EYE,SARE_PRODS_LOTS,SARE_PRODUCTS_02,SARE_PRODUCTS_03,S
ARE_PRODUCTS_04,SARE_SUB_CHEAP_SW,URIBL_JP_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_
WS_SURBL,WLS_URI_OPT_2170
scantime=149.0,size=14763,user=mail,uid=561,required_s
core=5.0,rhost=localhost,raddr=127.0.0.1,rport=46548,mid=<000001c615a6$a3976400$
0100007f@localhost>,autolearn=disabled
Jan  9 18:28:59 eagle spamd[18464]: spamd: connection from localhost
[127.0.0.1]
 at port 46559
Jan  9 18:29:24 eagle spamd[18434]: __alarm__
Jan  9 18:29:28 eagle spamd[18434]: __alarm__
Jan  9 18:29:28 eagle spamd[18313]: prefork: child states: BBBIB
Jan  9 18:29:53 eagle spamd[18464]: spamd: processing message
<dKqBnX.squirrel@2
18.208.51.3> for mail:561
Jan  9 18:30:35 eagle spamd[18433]: spamd: identified spam (47.5/5.0)
for mail:5
61 in 130.4 seconds, 3461 bytes.
Jan  9 18:30:37 eagle spamd[18433]: spamd: result: Y 47 -
DATE_IN_PAST_06_12,DCC
_CHECK,DIGEST_MULTIPLE,DRUGS_ERECTILE,DRUG_DOSAGE,DRUG_ED_CAPS,FORGED_YAHOO_RCVD
,HTML_FONT_BIG,HTML_FONT_SIZE_LARGE,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,INFO_TLD,M
SGID_OUTLOOK_INVALID,NO_REAL_NAME,ONLINE_PHARMACY,RAZOR2_CF_RANGE_51_100,RAZOR2_
CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_SUB_AS_LOW_AS,SAR
E_SUPERVIAGRA,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SU
RBL,URIBL_WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_1744
scantime=130.4,size=3461
,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=4655
6,mid=<001101c60f1c$66c0f360$0200a8c0@pc>,autolearn=disabled
Jan  9 18:30:40 eagle spamd[18434]: spamd: connection from localhost
[127.0.0.1]
 at port 46566
Jan  9 18:30:47 eagle spamd[18313]: prefork: child states: BBIBB
Jan  9 18:30:54 eagle spamd[18434]: spamd: processing message
<j$39-27i184-$1g@f
nsdvt.mqon> for mail:561
Jan  9 18:31:07 eagle spamd[18433]: spamd: connection from localhost
[127.0.0.1]
 at port 34888
Jan  9 18:31:09 eagle spamd[18433]: spamd: processing message
<7-10168083-19L6LK
JNFKCLbi4Zru@em.prewards.com> for mail:561
Jan  9 18:31:27 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:31:28 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:31:28 eagle spamd[18313]: prefork: select returned undef!
recovering
Jan  9 18:31:51 eagle spamd[18414]: spamd: clean message (1.3/5.0) for
mail:561
in 234.4 seconds, 3145 bytes.
Jan  9 18:32:05 eagle spamd[18414]: spamd: result: .  1 -
HTML_IMAGE_ONLY_16,HTM
L_MESSAGE,MIME_BOUND_NEXTPART,MISSING_HEADERS,SARE_HTML_BODY_2SP,TO_CC_NONE
scan
time=234.4,size=3145,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=
127.0.0.1,rport=46555,mid=<191690-2200612100271598@PSU.EDU>,autolearn=disabled
Jan  9 18:32:07 eagle spamd[18414]: __alarm__
Jan  9 18:32:08 eagle spamd[18414]: __alarm__
Jan  9 18:32:08 eagle spamd[18313]: prefork: child states: IBBBB
Jan  9 18:32:08 eagle spamd[18415]: spamd: clean message (0.1/5.0) for
mail:561
in 308.9 seconds, 9149 bytes.
Jan  9 18:32:09 eagle spamd[18415]: spamd: result: .  0 -
HTML_MESSAGE,TW_TB sca
ntime=308.9,size=9149,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr
=127.0.0.1,rport=46551,mid=<001501c6157c$2a666740$8200a8c0@David>,autolearn=disa
bled
Jan  9 18:32:09 eagle spamd[18313]: prefork: child states: IIBBB
Jan  9 18:32:27 eagle spamd[18414]: spamd: connection from localhost
[127.0.0.1]
 at port 34889
Jan  9 18:32:32 eagle spamd[18414]: spamd: processing message
<158c195h.6664687@
yahoo.com> for mail:561
Jan  9 18:32:45 eagle spamd[18433]: spamd: identified spam (23.8/5.0)
for mail:5
61 in 97.9 seconds, 4857 bytes.
Jan  9 18:32:46 eagle spamd[18433]: spamd: result: Y 23 -
BAD_CREDIT,DCC_CHECK,H
TML_IMAGE_RATIO_02,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_
RANGE_E4_51_100,SARE_MSGID_DDDASH,SARE_SUB_POOR_CREDIT,URIBL_JP_SURBL,URIBL_OB_S
URBL,URIBL_SBL,URIBL_WS_SURBL,WLS_URI_OPT_14
scantime=97.9,size=4857,user=mail,u
id=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=34888,mid=<7-101
68083-19L6LKJNFKCLbi4Zru@em.prewards.com>,autolearn=disabled
Jan  9 18:32:48 eagle spamd[18415]: spamd: connection from localhost
[127.0.0.1]
 at port 34893
Jan  9 18:32:51 eagle spamd[18415]: spamd: processing message
<200601100031.k0A0
VLnG016456@smtp30.hccnet.nl> for mail:561
Jan  9 18:33:36 eagle spamd[18434]: spamd: identified spam (25.3/5.0)
for mail:5
61 in 174.2 seconds, 10711 bytes.
Jan  9 18:34:01 eagle spamd[18434]: spamd: result: Y 25 -
DCC_CHECK,FORGED_MUA_E
UDORA,FROM_HAS_MIXED_NUMS,FROM_ILLEGAL_CHARS,HEAD_ILLEGAL_CHARS,HTML_IMAGE_RATIO
_02,HTML_MESSAGE,HTTP_EXCESSIVE_ESCAPES,MIME_HTML_ONLY,MIME_HTML_ONLY_MULTI,MPAR
T_ALT_DIFF,RCVD_DOUBLE_IP_LOOSE,RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO,REPTO_QU
OTE_QUALCOMM,SARE_HTML_TITLE_1WD,SUBJ_ALL_CAPS,SUBJ_ILLEGAL_CHARS,URIBL_SBL
scan
time=174.2,size=10711,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr
=127.0.0.1,rport=46566,mid=<j$39-27i184-$1g@fnsdvt.mqon>,autolearn=disabled
Jan  9 18:34:05 eagle spamd[18433]: __alarm__
Jan  9 18:34:07 eagle spamd[18433]: __alarm__
Jan  9 18:34:07 eagle spamd[18313]: prefork: child states: BBIBB
Jan  9 18:34:08 eagle spamd[18414]: spamd: identified spam (11.2/5.0)
for mail:5
61 in 71.1 seconds, 12360 bytes.
Jan  9 18:34:08 eagle spamd[18414]: spamd: result: Y 11 -
DATE_IN_PAST_96_XX,FOR
GED_MSGID_YAHOO,HELO_DYNAMIC_COMCAST,HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_SHORT_
LINK_IMG_1,MIME_HTML_ONLY,SARE_UNF
scantime=71.1,size=12360,user=mail,uid=561,re
quired_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=34889,mid=<158c195h.66646
87@yahoo.com>,autolearn=disabled
Jan  9 18:34:08 eagle spamd[18313]: prefork: child states: BBIIB
Jan  9 18:34:08 eagle spamd[18414]: __alarm__
Jan  9 18:34:09 eagle spamd[18414]: __alarm__
Jan  9 18:34:09 eagle spamd[18313]: prefork: child states: IBIIB
Jan  9 18:34:12 eagle spamd[18313]: spamd: handled cleanup of child
pid 18434 du
e to SIGCHLD
Jan  9 18:34:13 eagle spamd[18313]: prefork: child states: IBIB
Jan  9 18:34:26 eagle spamd[18464]: child processing timeout at
/usr/local/bin/s
pamd line 1085, <GEN20> line 36.
Jan  9 18:34:28 eagle spamd[18464]: child processing timeout at
/usr/local/bin/s
pamd line 1085, <GEN20> line 36.
Jan  9 18:34:35 eagle spamd[18414]: spamd: connection from localhost
[127.0.0.1]
 at port 34897
Jan  9 18:34:37 eagle spamd[18414]: spamd: processing message
<E1Ew7Rr-0004si-00
@eagle.netwrx1.com> for mail:561
Jan  9 18:34:37 eagle spamd[18433]: spamd: connection from localhost
[127.0.0.1]
 at port 34898
Jan  9 18:34:37 eagle spamd[18433]: spamd: processing message
<dpuvat+gdo9@eGrou
ps.com> for mail:561
Jan  9 18:35:02 eagle spamd[18464]: spamd: identified spam (25.9/5.0)
for mail:5
61 in 364.8 seconds, 1867 bytes.
Jan  9 18:35:02 eagle spamd[18464]: spamd: result: Y 25 -
HTML_FONT_FACE_BAD,HTM
L_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2
_CHECK,SARE_UNG,SARE_UNQ,TW_OQ,TW_QH,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURB
L,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NOVOWEL,WLS_URI_OPT_3415
scantime=364.8,size
=1867,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport
=46559,mid=<dKqBnX.squirrel@218.208.51.3>,autolearn=disabled
Jan  9 18:35:10 eagle spamd[18415]: spamd: clean message (0.0/5.0) for
mail:561
in 143.6 seconds, 3130 bytes.
Jan  9 18:35:15 eagle spamd[18415]: spamd: result: .  0 -
UNPARSEABLE_RELAY scan
time=143.6,size=3130,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=
127.0.0.1,rport=34893,mid=<200601100031.k0A0VLnG016456@smtp30.hccnet.nl>,autolea
rn=disabled
Jan  9 18:35:34 eagle spamd[18415]: __alarm__
Jan  9 18:35:34 eagle spamd[18415]: __alarm__
Jan  9 18:35:35 eagle spamd[18415]: spamd: connection from localhost
[127.0.0.1]
 at port 34904
Jan  9 18:35:37 eagle spamd[18313]: prefork: child states: BBBB
Jan  9 18:35:42 eagle spamd[18415]: spamd: processing message
<001101c60f1c$66c0
f360$0200a8c0@pc> for mail:561
Jan  9 18:35:57 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18903
Jan  9 18:36:03 eagle spamd[18464]: __alarm__
Jan  9 18:36:04 eagle spamd[18464]: __alarm__
Jan  9 18:36:05 eagle spamd[18313]: prefork: child states: BBBBB
Jan  9 18:36:06 eagle spamd[18313]: spamd: server successfully spawned
child pro
cess, pid 18921
Jan  9 18:36:09 eagle spamd[18903]: spamd: connection from localhost
[127.0.0.1]
 at port 34905
Jan  9 18:36:18 eagle spamd[18464]: spamd: connection from localhost
[127.0.0.1]
 at port 34906
Jan  9 18:36:21 eagle spamd[18313]: prefork: child states: BBBBBI
Jan  9 18:36:21 eagle spamd[18464]: spamd: processing message
<KVKFJSEOCGRPSDDQU
FZSZP@yahoo.com> for mail:561
Jan  9 18:37:34 eagle spamd[18903]: spamd: processing message
<000001c615a7$d8f0
8180$0100007f@localhost> for mail:561
Jan  9 18:38:04 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:38:11 eagle spamd[18313]: tcp timeout at
/usr/local/lib/perl5/site_per
l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
Jan  9 18:38:11 eagle spamd[18313]: prefork: select returned undef!
recovering
Jan  9 18:38:15 eagle spamd[18433]: spamd: clean message (0.0/5.0) for
mail:561
in 210.0 seconds, 3170 bytes.
Jan  9 18:38:18 eagle spamd[18433]: spamd: result: .  0 -
scantime=210.0,size=3
170,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=3
4898,mid=<dpuvat+gdo9@eGroups.com>,autolearn=disabled
Jan  9 18:38:28 eagle spamd[18433]: __alarm__
Jan  9 18:38:32 eagle spamd[18433]: __alarm__
Jan  9 18:38:42 eagle spamd[18414]: spamd: identified spam (10.1/5.0)
for mail:5
61 in 233.4 seconds, 1581 bytes.
Jan  9 18:39:19 eagle spamd[18414]: spamd: result: Y 10 -
DCC_CHECK,MAILTO_TO_SP
AM_ADDR,MIME_BASE64_TEXT,MISSING_MIMEOLE,MSGID_FROM_MTA_ID,RCVD_NUMERIC_HELO,SUB
J_ILLEGAL_CHARS
scantime=233.4,size=1581,user=mail,uid=561,required_score=5.0,rh
ost=localhost,raddr=127.0.0.1,rport=34897,mid=<E1Ew7Rr-0004si-00@eagle.netwrx1.c
om>,autolearn=disabled
Jan  9 18:40:09 eagle spamd[18464]: spamd: identified spam (51.0/5.0)
for mail:5
61 in 143.3 seconds, 2071 bytes.
Jan  9 18:40:44 eagle spamd[18464]: spamd: result: Y 50 -
DATE_IN_FUTURE_96_XX,D
ATE_SPAMWARE_Y2K,FORGED_MUA_OUTLOOK,FORGED_OUTLOOK_HTML,FORGED_YAHOO_RCVD,FROM_I
LLEGAL_CHARS,FROM_LOCAL_NOVOWEL,HEAD_ILLEGAL_CHARS,HTML_IMAGE_ONLY_12,HTML_IMAGE
_RATIO_02,HTML_MESSAGE,HTTP_EXCESSIVE_ESCAPES,MIME_BOUND_DD_DIGITS,MIME_HTML_ONL
Y,MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,MPART_ALT_DIFF,MSGID_SPAM_CAPS,MSGID_YAHO
O_CAPS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_NUMERI
C_HELO,REPTO_QUOTE_YAHOO,SARE_BOUNDARY_07,SUBJ_ILLEGAL_CHARS,UNPARSEABLE_RELAY,U
RIBL_JP_SURBL,URIBL_WS_SURBL,WLS_URI_OPT_2599,X_PRIORITY_HIGH
scantime=143.3,siz
e=2071,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rpor
t=34906,mid=<KVKFJSEOCGRPSDDQUFZSZP@yahoo.com>,autolearn=disabled
Jan  9 18:41:07 eagle spamd[18464]: __alarm__
Jan  9 18:41:08 eagle spamd[18464]: __alarm__
Jan  9 18:41:08 eagle spamd[18414]: __alarm__
Jan  9 18:41:09 eagle spamd[18414]: __alarm__
Jan  9 18:41:09 eagle spamd[18415]: spamd: identified spam (48.5/5.0)
for mail:5
61 in 195.3 seconds, 3443 bytes.
Jan  9 18:41:38 eagle spamd[18415]: spamd: result: Y 48 -
DATE_IN_PAST_06_12,DCC
_CHECK,DIGEST_MULTIPLE,DRUGS_ERECTILE,DRUGS_PAIN,DRUGS_PAIN_EREC,DRUG_DOSAGE,DRU
G_ED_CAPS,FORGED_YAHOO_RCVD,HTML_FONT_BIG,HTML_FONT_SIZE_LARGE,HTML_MESSAGE,HTML
_TAG_EXIST_TBODY,INFO_TLD,MSGID_OUTLOOK_INVALID,NO_REAL_NAME,ONLINE_PHARMACY,RAZ
OR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_C
HECK,SARE_SUB_AS_LOW_AS,SARE_SUPERVIAGRA,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_
SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_17
44
scantime=195.3,size=3443,user=mail,uid=561,required_score=5.0,rhost=localhost
,raddr=127.0.0.1,rport=34904,mid=<001101c60f1c$66c0f360$0200a8c0@pc>,autolearn=d
isabled
Jan  9 18:41:39 eagle spamd[18903]: child processing timeout at
/usr/local/bin/s
pamd line 1085, <GEN23> line 147.
Jan  9 18:41:39 eagle spamd[18903]: child processing timeout at
/usr/local/bin/s
pamd line 1085, <GEN23> line 147.
Jan  9 18:41:39 eagle spamd[18903]: child processing timeout at
/usr/local/bin/s
pamd line 1085, <GEN23> line 147.
Jan  9 18:42:09 eagle spamd[18903]: spamd: identified spam (35.4/5.0)
for mail:5
61 in 364.8 seconds, 14725 bytes.
Jan  9 18:44:25 eagle spamd[18903]: spamd: result: Y 35 -
HTML_MESSAGE,HTML_TAG_
EXIST_TBODY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_
51_100,RAZOR2_CHECK,RM_bpoem_InstantDL,SARE_OEM_AND_OTHER,SARE_OEM_A_1,SARE_OEM_
NEW_TITLES,SARE_OEM_PRODS_1,SARE_OEM_PRODS_FEW,SARE_OEM_S_PRICE,SARE_OEM_UPPER_E
YE,SARE_PRODS_LOTS,SARE_PRODUCTS_02,SARE_PRODUCTS_03,SARE_PRODUCTS_04,URIBL_JP_S
URBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,WLS_URI_OPT_2170
scantime=364.8,siz
e=14725,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rpo
rt=34905,mid=<000001c615a7$d8f08180$0100007f@localhost>,autolearn=disabled
Jan  9 18:47:04 eagle spamd[18921]: prefork: sysread(12) failed after
300 secs a
t
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm
lin
e 501.
Jan  9 18:48:21 eagle spamd[18433]: prefork: sysread(9) failed after
300 secs at

/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm
line
 501.
Jan  9 18:48:34 eagle spamd[18903]: __alarm__
Jan  9 18:48:36 eagle spamd[18464]: prefork: sysread(11) failed after
300 secs a
t
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm
lin
e 501.
Jan  9 18:48:37 eagle spamd[18414]: prefork: sysread(7) failed after
300 secs at

/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm
line
 501.
Jan  9 18:48:38 eagle spamd[18415]: prefork: sysread(8) failed after
300 secs at

/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm
line
 501.
Jan  9 18:48:39 eagle spamd[18903]: __alarm__
Jan  9 18:53:03 eagle spamc[19415]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#1 of 3): Connection refused
Jan  9 18:53:04 eagle spamc[19415]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#2 of 3): Connection refused
Jan  9 18:53:04 eagle spamc[19420]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#1 of 3): Connection refused
Jan  9 18:53:05 eagle spamc[19415]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#3 of 3): Connection refused
Jan  9 18:53:05 eagle spamc[19420]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#2 of 3): Connection refused
Jan  9 18:53:06 eagle spamc[19415]: connection attempt to spamd
aborted after 3
retries
Jan  9 18:53:06 eagle spamc[19420]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#3 of 3): Connection refused
Jan  9 18:53:07 eagle spamc[19420]: connection attempt to spamd
aborted after 3
retries
Jan  9 18:53:08 eagle spamc[19429]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#1 of 3): Connection refused
Jan  9 18:53:09 eagle spamc[19429]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#2 of 3): Connection refused
Jan  9 18:53:09 eagle spamc[19437]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#1 of 3): Connection refused
Jan  9 18:53:10 eagle spamc[19429]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#3 of 3): Connection refused
Jan  9 18:53:10 eagle spamc[19437]: connect(AF_INET) to spamd at
127.0.0.1 faile
d, retrying (#2 of 3): Connection refused
Jan  9 18:53:11 eagle spamc[19429]: connection attempt to spamd
aborted after 3
retries
Comment 10 George R. Kasica 2006-01-11 13:32:26 UTC
Created attachment 3328 [details]
mail log traces w/ -Dprefork

strace files for bug as requested
Comment 11 Justin Mason 2006-01-11 22:55:27 UTC
those logs are a little odd.

spamd-maillog:
Jan 10 23:05:07 eagle spamd[24378]: spamd: server successfully spawned child pro
cess, pid 26130

spamd-trace2:
24378 23:04:49 fork()                   = 26130
26130 23:04:49 getpid()                 = 26130
26130 23:04:50 open("/etc/passwd", O_RDONLY) = 15


Is the maillog from a different machine, with clock skew between the two?  
otherwise I can't see how it took syslogd 17 seconds to syslog that message.

however -- It *could* be something to do with a slow/remote syslogd slowing up
logging, or extremely heavy load.  I notice this in spamd-trace2...

26130 23:04:50 write(4, "<23>spamd[26130]: prefork: sysre"..., 69 <unfinished ...>
26130 23:04:52 <... write resumed> )    = 69

in other words it took 2 seconds just for the syslogd write() call to complete.
 that's a *long* time for logging!

actually, looking further into the strace it looks increasingly like load.
e.g.


26130 23:05:04 accept(5, {sin_family=AF_INET, sin_port=htons(45076), sin_addr=inet_a
ddr("127.0.0.1")}}, [16]) = 13

26130 23:05:04 write(14, "B\0\0f\22\n", 6) = 6
26130 23:05:07 read(13, " 1 Office Pro 2003 \n  2 Adobe Ph"..., 1024) = 1024
26130 23:05:17 read(13, "height=3D23></A> <BR> <BR> <B>Av"..., 1024) = 556
26130 23:05:17 read(13, "height=3D23></A> <BR> <BR> <B>Av"..., 1024) = 556
26130 23:06:25 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0

now: >1 minute without performing system calls? that's a busy CPU working on
parsing. in general, that scan process carries on as normal, but very slowly;
then:

26130 23:10:07 --- SIGALRM (Alarm clock) ---


So it really is timing out, simply due to extremely high CPU load, it
looks like.

I would look at why it takes 2 seconds to log a syslog message; that's bad.

--j.
Comment 12 George R. Kasica 2006-01-12 01:06:45 UTC
The logs are off the same machine here. 

Without SA 3.1 running system load is usually < 4 and usually under 1. However 
running spamd 3.1 will drive it up to near 10 or more. I came home to see it 
running at 84 (that's not a typo) and once I killed SA we're back to 2.7 in a 
matter of 5 minutes. SA 3.0.4 did not behave this way.

I haven't changed anything in the way SA is configured and at first when you 
start it it will behave normally but over time the CPU load will climb until 
the program fails and is no longer running.

I'm not a programmer here so I don't know what more to add, I'd be happy to try 
or test anything you like given clear instructions.
Comment 13 Justin Mason 2006-01-13 20:27:37 UTC
from the mailing list:

'Things are looking back to normal here with 4.60 and 3.1 now after
some help by Larry Rosenman <ler@lerctr.org>, turns out that one of
the SARE rules (I'm thinking one of the blacklist URI ones about 13
meg in size (sorry I don't recall the name of it at the moment) may
have been causing our grief along with not being on Perl 5.8.7 which
we also did here. After cleaning up those 2 items it is looking good
so far.'

IOW, George's case was just a load issue caused by the sa-blacklist and/or
sa-blacklist-uri rulesets, not a bug.
Comment 14 Mike 2006-01-19 16:39:57 UTC
(In reply to comment #13)
> from the mailing list:
> 
> 'Things are looking back to normal here with 4.60 and 3.1 now after
> some help by Larry Rosenman <ler@lerctr.org>, turns out that one of
> the SARE rules (I'm thinking one of the blacklist URI ones about 13
> meg in size (sorry I don't recall the name of it at the moment) may
> have been causing our grief along with not being on Perl 5.8.7 which
> we also did here. After cleaning up those 2 items it is looking good
> so far.'
> 
> IOW, George's case was just a load issue caused by the sa-blacklist and/or
> sa-blacklist-uri rulesets, not a bug.


So is there a solution to this problem?
Comment 15 Mike 2006-01-19 16:40:50 UTC
Isn't this a duplicate of Bug #4563 ?
Comment 16 Justin Mason 2006-01-19 18:19:50 UTC
'So is there a solution to this problem?'

George's bug -- yes, get rid of large rulesets.  This, I think, is the problem
for people who are seeing it happen at all times of day.

Greg's bug -- not necessarily, since they haven't commented if this fixes their
problems too, and also there's the thing that they're only seeing it at a
certain time of day.
Comment 17 Mike 2006-01-23 19:00:59 UTC
I have eliminated all occurances of this error by upgrading the memory on my
mailserver.  Is there any way that the occurance of this is directly related to
the amount of available memory the system has?  I upgraded the RAM (previous RAM
all tests out as good RAM), and now I'm not having this error any more.  I'm
curious as to the amount of available memory that people have, who are also
experiencing this problem.
Comment 18 Justin Mason 2006-01-23 19:24:14 UTC
'I have eliminated all occurances of this error by upgrading the memory on my
mailserver.'

note that it *is* load-related, and if your spamd processes are hitting swap,
that'll greatly increase load and scan-times.  so yes, it'd correlate with
memory use.
Comment 19 Justin Mason 2006-01-25 04:17:27 UTC
I've run into this on my own server, btw, and it was *definitely* load-related
there.  It doesn't take a lot of load to cause it, either.

What happens is, the select call uses a 2-second timeout.

Now, this alarm in question is intended to "sanity-check" that select call and
catch the case where it ignores that timeout.  It does this by doubling the
select's timeout, and if this second (backup) timeout is hit, it complains,
because select() should have timed out itself, instead of the backup timeout
being hit:

  # use alarm to back up select()'s built-in alarm, to debug theo's bug
  eval {
    Mail::SpamAssassin::Util::trap_sigalrm_fully(sub { die "pf select timeout"; });
    alarm ($tout*2) if ($tout);
    ($nfound, $timeleft) = select($rout=$rin, undef, $eout=$rin, $tout);
  };
  alarm 0;


Unfortunately, double 2 is only 4, which really isn't very big -- it's pretty
easy for a sudden load spike to stop a single process from executing two
consecutive lines of a perl script within 2 seconds, especially since there's an
eval {} context switch between the two.
Comment 20 Justin Mason 2006-01-25 04:23:11 UTC
Created attachment 3343 [details]
possible fix

here's a possible fix; I've applied it to SVN trunk anyway, as r372085, and
will be running it on my server to see if it helps.

(note to self: I applied this at 
: [no view] exit=0 Wed Jan 25 03:22:37 GMT 2006
last timeout log was at 'Jan 25 03:06:54 dogma spamd[27797]'.)
Comment 21 Justin Mason 2006-01-25 22:34:16 UTC
seems to have cleared it up here.  I suggest anyone else seeing this give that
patch a try....
Comment 22 Daryl C. W. O'Shea 2006-01-27 03:47:08 UTC
I, too, had this happen this morning while moving 10 GB across IDE drives.

Patch looks reasonable to me.  Moving to 3.1.1 queue.
Comment 23 Justin Mason 2006-01-28 01:07:29 UTC
nah, this didn't fix it -- it showed up again today under *massive* load (like
load average > 180, caused by other procs on the system).

it's the right general approach though, so I'm tidying it up and looking into
some other spurious "errors" that also showed up during that incident.
Comment 24 Justin Mason 2006-01-28 02:30:59 UTC
Created attachment 3348 [details]
better fix

ok, this is a better fix.  That code:

(a) could leak alarm die() noise (a la bug 4631);
(b) the extra few seconds were still not enough;
(c) it wasn't protecting itself against any other active timeouts, so a
timeout-within-a-timeout wasn't going to work;
(d) it should have been stopping the timeout timer before the eval { } scope
was exited, instead of after;
(e) it should have been treating an alarm timeout event as equivalent to the
select() timeout, instead of issuing warnings, since under heavy load, the
alarm() timeout may be noticed first.

So, all in all, (b) and (e) were the big issues that causes this bug.	All the
others, (a), (c), and (d), are ones that we have fixed in other places (e.g.
cf. bug 4631), but not yet here.

BTW obviously that's not good.	I plan to abstract all these timeout timers
into a new class in svn trunk, I'm sick of fixing the same bugs again and again
due to duplication :(
Comment 25 Daryl C. W. O'Shea 2006-01-28 22:16:53 UTC
I was going to ask if 10 seconds was long enough.  I bet it wouldn't have when
cyan hit a load average of 980 two weeks ago. ;)


Are the leaked messages errors or warns?
Comment 26 Justin Mason 2006-01-30 19:01:51 UTC
the leaked messages are those __alarm__ warns; ugly and misleading.
Comment 27 Justin Mason 2006-01-31 03:20:43 UTC
*** Bug 4563 has been marked as a duplicate of this bug. ***
Comment 28 Justin Mason 2006-01-31 03:23:52 UTC
the "prefork: select returned undef! recovering" messages are also spurious when
they appear directly after "spamd: handled cleanup of child pid 13101 due to
SIGCHLD".

I have a fix for this, but it's tied up with a bigger change, so not so easy to
patch now...
Comment 29 Chris Thielen 2006-01-31 04:43:00 UTC
I'm coming over from bug 4563

Just wanted to report that since I have applied the first patch I haven't seen any timeouts in my log.  I have 
since applied the second patch.
Comment 30 Dallas Engelken 2006-01-31 05:24:29 UTC
err, and nobody agreed with me on comment #3  ;)

i've applied the patch and the issue has disappeared here.
Comment 31 Dallas Engelken 2006-01-31 05:27:10 UTC
oops, let me clarify.  i havent had the problem outlined in bug 4693 since
applying this patch..  FWIW.
Comment 32 Justin Mason 2006-02-01 02:54:51 UTC
Created attachment 3352 [details]
further fixing

ok, here's the big change.   This one adds a new Timeout class to portably and
cleanly encapsulate SIGALRM/POSIX messiness throughout the code; capture and
inhibit a spurious "select returned -1" error when spamd children exit; fix an
incorrect comment about write retries in SpamdPreforkScaling.

it's in trunk as r373962.
Comment 33 Justin Mason 2006-02-02 01:50:14 UTC
Created attachment 3353 [details]
silly further fix

another one, on top of 3348 and 3352.	This fixes a minor bug where alarms
were being treated as normal die() failures instead of timeouts due to a typo.

So far I haven't seen any "tcp timeout", "alarm failed", or "select failed"
warnings; they've cleared right up on my dogfood server.   Strongly recommend
trying this out...
Comment 34 Tom Schulz 2006-02-02 21:12:16 UTC
For anyone applying the last 2 patches to 3.1.
You need to have the patches from bug 4651 installed (so that they can be
removed by this patch :-)
Hunk #2 of the MANIFEST patch will fail.  Don't worry about that.
I had to do a 'make clean', a 'perl Makefile.PL' & a 'make' to get the new
Timeout.pm to be placed into the blib tree.
The patch to Conf.pm seems to be a strange thing to include here.
Comment 35 Tom Schulz 2006-02-02 21:17:39 UTC
> You need to have the patches from bug 4651 installed...

Make that bug 4631.
Comment 36 Daryl C. W. O'Shea 2006-02-03 00:59:40 UTC
A few things:

1) I'm pretty sure that this patch just masks any problem that is occurring in
bug 4693.  It doesn't actually fix it.

2) Justin's commits have said "workaround for spurious 'tcp timeout' warnings".
 This bug says (comment #0) that spamd actually stops responding to connections.
 It's not just spurious warnings like reported in bug 4563 (marked as
duplicate).  So I'd say this fixes bug 4563, which isn't actually a duplicate
(if the warnings Justin, Tom and Chris are seeing are in fact just spurious) and
that this bug (bug 4696) isn't yet fixed (unless anyone has seen otherwise --
GREG ARE YOU OUT THERE?!).

3) If this is all working for folks, is there a mininmal patch for 3.1.1?  With
the difficulty in getting votes for 10 line patches, I can just imagine the
support for adding a new (Timeout) class to the 3.1 branch.
Comment 37 Justin Mason 2006-02-03 01:55:00 UTC
Any real problems are being obscured -- quite a lot -- by this error message. 
So far, this has been the main place people are talking about this error, so
there probably are indeed multiple failure cases in this bug ID.   This patch
fixes *some* of them ;)

also I'm not really expecting anyone to actually *vote* on the patch for 3.1.1,
given recent voting rates. ;)   But people who are running servers seem happy
enough to have a patch that can be applied (with a little work) to clean up the
errors.
Comment 38 Daryl C. W. O'Shea 2006-02-03 02:34:31 UTC
Yeah, I'm not worried about getting a fix for "spurious warnings" into 3.1.

On the other hand, fixes for actual spamd *failures* need to go into 3.1.1. 
After 4+ months of telling people to use --round-robin, it'd be nice to get a
release out that fixes at least some of the problems.  I'm just trying to
identify what has been fixed by the patches in this bug, and what's necessary to
get the needed fixes into 3.1 without too much voter apathy.


GREG, SEB, TUOMAS: Do Justin's patches stop spamd from dying on you?
Comment 39 Justin Mason 2006-02-03 03:05:46 UTC
'After 4+ months of telling people to use --round-robin, it'd be nice to get a
release out that fixes at least some of the problems.  I'm just trying to
identify what has been fixed by the patches in this bug, and what's necessary to
get the needed fixes into 3.1 without too much voter apathy.'

agreed.

FWIW, I'm finding that much of the apparent failure cases are spurious, and
caused by these warnings making noise about non-error conditions; getting those
out of the way is helping narrow down the *real* bugs.
Comment 40 Tom Schulz 2006-02-03 20:59:34 UTC
I wonder if the cause of spamd hanging is that when there is a timeout or
communication is lost with a child, the problem child is not properly
disposed of.  It looks like patch 3352 does include a change to child
cleanup.  Also, making the timeout code more regular should help in
analyzing the related child cleanup code.  The patch is not really that
big (part of it is the same change in 5 files).  It should be considered
for 3.1.1.  It would also be good if people would try it out and see how
many problems it fixes.
Comment 41 Justin Mason 2006-02-10 20:06:02 UTC
Created attachment 3365 [details]
aggregated patch against b3_1_0 svn branch

This new patch aggregates patches 3343, 3348, 3352, and 3353; it's a lot easier
to apply, and is against the current code on SVN branch b3_1_0.  there are no
functional changes over those patches.

Important: if you're rebuilding with this patch applied, you need to run "make
clean", re-run "perl Makefile.PL", "make", after applying the patch, since it
adds a new file.
Comment 42 Justin Mason 2006-02-10 20:28:32 UTC
Created attachment 3366 [details]
aggregated patch against b3_1_0 svn branch take 2

Minor redo of patch 3365; that "TODO" comment in Conf.pm slipped in, removing.
Comment 43 Justin Mason 2006-02-10 20:30:24 UTC
Created attachment 3367 [details]
aggregated patch, against the 3.1.0 release tarball

This is the aggregated patch, against the 3.1.0 tarball, for people running
into issues with 3.1.0 who want to try this out.

(aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and
b3_1_0 are major by now.)
Comment 44 Justin Mason 2006-02-11 01:55:53 UTC
damn diff/patch.  in the agg patch against 3.1.0, Timeout.pm is left in the
top-level directory, instead of its correct location.  In addition, there was a
bonus line of code that shouldn't have been there, in Pyzor.pm.  new patch on
the way...
Comment 45 Justin Mason 2006-02-11 02:08:21 UTC
Created attachment 3368 [details]
aggregated patch, against the 3.1.0 release tarball, take 2

redo, again
Comment 46 Tom Schulz 2006-02-13 16:32:13 UTC
The following mail to the dev list from Daryl did not make it in as a comment.

>> ------- Additional Comments From jm@jmason.org  2006-02-10 20:30 -------
>>
>> (aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and
>> b3_1_0 are major by now.)
>
>++1  We're approaching 5 months since the 3.1.0 release.
>
>It'd be nice if we had some feedback on whether the patch in bug 4696 
>was resolving people's problems with spamd.  I'd imagine it is, but it'd 
>be nice to know for sure.
Comment 47 Daryl C. W. O'Shea 2006-02-14 22:57:49 UTC
Tom, any problems since applying the latest patch?  You seem to be the only one
left alive...
Comment 48 Ryan Martin 2006-02-14 23:16:15 UTC
(In reply to comment #46)
> The following mail to the dev list from Daryl did not make it in as a comment.
> 
> >> ------- Additional Comments From jm@jmason.org  2006-02-10 20:30 -------
> >>
> >> (aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and
> >> b3_1_0 are major by now.)
> >
> >++1  We're approaching 5 months since the 3.1.0 release.
> >
> >It'd be nice if we had some feedback on whether the patch in bug 4696 
> >was resolving people's problems with spamd.  I'd imagine it is, but it'd 
> >be nice to know for sure.

<delurk>
I've been having this problem in 3.1.0 (tarball), starting with tcp timeouts in
SpamdForkScaling.pm, then "prefork: select returned undef! recovering", and
finally "prefork: syswrite(7) failed, retrying... at /usr
/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 554."
which would totally paralyze spamd and require a restart. I wasn't able to get
the first set of patches to work, but the aggregate against 3.1.0 (tarball)
worked fine. spamd has been running clean all day with the new patch in place.
Thanks for the patch!
Comment 49 Tom Schulz 2006-02-15 22:58:59 UTC
I only have these problems when I go on vacation, so I won't be much help
in evaluating how much this patch helps in the short term.  I am hoping that
there will be a final fix for this and the other similar problems before I
go out of town again.
A couple of things that I noticed.  The patch from bug 4590 is not included
here, but the fix for the bad comment from that patch is included here.
The 'silly further fix' (attachment id-3353) was not included.
Comment 50 Justin Mason 2006-02-16 16:52:15 UTC
'A couple of things that I noticed.  The patch from bug 4590 is not included
here, but the fix for the bad comment from that patch is included here.
The 'silly further fix' (attachment id-3353) was not included.'

you're right on both counts.  I need to combine the bug 4590 patch in here,
and include the "silly further fix", in a new patch.
Comment 51 Tim 2006-02-16 23:44:15 UTC

Is this the same problem I am having (see sample of log below) my sa is version
3.1.0. I have SA sat on a desktop PC and working in conjuction with kmail on
e-mails from ISP's It has been setup like this for 18 months without a problem.

Sorry if I am doing this wrong but it is the first time I have submitted a bug
(or something that maybe a bug)

Tim

Feb 13 10:34:25 localhost spamd[3989]: Can't locate Mail/SPF/Query.pm in @INC
(@INC contains: ../lib /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.7
/usr/local/share/perl/5.8.7 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8
/usr/local/lib/site_perl) at /usr/share/perl5/Mail/SpamAssassin/Plugin/SPF.pm
line 272, <GEN304> line 63. 
Feb 13 10:34:25 localhost spamd[3989]: Can't locate Mail/SPF/Query.pm in @INC
(@INC contains: ../lib /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.7
/usr/local/share/perl/5.8.7 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8
/usr/local/lib/site_perl) at /usr/share/perl5/Mail/SpamAssassin/Plugin/SPF.pm
line 272, <GEN304> line 63. 
Feb 13 10:34:26 localhost spamd[3989]: spamd: identified spam (26.8/5.0) for
mit:1000 in 1.6 seconds, 2886 bytes. 
Feb 13 10:34:26 localhost spamd[3989]: spamd: result: Y 26 -
BAYES_99,DATE_IN_FUTURE_12_24,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_BOGUSMX,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,MIME_BASE64_BLANKS,MIME_BASE64_NO_NAME,MIME_BASE64_TEXT,RATWARE_NAME_ID,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_WHOIS_INVALID,RCVD_IN_XBL
scantime=1.6,size=2886,user=mit,uid=1000,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=36224,mid=<EDBPOMNMAIMGDMGGNNKFDIBLCNAB.edmondfink_xo@ceril.cl>,bayes=1,autolearn=spam

Feb 13 10:34:27 localhost spamd[3968]: prefork: child states: II 
Feb 13 13:55:02 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:55:16 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:55:44 localhost spamd[3968]: prefork: select returned undef! recovering 
Feb 13 13:56:39 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:56:43 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:57:50 localhost last message repeated 2 times
Feb 13 13:57:57 localhost spamd[3968]: prefork: select returned undef! recovering 
Feb 13 13:58:38 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:59:14 localhost spamd[3968]: tcp timeout at
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. 
Feb 13 13:59:38 localhost last message repeated 2 times
Feb 13 13:59:42 localhost spamd[3968]: panic: leave_scope inconsistency. 
Feb 13 15:05:27 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#1 of 3): Connection refused
Feb 13 15:05:28 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#2 of 3): Connection refused
Feb 13 15:05:29 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1
failed, retrying (#3 of 3): Connection refused
Feb 13 15:05:30 localhost spamc[23543]: connection attempt to spamd aborted
after 3 retries
Comment 52 Gambit 2006-02-17 03:27:22 UTC
I manage quite a few RHES3 boxes, running SA. The load on the boxes are quite 
high.

I was having problems described in bug 4696 and 4590. After turning on "round-
robin" I was able to avoid the problems, but still every once in a while SA 
would crash and stop doing anything with logs like:

@4000000043f03e18333f0e84 [12505] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000043f03e18333f6c44 [12505] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN841> line 204.
@4000000043f03e18333fb67c [12505] warn: )
@4000000043f03e3305fb22b4 [12207] warn: plugin: eval failed: child processing 
timeout at /usr/bin/spamd line 1088.
@4000000043f03e5113e187ec [12423] warn: rules: failed to run PYZOR_CHECK test, 
skipping:
@4000000043f03e5113e1e994 [12423] warn: _(child processing timeout 
at /usr/bin/spamd line 1088.
@4000000043f03e5113e233cc [12423] warn: )
@4000000043f03e5730b9abc4 [13178] warn: rules: failed to run DCC_CHECK test, 
skipping:
@4000000043f03e5730ba1154 [13178] warn: _(child processing timeout 
at /usr/bin/spamd line 1088.
@4000000043f03e5730ba57a4 [13178] warn: )
@4000000043f03e5818ba6be4 [12897] warn: rules: failed to run PYZOR_CHECK test, 
skipping:
@4000000043f03e5818bacd8c [12897] warn: _(child processing timeout 
at /usr/bin/spamd line 1088.
@4000000043f03e5818bb13dc [12897] warn: )
@4000000043f03f74331b2334 [30053] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000043f03f74331b271c [30053] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN598> line 68.
@4000000043f03f74331b2eec [30053] warn: )
@4000000043f03f7826560894 [13980] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000043f03f7826561064 [13980] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN862> line 335.
@4000000043f03f7826561c1c [13980] warn: )

After applying this patch, I got rid of "round-robin" option. Everything is 
running fine since then (would be almost 2 whole days now). 

I often see messages like below:

@4000000043f532c62bad4064 [9205] warn: pyzor: check failed: __alarm___ignore__
@4000000043f532eb106eb7ec [19553] warn: razor2: razor2 check failed:  
__alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Ma
il/SpamAssassin/Plugin/Razor2.pm line 316.
@4000000043f533783b6905c4 [9205] warn: razor2: razor2 check failed:  
__alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Mai
l/SpamAssassin/Plugin/Razor2.pm line 316.
@4000000043f533a9325fce2c [9205] warn: pyzor: check failed: internal error
@4000000043f533b50382cbcc [9205] warn: razor2: razor2 check failed:  
__alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Mai
l/SpamAssassin/Plugin/Razor2.pm line 316.

But SA keeps running and keeps doing its job.
Comment 53 Justin Mason 2006-02-21 18:35:44 UTC
Created attachment 3380 [details]
aggregated patch against b3_1_0 svn branch take 3

here's a new take on patch 3366; it includes the "silly fix" and the fix for
bug 4590.
Comment 54 Justin Mason 2006-02-21 18:37:36 UTC
Created attachment 3381 [details]
aggregated patch against the 3.1.0 release tarball take 3

and here's the redo of patch 3368.  Again, this is the same code with the fix
for bug 4590 and the "silly fix" merged in.

If you're seeing SpamdForkScaling warnings or errors, we STRONGLY recommend
retrying with svn trunk, or one of these two patches -- probably this one.
Comment 55 Justin Mason 2006-02-21 18:44:53 UTC
ps: Gambit, that patch update would remove those noisy warnings.  but they
should be pretty harmless anyway, so it's really optional.
Comment 56 Gambit 2006-02-22 03:48:11 UTC
Couple of my boxes were still having problems reported in bug 4590. I have just 
rolled out the patch (id=3381) for the servers. Everything seems ok as of yet, 
the warnings have disappeared.

Comment 57 Justin Mason 2006-02-23 13:23:04 UTC
fwiw, with this and bug 4791, I see no warnings during daily spamd use with svn
trunk.  everything's working 100% on my colo, which is nice.
Comment 58 Gambit 2006-02-28 02:28:48 UTC
On the 7th day (after applying the patch id=3381), spamd on one of my servers 
just crashed again.

The output from psaux | grep spamd after crash:
root      2021  0.0  0.0  1444  284 ?        S    Feb24   0:00 supervise spamd
qmaill    2029  0.0  0.0  1592  344 ?        S    Feb24   
0:20 /usr/local/bin/multilog t s20000000 n30 /var/log/spamd
root      2036  0.0  1.3 32096 28800 ?       S    Feb24   1:54 /usr/bin/perl -
T -w /usr/bin/spamd -q -v -x -u vpopmail -s std
err --max-children=16 --min-spare=4 --max-spare=8 --max-conn-per-child=100 --
timeout-child=28
vpopmail 22999  0.0  1.8 42320 37388 ?       S    09:18   0:05 spamd child

vpopmail 27055  0.0  1.4 34900 29540 ?       S    09:25   0:00 spamd child

vpopmail 19934  0.1  1.6 38780 33904 ?       S    10:06   0:12 spamd child

vpopmail 23915  0.0  1.4 35032 29756 ?       S    10:12   0:00 spamd child

vpopmail 25658  0.0  1.4 35308 29996 ?       S    10:15   0:00 spamd child

vpopmail 30469  0.0  1.4 34904 29544 ?       S    10:24   0:00 spamd child

vpopmail  8372  0.4  1.7 41616 36712 ?       S    10:40   0:22 spamd child

vpopmail  9460  0.1  1.6 38176 33240 ?       S    10:42   0:05 spamd child

root     11404  0.0  0.0  3936  572 pts/2    S    12:06   0:00 grep spamd

Log says:
0000044038f8e0eddc74c [9460] info: spamd: result: . -2 - 
BAYES_00,FORGED_RCVD_HELO 
scantime=2.4,size=5393,user=andy@FFF.com,uid=89,required_score=7.5,rhost=localho
st.localdomain,raddr=127.0.0.1,rport=20815,mid=<44038F5E.3020709@ld
FGD.org>,bayes=0,autolearn=ham
@4000000044038f8e1192aa34 [2036] info: prefork: child states: IIIIIIII
@4000000044038f900ba38f94 [8372] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 20818
@4000000044038f900c8a9434 [8372] info: spamd: processing message <E3F38E13-C70A-
4805-AD03-334EE21F0CB8@VVV.com> for andrewg@BBBB.com:89
@4000000044038f9226cb01e4 [8372] info: spamd: clean message (-2.5/7.5) for 
andrewg@BBBB.com:89 in 2.5 seconds, 11207 by
tes.
@4000000044038f9226cb1d3c [8372] info: spamd: result: . -2 - 
AWL,BAYES_00,HTML_MESSAGE 
scantime=2.5,size=11207,user=andrewg@AAAA.com,uid=89,required_score=7.5,rhost=lo
calhost.localdomain,raddr=127.0.0.1,rport=20818,mid=<E3F38E13-C70A-4805-AD03-
334EE21F0CB8@VVV.com>,bayes=0,autolearn=ham
@4000000044038f922a12ca44 [2036] info: prefork: child states: IIIIIIII
0000044038fa53a2da764 [23915] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 20826
@4000000044038fa53afe64e4 [23915] info: spamd: processing message 
<20060227234741.HYGM15112.CFD.com@omta03ps>
 for lisa@YYYY.com:89
@4000000044038fa931c186a4 [25658] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 20828
@4000000044038fa932c9e654 [25658] info: spamd: processing message <002101c63bf8
$325c5d40$0300a8c0@KIDZBODZ1> for jackie@YYYY.com:89
@4000000044038fac026c819c [27055] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 20829
@4000000044038fac03211814 [27055] info: spamd: processing message (unknown) for 
wobbz@XXX.com:89
@4000000044038fac24a3e7dc [30469] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 20830
@4000000044038fac255ae39c [30469] info: spamd: processing message (unknown) for 
fritz@XXX.com:89
@4000000044038fb530b926f4 [8372] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3889> line 118.
@4000000044038fb530bac8ec [8372] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3889> line 118.
@4000000044038fb530bc91f4 [8372] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fb530bcf784 [8372] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3889> line 118.
@4000000044038fb530bd41bc [8372] warn: )
@4000000044038fb81da5d9f4 [9460] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3692> line 130.
@4000000044038fb81da76094 [9460] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3692> line 130.
@4000000044038fb81da91de4 [9460] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fb81da97f8c [9460] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3692> line 130.
@4000000044038fb81da9c9c4 [9460] warn: )
@4000000044038fba0cb83bb4 [19934] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3750> line 118.
@4000000044038fba0cb9c63c [19934] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3750> line 118.
@4000000044038fba0cbbcdc4 [19934] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fba0cbc2f6c [19934] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3750> line 118.
@4000000044038fba0cbc79a4 [19934] warn: )
@4000000044038fbe11b94504 [22999] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3627> line 77.
@4000000044038fbe11bacba4 [22999] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3627> line 77.
@4000000044038fbe11bccb5c [22999] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fbe11bd30ec [22999] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3627> line 77.
@4000000044038fbe11bd7b24 [22999] warn: )
@4000000044038fc13a7dfdcc [23915] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3625> line 67.
@4000000044038fc13a7f6144 [23915] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3625> line 67.
@4000000044038fc13a81592c [23915] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fc13a81bad4 [23915] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3625> line 67.
@4000000044038fc13a82050c [23915] warn: )
@4000000044038fc531e8255c [25658] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3628> line 294.
@4000000044038fc531e984ec [25658] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3628> line 294.
@4000000044038fc531eb80bc [25658] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fc531ebea34 [25658] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3628> line 294.
@4000000044038fc531ec346c [25658] warn: )
@4000000044038fc802b438fc [27055] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3598> line 36.
@4000000044038fc802b5429c [27055] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3598> line 36.
@4000000044038fc802b72ecc [27055] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fc802b77904 [27055] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3598> line 36.
@4000000044038fc802b7bf54 [27055] warn: )
@4000000044038fc824caa9bc [30469] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3631> line 36.
@4000000044038fc824cbedf4 [30469] error: child processing timeout 
at /usr/bin/spamd line 1088, <GEN3631> line 36.
@4000000044038fc824cda75c [30469] warn: rules: failed to run 
RAZOR2_CF_RANGE_E8_51_100 test, skipping:
@4000000044038fc824cdf964 [30469] warn: _(child processing timeout 
at /usr/bin/spamd line 1088, <GEN3631> line 36.
@4000000044038fc824ce2844 [30469] warn: )


After a spamd restart, it came back on with following message:
@400000004403a2410bb6023c [2036] info: spamd: server killed by SIGTERM, 
shutting down
@400000004403a241118fd3a4 [2036] info: spamd: server killed by SIGTERM, 
shutting down
@400000004403a241118fdb74 [2036] error: Can't call method "close" on an 
undefined value at /usr/bin/spamd line 1998 during gl
obal destruction.
@400000004403a241118fe344 [2036] error: Can't call method "close" on an 
undefined value at /usr/bin/spamd line 1998 during gl
obal destruction.
@400000004403a2432317b22c [12374] info: spamd: server started on port 783/tcp 
(running version 3.1.0)
@400000004403a243231a55f4 [12374] info: spamd: server pid: 12374
@400000004403a243239e280c [12374] info: spamd: server successfully spawned 
child process, pid 12442
@400000004403a2432414ea64 [12374] info: spamd: server successfully spawned 
child process, pid 12446
@400000004403a243249ab84c [12374] info: spamd: server successfully spawned 
child process, pid 12447
@400000004403a243250e4e24 [12374] info: spamd: server successfully spawned 
child process, pid 12449
@400000004403a243258723bc [12374] info: spamd: server successfully spawned 
child process, pid 12450
@400000004403a2432606a844 [12374] info: spamd: server successfully spawned 
child process, pid 12453
@400000004403a24326801634 [12374] info: spamd: server successfully spawned 
child process, pid 12456
@400000004403a24326f3c37c [12374] info: spamd: server successfully spawned 
child process, pid 12457
@400000004403a2432744f4a4 [12442] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 48138
@400000004403a24327540fd4 [12374] info: prefork: child states: BIIIIIIS
@400000004403a2432789930c [12446] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 48090
@400000004403a243287a9644 [12447] info: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 48139

This is a failry busy server, (4.5K messages per hour during a daily peak). But 
please let me know if any more info in needed for debugging.
Comment 59 Daryl C. W. O'Shea 2006-02-28 03:30:35 UTC
+1, I've had all these patches in production for a while.
Comment 60 Justin Mason 2006-02-28 10:42:37 UTC
Gambit: looks a lot like Razor was timing out, there.  Unfortunately the patch
cannot fix Razor if its servers start timing out. ;)

What were the effects of the crash?  Spamd and the child processes were still
running; were they refusing to process mail?  or processing slowly?  (the
latter would be normal if Razor was taking 30 seconds before timing out on each
mail.)  Were all mails being marked as ham but still delivered?

Can you decode the timestamps on those log lines?

Also, it's better to attach long log extracts as text attachments, instead of
posting them in-line.

Comment 61 Gambit 2006-03-02 00:39:39 UTC
Created attachment 3395 [details]
log
Comment 62 Gambit 2006-03-02 00:42:13 UTC
Hi,

During this time, spamd and all the children were all running (as the PS aux 
output from previous post), but they stopped processing mails. My spamc runs 
with a timeout of 30 seconds, so all the mails were passed on for further 
processing after that timeout.

Thanks.
Comment 63 Gambit 2006-03-02 00:44:00 UTC
Hi,

During this time, spamd and all the children were all running (as the PS aux 
output from previous post), but they stopped processing mails. My spamc runs 
with a timeout of 30 seconds, so all the mails were passed on for further 
processing after that timeout.

Thanks.
Comment 64 Justin Mason 2006-03-02 18:14:54 UTC
Gambit: that looks like some kind of separate bug, where a timeout is "bleeding
over" into another alarm handler; not related to this bug (althought presumably
you have this patch applied).  interesting.
Comment 65 Gambit 2006-03-03 01:57:13 UTC
Created attachment 3396 [details]
more logs
Comment 66 Gambit 2006-03-03 02:00:58 UTC
This problem again was shown on another server (even though I am restarting 
spamd twice daily from cron). I am using patch 3396 on top of fresh 3.1 source. 
The boxes are all RHES3.
Comment 67 Gambit 2006-03-03 02:02:58 UTC
Sorry, the last post was meant to state "patch 3381 on top of fresh 3.1 source" 
instead of "patch 3396 on top of fresh 3.1 source".
Comment 68 Justin Mason 2006-03-07 15:16:16 UTC
I'd appreciate if the "bleeding alarms" issue can be opened as a new BZ; this
one's getting unwieldy and will be hard to follow if another issue is thrown in
here.
Comment 69 Daryl C. W. O'Shea 2006-03-08 08:27:47 UTC
I'd open another bug, but it'd just say that you forgot to run "svn add" for
Timeout.pm for attachment 3380 [details].

I started with a clean copy of 3.1 branch and patched it with 3380, but didn't
remove the existing installed copy.  Since Timeout.pm is missing from 3380 I'm
still using the inital Timeout.pm that is missing the triple underscore fix from
r374162, which fixed the alarm logging issue.


So... I revoke my +1 for 3380 and vote +1 for an identical patch with the
addition of Timeout.pm as it is in r374162.  :)
Comment 70 Theo Van Dinter 2006-03-08 19:08:35 UTC
+1

I've only done a small amount of testing, but I don't see the problem as described so ...

btw, I don't think we need a new patch, let's just agree to do:

svn cp https://svn.apache.org/repos/asf/spamassassin/trunk/lib/Mail/SpamAssassin/Timeout.pm lib/
Mail/SpamAssassin/Timeout.pm

after the patch is applied and before the commit.
Comment 71 Sidney Markowitz 2006-03-09 19:54:36 UTC
+1 on patch 3380 plus svn add of lib/Mail/SpamAssassin/Timeout.pm from trunk

And in the interest of getting things cleaned up for a 3.1.1 release:

Committed to 3.1 branch revision 384590.