SA Bugzilla – Bug 4696
[review] tcp timeout -> SpamdForkScaling.pm line 195
Last modified: 2006-03-09 11:54:36 UTC
Excuse my bad english, I am French... I upgrade SA to 3.1.0 and now spamd died... I search the same bug everywhere and I don't find it. I write a sh script to restart automaticaly spamd, but it's not a good solution... This is the seven line of log : Nov 21 05:35:30 ns2299 spamd[1153]: prefork: child states: II Nov 21 05:35:53 ns2299 spamd[1153]: tcp timeout at /usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Nov 21 05:35:55 ns2299 spamd[1153]: tcp timeout at /usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Nov 21 05:35:55 ns2299 spamd[1153]: prefork: select returned undef! recovering Nov 21 05:37:37 ns2299 spamd[1153]: tcp timeout at /usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Nov 21 05:37:41 ns2299 spamd[1153]: tcp timeout at /usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Nov 21 05:38:54 ns2299 spamc[19745]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused
The same problem again : Nov 22 05:54:18 ns2299 spamd[12426]: tcp timeout at /usr/lib/perl5/site_perl/5.8.3/Mail/SpamAssassin/SpamdForkScaling.pm line 195. I saw this problem since twelve days always between 5:30 and 5:59 ! Very strange... My configuration : RedHat server, Kernel 2.4 with grsec Greg
Same problem here, occured only once so far Dec 5 00:15:04 fedora spamd[29470]: tcp timeout at /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Dec 5 00:15:04 fedora spamd[29470]: tcp timeout at /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Dec 5 00:28:22 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused Dec 5 00:28:23 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#2 of 3): Connection refused Dec 5 00:28:24 fedora spamc[8763]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#3 of 3): Connection refused CentOS 3.6 kernel 2.4.21-37.EL
this may be related to bug 4693 - as they both involve the 'tcp timeout' being triggered... just slightly different outcomes.
Hello I don't think it's the same problem. In fact the outcome is different... I'm very embarrassed with this bug... I must restart spamd every day...(In reply to comment #3) > this may be related to bug 4693 - as they both involve the 'tcp timeout' being > triggered... just slightly different outcomes.
> I saw this problem since twelve days always between 5:30 and 5:59 ! Do you have a cleaning person who unplugs a router to plug in the floor cleaner every day? :-) > Very strange... Yes it is. Do you get the same results if you run spamd with the --round-robin option?
indeed, I don't see any correlation with bug 4693 I run SA 3.1.0 since it's release date, it had been running for about 45 days when the crash occured. I didn't make any change/update to perl or anything else regarding postfix or spamassassin in a while.. imho it's a spamassassin bug (In reply to comment #4 and #5) > Hello I don't think it's the same problem. In fact the outcome is different... > I'm very embarrassed with this bug... I must restart spamd every day...(In reply > to comment #3) > > this may be related to bug 4693 - as they both involve the 'tcp timeout' being > > triggered... just slightly different outcomes. > >
I don't think these are related to the other "tcp timeout" bug. Most often, when spamd dies at the same time every day or every week, in the middle of the night, I would tend to suspect syslog cycling. Spamd needs a "kill -1" when syslogd is restarted... If that's not it, I would suggest trying to capture some strace output; strace -fp of the process that issues the warning.
(In reply to comment #7) > I don't think these are related to the other "tcp timeout" bug. > > Most often, when spamd dies at the same time every day or every week, in the > middle of the night, I would tend to suspect syslog cycling. Spamd needs a > "kill -1" when syslogd is restarted... > Very unlikely. For me this happened at the middle of the day after spamd running for several weeks. Here's my logs: Signal 11 was received for spamd[4481] at 12:06:22 Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBBIBB Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIIBBKBB Dec 29 12:06:18 spamd[4481]: spamd: handled cleanup of child pid 8868 due to SIGCHLD Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBKBB Dec 29 12:06:18 spamd[4481]: prefork: child states: BIBIBBKIB Dec 29 12:06:18 spamd[4481]: prefork: child states: BIIIBBKKB Dec 29 12:06:18 spamd[4481]: spamd: handled cleanup of child pid 27700 due to SIGCHLD Dec 29 12:06:18 spamd[4481]: prefork: child states: BIIKBBKB Dec 29 12:06:22 spamd[4481]: tcp timeout at /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Dec 29 12:06:22 spamd[4481]: tcp timeout at /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Dec 29 12:06:22 spamd[4481]: spamd: handled cleanup of child pid 6885 due to SIGCHLD Dec 29 12:06:22 spamd[4481]: spamd: handled cleanup of child pid 22408 due to SIGCHLD Dec 29 12:06:22 spamd[4481]: prefork: child states: BIIBBB Dec 29 12:06:24 spamd[30389]: spamd: clean message (-3.9/4.8) for ... in 7.7 seconds, 5490 bytes. Dec 29 12:06:24 spamd[19326]: spamd: clean message (-3.9/4.8) for ... in 7.9 seconds, 5487 bytes. Dec 29 12:06:24 spamd[1513]: spamd: clean message (-3.9/4.8) for ... in 7.5 seconds, 3883 bytes. Dec 29 12:06:25 spamd[9107]: spamd: clean message (-4.2/4.8) for ... in 7.8 seconds, 3331 bytes. Dec 29 12:06:25 spamc[9210]: connect(AF_UNIX) to spamd spamd.socket failed: Connection refused
Similar problem here with it dying at random during the day. No recycling of logs or anything here. Linux kernel 2.6.12.2 with SA 3.1 and Exim 3.36 See below log piece I captured from a start at 18:25 until it died on its own at 1853 Jan 9 18:22:45 eagle spamd[18307]: logger: removing stderr method Jan 9 18:25:09 eagle dccifd[18383]: write(MTA socket,89): Broken pipe Jan 9 18:25:24 eagle spamd[18313]: spamd: server started on port 783/tcp (runni ng version 3.1.0) Jan 9 18:25:24 eagle spamd[18313]: spamd: server pid: 18313 Jan 9 18:25:24 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18414 Jan 9 18:25:24 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18415 Jan 9 18:25:24 eagle spamd[18414]: spamd: connection from localhost [127.0.0.1] at port 59378 Jan 9 18:25:24 eagle spamd[18414]: spamd: processing message <43c2fe5f@happypat tern.com> for mail:561 Jan 9 18:25:24 eagle spamd[18313]: prefork: child states: BI Jan 9 18:25:24 eagle spamd[18415]: spamd: connection from localhost [127.0.0.1] at port 59381 Jan 9 18:25:24 eagle spamd[18415]: spamd: processing message <o4vD3rkVl.v3YxK2d 9Af.nqkiRY1@spray.se> for mail:561 Jan 9 18:25:39 eagle spamd[18414]: spamd: identified spam (122.9/5.0) for mail: 561 in 15.3 seconds, 5772 bytes. Jan 9 18:25:40 eagle spamd[18414]: spamd: result: Y 122 - DCC_CHECK,DIGEST_MULT IPLE,FRONTPAGE,HTML_MESSAGE,HTML_TAG_BALANCE_BODY,RAZOR2_CF_RANGE_51_100,RAZOR2_ CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_EN_A_9XX_1,SARE_H TML_HTML_BEFORE,SARE_SXLIFE,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_WS_SURBL,USER_IN _BLACKLIST,WLS_URI_OPT_1395 scantime=15.3,size=5772,user=mail,uid=561,required_s core=5.0,rhost=localhost,raddr=127.0.0.1,rport=59378,mid=<43c2fe5f@happypattern. com>,autolearn=disabled Jan 9 18:25:50 eagle spamd[18414]: __alarm__ Jan 9 18:25:50 eagle spamd[18414]: __alarm__ Jan 9 18:25:50 eagle spamd[18313]: prefork: child states: BB Jan 9 18:25:50 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18433 Jan 9 18:25:50 eagle spamd[18313]: prefork: child states: BBB Jan 9 18:25:50 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18434 Jan 9 18:25:50 eagle spamd[18313]: prefork: child states: BBBI Jan 9 18:25:50 eagle spamd[18433]: spamd: connection from localhost [127.0.0.1] at port 59385 Jan 9 18:25:50 eagle spamd[18433]: spamd: processing message <Xzugd31hP.N12uh2t D.iK9niF@postmaster.co.uk> for mail:561 Jan 9 18:25:50 eagle spamd[18414]: spamd: connection from localhost [127.0.0.1] at port 59384 Jan 9 18:25:50 eagle spamd[18414]: spamd: processing message <JH4gKuqgma.dPnxot NlAS.SVXzzQw@mweb.com.cn> for mail:561 Jan 9 18:26:13 eagle spamd[18434]: spamd: connection from localhost [127.0.0.1] at port 46548 Jan 9 18:26:13 eagle spamd[18434]: spamd: processing message <000001c615a6$a397 6400$0100007f@localhost> for mail:561 Jan 9 18:26:32 eagle spamd[18415]: spamd: identified spam (25.1/5.0) for mail:5 61 in 67.9 seconds, 772 bytes. Jan 9 18:26:32 eagle spamd[18415]: spamd: result: Y 25 - DCC_CHECK,DIGEST_MULTI PLE,FROM_LOCAL_NOVOWEL,INFO_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100 ,RAZOR2_CHECK,SARE_URI_RAW_ONLY,URIBL_JP_SURBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI _NO_WWW_INFO_CGI,WLS_URI_OPT_1475 scantime=67.9,size=772,user=mail,uid=561,requi red_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=59381,mid=<o4vD3rkVl.v3YxK2d 9Af.nqkiRY1@spray.se>,autolearn=disabled Jan 9 18:26:50 eagle spamd[18415]: __alarm__ Jan 9 18:26:50 eagle spamd[18415]: __alarm__ Jan 9 18:26:54 eagle spamd[18313]: prefork: child states: BBBB Jan 9 18:26:54 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18464 Jan 9 18:26:54 eagle spamd[18313]: prefork: child states: BBBBI Jan 9 18:26:54 eagle spamd[18415]: spamd: connection from localhost [127.0.0.1] at port 46551 Jan 9 18:26:55 eagle spamd[18415]: spamd: processing message <001501c6157c$2a66 6740$8200a8c0@David> for mail:561 Jan 9 18:27:12 eagle spamd[18414]: spamd: identified spam (32.5/5.0) for mail:5 61 in 82.2 seconds, 750 bytes. Jan 9 18:27:12 eagle spamd[18414]: spamd: result: Y 32 - INFO_TLD,RAZOR2_CF_RAN GE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_URI_RAW_ONLY,URIBL_AB_SURB L,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NO_W WW_INFO_CGI,WLS_URI_OPT_3272 scantime=82.2,size=750,user=mail,uid=561,required_s core=5.0,rhost=localhost,raddr=127.0.0.1,rport=59384,mid=<JH4gKuqgma.dPnxotNlAS. SVXzzQw@mweb.com.cn>,autolearn=disabled Jan 9 18:27:23 eagle spamd[18414]: __alarm__ Jan 9 18:27:23 eagle spamd[18414]: __alarm__ Jan 9 18:27:29 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:27:29 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:27:29 eagle spamd[18313]: prefork: select returned undef! recovering Jan 9 18:27:41 eagle spamd[18433]: spamd: identified spam (26.4/5.0) for mail:5 61 in 110.7 seconds, 805 bytes. Jan 9 18:27:41 eagle spamd[18433]: spamd: result: Y 26 - DCC_CHECK,FROM_LOCAL_N OVOWEL,INFO_TLD,SARE_URI_RAW_ONLY,URIBL_JP_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_ WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_1475 scantime=110.7,size=805,user=mail, uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=59385,mid=<Xzug d31hP.N12uh2tD.iK9niF@postmaster.co.uk>,autolearn=disabled Jan 9 18:27:47 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:27:47 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:27:47 eagle spamd[18313]: prefork: select returned undef! recovering Jan 9 18:27:48 eagle spamd[18313]: prefork: child states: IBBBI Jan 9 18:27:51 eagle spamd[18414]: spamd: connection from localhost [127.0.0.1] at port 46555 Jan 9 18:27:53 eagle spamd[18414]: spamd: processing message <191690-2200612100 271598@PSU.EDU> for mail:561 Jan 9 18:27:57 eagle spamd[18433]: __alarm__ Jan 9 18:27:57 eagle spamd[18433]: __alarm__ Jan 9 18:27:57 eagle spamd[18313]: prefork: child states: BBIBI Jan 9 18:28:25 eagle spamd[18433]: spamd: connection from localhost [127.0.0.1] at port 46556 Jan 9 18:28:26 eagle spamd[18433]: spamd: processing message <001101c60f1c$66c0 f360$0200a8c0@pc> for mail:561 Jan 9 18:28:42 eagle spamd[18434]: spamd: identified spam (38.4/5.0) for mail:5 61 in 149.0 seconds, 14763 bytes. Jan 9 18:28:43 eagle spamd[18434]: spamd: result: Y 38 - DCC_CHECK,DIGEST_MULTI PLE,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_ 51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RM_bpoem_InstantDL,SARE_OEM_AND_OT HER,SARE_OEM_A_1,SARE_OEM_NEW_TITLES,SARE_OEM_PRODS_1,SARE_OEM_PRODS_FEW,SARE_OE M_S_PRICE,SARE_OEM_UPPER_EYE,SARE_PRODS_LOTS,SARE_PRODUCTS_02,SARE_PRODUCTS_03,S ARE_PRODUCTS_04,SARE_SUB_CHEAP_SW,URIBL_JP_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_ WS_SURBL,WLS_URI_OPT_2170 scantime=149.0,size=14763,user=mail,uid=561,required_s core=5.0,rhost=localhost,raddr=127.0.0.1,rport=46548,mid=<000001c615a6$a3976400$ 0100007f@localhost>,autolearn=disabled Jan 9 18:28:59 eagle spamd[18464]: spamd: connection from localhost [127.0.0.1] at port 46559 Jan 9 18:29:24 eagle spamd[18434]: __alarm__ Jan 9 18:29:28 eagle spamd[18434]: __alarm__ Jan 9 18:29:28 eagle spamd[18313]: prefork: child states: BBBIB Jan 9 18:29:53 eagle spamd[18464]: spamd: processing message <dKqBnX.squirrel@2 18.208.51.3> for mail:561 Jan 9 18:30:35 eagle spamd[18433]: spamd: identified spam (47.5/5.0) for mail:5 61 in 130.4 seconds, 3461 bytes. Jan 9 18:30:37 eagle spamd[18433]: spamd: result: Y 47 - DATE_IN_PAST_06_12,DCC _CHECK,DIGEST_MULTIPLE,DRUGS_ERECTILE,DRUG_DOSAGE,DRUG_ED_CAPS,FORGED_YAHOO_RCVD ,HTML_FONT_BIG,HTML_FONT_SIZE_LARGE,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,INFO_TLD,M SGID_OUTLOOK_INVALID,NO_REAL_NAME,ONLINE_PHARMACY,RAZOR2_CF_RANGE_51_100,RAZOR2_ CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,SARE_SUB_AS_LOW_AS,SAR E_SUPERVIAGRA,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SU RBL,URIBL_WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_1744 scantime=130.4,size=3461 ,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=4655 6,mid=<001101c60f1c$66c0f360$0200a8c0@pc>,autolearn=disabled Jan 9 18:30:40 eagle spamd[18434]: spamd: connection from localhost [127.0.0.1] at port 46566 Jan 9 18:30:47 eagle spamd[18313]: prefork: child states: BBIBB Jan 9 18:30:54 eagle spamd[18434]: spamd: processing message <j$39-27i184-$1g@f nsdvt.mqon> for mail:561 Jan 9 18:31:07 eagle spamd[18433]: spamd: connection from localhost [127.0.0.1] at port 34888 Jan 9 18:31:09 eagle spamd[18433]: spamd: processing message <7-10168083-19L6LK JNFKCLbi4Zru@em.prewards.com> for mail:561 Jan 9 18:31:27 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:31:28 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:31:28 eagle spamd[18313]: prefork: select returned undef! recovering Jan 9 18:31:51 eagle spamd[18414]: spamd: clean message (1.3/5.0) for mail:561 in 234.4 seconds, 3145 bytes. Jan 9 18:32:05 eagle spamd[18414]: spamd: result: . 1 - HTML_IMAGE_ONLY_16,HTM L_MESSAGE,MIME_BOUND_NEXTPART,MISSING_HEADERS,SARE_HTML_BODY_2SP,TO_CC_NONE scan time=234.4,size=3145,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr= 127.0.0.1,rport=46555,mid=<191690-2200612100271598@PSU.EDU>,autolearn=disabled Jan 9 18:32:07 eagle spamd[18414]: __alarm__ Jan 9 18:32:08 eagle spamd[18414]: __alarm__ Jan 9 18:32:08 eagle spamd[18313]: prefork: child states: IBBBB Jan 9 18:32:08 eagle spamd[18415]: spamd: clean message (0.1/5.0) for mail:561 in 308.9 seconds, 9149 bytes. Jan 9 18:32:09 eagle spamd[18415]: spamd: result: . 0 - HTML_MESSAGE,TW_TB sca ntime=308.9,size=9149,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr =127.0.0.1,rport=46551,mid=<001501c6157c$2a666740$8200a8c0@David>,autolearn=disa bled Jan 9 18:32:09 eagle spamd[18313]: prefork: child states: IIBBB Jan 9 18:32:27 eagle spamd[18414]: spamd: connection from localhost [127.0.0.1] at port 34889 Jan 9 18:32:32 eagle spamd[18414]: spamd: processing message <158c195h.6664687@ yahoo.com> for mail:561 Jan 9 18:32:45 eagle spamd[18433]: spamd: identified spam (23.8/5.0) for mail:5 61 in 97.9 seconds, 4857 bytes. Jan 9 18:32:46 eagle spamd[18433]: spamd: result: Y 23 - BAD_CREDIT,DCC_CHECK,H TML_IMAGE_RATIO_02,HTML_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_ RANGE_E4_51_100,SARE_MSGID_DDDASH,SARE_SUB_POOR_CREDIT,URIBL_JP_SURBL,URIBL_OB_S URBL,URIBL_SBL,URIBL_WS_SURBL,WLS_URI_OPT_14 scantime=97.9,size=4857,user=mail,u id=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=34888,mid=<7-101 68083-19L6LKJNFKCLbi4Zru@em.prewards.com>,autolearn=disabled Jan 9 18:32:48 eagle spamd[18415]: spamd: connection from localhost [127.0.0.1] at port 34893 Jan 9 18:32:51 eagle spamd[18415]: spamd: processing message <200601100031.k0A0 VLnG016456@smtp30.hccnet.nl> for mail:561 Jan 9 18:33:36 eagle spamd[18434]: spamd: identified spam (25.3/5.0) for mail:5 61 in 174.2 seconds, 10711 bytes. Jan 9 18:34:01 eagle spamd[18434]: spamd: result: Y 25 - DCC_CHECK,FORGED_MUA_E UDORA,FROM_HAS_MIXED_NUMS,FROM_ILLEGAL_CHARS,HEAD_ILLEGAL_CHARS,HTML_IMAGE_RATIO _02,HTML_MESSAGE,HTTP_EXCESSIVE_ESCAPES,MIME_HTML_ONLY,MIME_HTML_ONLY_MULTI,MPAR T_ALT_DIFF,RCVD_DOUBLE_IP_LOOSE,RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO,REPTO_QU OTE_QUALCOMM,SARE_HTML_TITLE_1WD,SUBJ_ALL_CAPS,SUBJ_ILLEGAL_CHARS,URIBL_SBL scan time=174.2,size=10711,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr =127.0.0.1,rport=46566,mid=<j$39-27i184-$1g@fnsdvt.mqon>,autolearn=disabled Jan 9 18:34:05 eagle spamd[18433]: __alarm__ Jan 9 18:34:07 eagle spamd[18433]: __alarm__ Jan 9 18:34:07 eagle spamd[18313]: prefork: child states: BBIBB Jan 9 18:34:08 eagle spamd[18414]: spamd: identified spam (11.2/5.0) for mail:5 61 in 71.1 seconds, 12360 bytes. Jan 9 18:34:08 eagle spamd[18414]: spamd: result: Y 11 - DATE_IN_PAST_96_XX,FOR GED_MSGID_YAHOO,HELO_DYNAMIC_COMCAST,HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_SHORT_ LINK_IMG_1,MIME_HTML_ONLY,SARE_UNF scantime=71.1,size=12360,user=mail,uid=561,re quired_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=34889,mid=<158c195h.66646 87@yahoo.com>,autolearn=disabled Jan 9 18:34:08 eagle spamd[18313]: prefork: child states: BBIIB Jan 9 18:34:08 eagle spamd[18414]: __alarm__ Jan 9 18:34:09 eagle spamd[18414]: __alarm__ Jan 9 18:34:09 eagle spamd[18313]: prefork: child states: IBIIB Jan 9 18:34:12 eagle spamd[18313]: spamd: handled cleanup of child pid 18434 du e to SIGCHLD Jan 9 18:34:13 eagle spamd[18313]: prefork: child states: IBIB Jan 9 18:34:26 eagle spamd[18464]: child processing timeout at /usr/local/bin/s pamd line 1085, <GEN20> line 36. Jan 9 18:34:28 eagle spamd[18464]: child processing timeout at /usr/local/bin/s pamd line 1085, <GEN20> line 36. Jan 9 18:34:35 eagle spamd[18414]: spamd: connection from localhost [127.0.0.1] at port 34897 Jan 9 18:34:37 eagle spamd[18414]: spamd: processing message <E1Ew7Rr-0004si-00 @eagle.netwrx1.com> for mail:561 Jan 9 18:34:37 eagle spamd[18433]: spamd: connection from localhost [127.0.0.1] at port 34898 Jan 9 18:34:37 eagle spamd[18433]: spamd: processing message <dpuvat+gdo9@eGrou ps.com> for mail:561 Jan 9 18:35:02 eagle spamd[18464]: spamd: identified spam (25.9/5.0) for mail:5 61 in 364.8 seconds, 1867 bytes. Jan 9 18:35:02 eagle spamd[18464]: spamd: result: Y 25 - HTML_FONT_FACE_BAD,HTM L_MESSAGE,MIME_HTML_ONLY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2 _CHECK,SARE_UNG,SARE_UNQ,TW_OQ,TW_QH,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURB L,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NOVOWEL,WLS_URI_OPT_3415 scantime=364.8,size =1867,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport =46559,mid=<dKqBnX.squirrel@218.208.51.3>,autolearn=disabled Jan 9 18:35:10 eagle spamd[18415]: spamd: clean message (0.0/5.0) for mail:561 in 143.6 seconds, 3130 bytes. Jan 9 18:35:15 eagle spamd[18415]: spamd: result: . 0 - UNPARSEABLE_RELAY scan time=143.6,size=3130,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr= 127.0.0.1,rport=34893,mid=<200601100031.k0A0VLnG016456@smtp30.hccnet.nl>,autolea rn=disabled Jan 9 18:35:34 eagle spamd[18415]: __alarm__ Jan 9 18:35:34 eagle spamd[18415]: __alarm__ Jan 9 18:35:35 eagle spamd[18415]: spamd: connection from localhost [127.0.0.1] at port 34904 Jan 9 18:35:37 eagle spamd[18313]: prefork: child states: BBBB Jan 9 18:35:42 eagle spamd[18415]: spamd: processing message <001101c60f1c$66c0 f360$0200a8c0@pc> for mail:561 Jan 9 18:35:57 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18903 Jan 9 18:36:03 eagle spamd[18464]: __alarm__ Jan 9 18:36:04 eagle spamd[18464]: __alarm__ Jan 9 18:36:05 eagle spamd[18313]: prefork: child states: BBBBB Jan 9 18:36:06 eagle spamd[18313]: spamd: server successfully spawned child pro cess, pid 18921 Jan 9 18:36:09 eagle spamd[18903]: spamd: connection from localhost [127.0.0.1] at port 34905 Jan 9 18:36:18 eagle spamd[18464]: spamd: connection from localhost [127.0.0.1] at port 34906 Jan 9 18:36:21 eagle spamd[18313]: prefork: child states: BBBBBI Jan 9 18:36:21 eagle spamd[18464]: spamd: processing message <KVKFJSEOCGRPSDDQU FZSZP@yahoo.com> for mail:561 Jan 9 18:37:34 eagle spamd[18903]: spamd: processing message <000001c615a7$d8f0 8180$0100007f@localhost> for mail:561 Jan 9 18:38:04 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:38:11 eagle spamd[18313]: tcp timeout at /usr/local/lib/perl5/site_per l/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Jan 9 18:38:11 eagle spamd[18313]: prefork: select returned undef! recovering Jan 9 18:38:15 eagle spamd[18433]: spamd: clean message (0.0/5.0) for mail:561 in 210.0 seconds, 3170 bytes. Jan 9 18:38:18 eagle spamd[18433]: spamd: result: . 0 - scantime=210.0,size=3 170,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=3 4898,mid=<dpuvat+gdo9@eGroups.com>,autolearn=disabled Jan 9 18:38:28 eagle spamd[18433]: __alarm__ Jan 9 18:38:32 eagle spamd[18433]: __alarm__ Jan 9 18:38:42 eagle spamd[18414]: spamd: identified spam (10.1/5.0) for mail:5 61 in 233.4 seconds, 1581 bytes. Jan 9 18:39:19 eagle spamd[18414]: spamd: result: Y 10 - DCC_CHECK,MAILTO_TO_SP AM_ADDR,MIME_BASE64_TEXT,MISSING_MIMEOLE,MSGID_FROM_MTA_ID,RCVD_NUMERIC_HELO,SUB J_ILLEGAL_CHARS scantime=233.4,size=1581,user=mail,uid=561,required_score=5.0,rh ost=localhost,raddr=127.0.0.1,rport=34897,mid=<E1Ew7Rr-0004si-00@eagle.netwrx1.c om>,autolearn=disabled Jan 9 18:40:09 eagle spamd[18464]: spamd: identified spam (51.0/5.0) for mail:5 61 in 143.3 seconds, 2071 bytes. Jan 9 18:40:44 eagle spamd[18464]: spamd: result: Y 50 - DATE_IN_FUTURE_96_XX,D ATE_SPAMWARE_Y2K,FORGED_MUA_OUTLOOK,FORGED_OUTLOOK_HTML,FORGED_YAHOO_RCVD,FROM_I LLEGAL_CHARS,FROM_LOCAL_NOVOWEL,HEAD_ILLEGAL_CHARS,HTML_IMAGE_ONLY_12,HTML_IMAGE _RATIO_02,HTML_MESSAGE,HTTP_EXCESSIVE_ESCAPES,MIME_BOUND_DD_DIGITS,MIME_HTML_ONL Y,MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,MPART_ALT_DIFF,MSGID_SPAM_CAPS,MSGID_YAHO O_CAPS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,RCVD_NUMERI C_HELO,REPTO_QUOTE_YAHOO,SARE_BOUNDARY_07,SUBJ_ILLEGAL_CHARS,UNPARSEABLE_RELAY,U RIBL_JP_SURBL,URIBL_WS_SURBL,WLS_URI_OPT_2599,X_PRIORITY_HIGH scantime=143.3,siz e=2071,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rpor t=34906,mid=<KVKFJSEOCGRPSDDQUFZSZP@yahoo.com>,autolearn=disabled Jan 9 18:41:07 eagle spamd[18464]: __alarm__ Jan 9 18:41:08 eagle spamd[18464]: __alarm__ Jan 9 18:41:08 eagle spamd[18414]: __alarm__ Jan 9 18:41:09 eagle spamd[18414]: __alarm__ Jan 9 18:41:09 eagle spamd[18415]: spamd: identified spam (48.5/5.0) for mail:5 61 in 195.3 seconds, 3443 bytes. Jan 9 18:41:38 eagle spamd[18415]: spamd: result: Y 48 - DATE_IN_PAST_06_12,DCC _CHECK,DIGEST_MULTIPLE,DRUGS_ERECTILE,DRUGS_PAIN,DRUGS_PAIN_EREC,DRUG_DOSAGE,DRU G_ED_CAPS,FORGED_YAHOO_RCVD,HTML_FONT_BIG,HTML_FONT_SIZE_LARGE,HTML_MESSAGE,HTML _TAG_EXIST_TBODY,INFO_TLD,MSGID_OUTLOOK_INVALID,NO_REAL_NAME,ONLINE_PHARMACY,RAZ OR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_C HECK,SARE_SUB_AS_LOW_AS,SARE_SUPERVIAGRA,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_ SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,URI_NO_WWW_INFO_CGI,WLS_URI_OPT_17 44 scantime=195.3,size=3443,user=mail,uid=561,required_score=5.0,rhost=localhost ,raddr=127.0.0.1,rport=34904,mid=<001101c60f1c$66c0f360$0200a8c0@pc>,autolearn=d isabled Jan 9 18:41:39 eagle spamd[18903]: child processing timeout at /usr/local/bin/s pamd line 1085, <GEN23> line 147. Jan 9 18:41:39 eagle spamd[18903]: child processing timeout at /usr/local/bin/s pamd line 1085, <GEN23> line 147. Jan 9 18:41:39 eagle spamd[18903]: child processing timeout at /usr/local/bin/s pamd line 1085, <GEN23> line 147. Jan 9 18:42:09 eagle spamd[18903]: spamd: identified spam (35.4/5.0) for mail:5 61 in 364.8 seconds, 14725 bytes. Jan 9 18:44:25 eagle spamd[18903]: spamd: result: Y 35 - HTML_MESSAGE,HTML_TAG_ EXIST_TBODY,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_ 51_100,RAZOR2_CHECK,RM_bpoem_InstantDL,SARE_OEM_AND_OTHER,SARE_OEM_A_1,SARE_OEM_ NEW_TITLES,SARE_OEM_PRODS_1,SARE_OEM_PRODS_FEW,SARE_OEM_S_PRICE,SARE_OEM_UPPER_E YE,SARE_PRODS_LOTS,SARE_PRODUCTS_02,SARE_PRODUCTS_03,SARE_PRODUCTS_04,URIBL_JP_S URBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL,WLS_URI_OPT_2170 scantime=364.8,siz e=14725,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rpo rt=34905,mid=<000001c615a7$d8f08180$0100007f@localhost>,autolearn=disabled Jan 9 18:47:04 eagle spamd[18921]: prefork: sysread(12) failed after 300 secs a t /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm lin e 501. Jan 9 18:48:21 eagle spamd[18433]: prefork: sysread(9) failed after 300 secs at /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 501. Jan 9 18:48:34 eagle spamd[18903]: __alarm__ Jan 9 18:48:36 eagle spamd[18464]: prefork: sysread(11) failed after 300 secs a t /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm lin e 501. Jan 9 18:48:37 eagle spamd[18414]: prefork: sysread(7) failed after 300 secs at /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 501. Jan 9 18:48:38 eagle spamd[18415]: prefork: sysread(8) failed after 300 secs at /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 501. Jan 9 18:48:39 eagle spamd[18903]: __alarm__ Jan 9 18:53:03 eagle spamc[19415]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#1 of 3): Connection refused Jan 9 18:53:04 eagle spamc[19415]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#2 of 3): Connection refused Jan 9 18:53:04 eagle spamc[19420]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#1 of 3): Connection refused Jan 9 18:53:05 eagle spamc[19415]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#3 of 3): Connection refused Jan 9 18:53:05 eagle spamc[19420]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#2 of 3): Connection refused Jan 9 18:53:06 eagle spamc[19415]: connection attempt to spamd aborted after 3 retries Jan 9 18:53:06 eagle spamc[19420]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#3 of 3): Connection refused Jan 9 18:53:07 eagle spamc[19420]: connection attempt to spamd aborted after 3 retries Jan 9 18:53:08 eagle spamc[19429]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#1 of 3): Connection refused Jan 9 18:53:09 eagle spamc[19429]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#2 of 3): Connection refused Jan 9 18:53:09 eagle spamc[19437]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#1 of 3): Connection refused Jan 9 18:53:10 eagle spamc[19429]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#3 of 3): Connection refused Jan 9 18:53:10 eagle spamc[19437]: connect(AF_INET) to spamd at 127.0.0.1 faile d, retrying (#2 of 3): Connection refused Jan 9 18:53:11 eagle spamc[19429]: connection attempt to spamd aborted after 3 retries
Created attachment 3328 [details] mail log traces w/ -Dprefork strace files for bug as requested
those logs are a little odd. spamd-maillog: Jan 10 23:05:07 eagle spamd[24378]: spamd: server successfully spawned child pro cess, pid 26130 spamd-trace2: 24378 23:04:49 fork() = 26130 26130 23:04:49 getpid() = 26130 26130 23:04:50 open("/etc/passwd", O_RDONLY) = 15 Is the maillog from a different machine, with clock skew between the two? otherwise I can't see how it took syslogd 17 seconds to syslog that message. however -- It *could* be something to do with a slow/remote syslogd slowing up logging, or extremely heavy load. I notice this in spamd-trace2... 26130 23:04:50 write(4, "<23>spamd[26130]: prefork: sysre"..., 69 <unfinished ...> 26130 23:04:52 <... write resumed> ) = 69 in other words it took 2 seconds just for the syslogd write() call to complete. that's a *long* time for logging! actually, looking further into the strace it looks increasingly like load. e.g. 26130 23:05:04 accept(5, {sin_family=AF_INET, sin_port=htons(45076), sin_addr=inet_a ddr("127.0.0.1")}}, [16]) = 13 26130 23:05:04 write(14, "B\0\0f\22\n", 6) = 6 26130 23:05:07 read(13, " 1 Office Pro 2003 \n 2 Adobe Ph"..., 1024) = 1024 26130 23:05:17 read(13, "height=3D23></A> <BR> <BR> <B>Av"..., 1024) = 556 26130 23:05:17 read(13, "height=3D23></A> <BR> <BR> <B>Av"..., 1024) = 556 26130 23:06:25 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0 now: >1 minute without performing system calls? that's a busy CPU working on parsing. in general, that scan process carries on as normal, but very slowly; then: 26130 23:10:07 --- SIGALRM (Alarm clock) --- So it really is timing out, simply due to extremely high CPU load, it looks like. I would look at why it takes 2 seconds to log a syslog message; that's bad. --j.
The logs are off the same machine here. Without SA 3.1 running system load is usually < 4 and usually under 1. However running spamd 3.1 will drive it up to near 10 or more. I came home to see it running at 84 (that's not a typo) and once I killed SA we're back to 2.7 in a matter of 5 minutes. SA 3.0.4 did not behave this way. I haven't changed anything in the way SA is configured and at first when you start it it will behave normally but over time the CPU load will climb until the program fails and is no longer running. I'm not a programmer here so I don't know what more to add, I'd be happy to try or test anything you like given clear instructions.
from the mailing list: 'Things are looking back to normal here with 4.60 and 3.1 now after some help by Larry Rosenman <ler@lerctr.org>, turns out that one of the SARE rules (I'm thinking one of the blacklist URI ones about 13 meg in size (sorry I don't recall the name of it at the moment) may have been causing our grief along with not being on Perl 5.8.7 which we also did here. After cleaning up those 2 items it is looking good so far.' IOW, George's case was just a load issue caused by the sa-blacklist and/or sa-blacklist-uri rulesets, not a bug.
(In reply to comment #13) > from the mailing list: > > 'Things are looking back to normal here with 4.60 and 3.1 now after > some help by Larry Rosenman <ler@lerctr.org>, turns out that one of > the SARE rules (I'm thinking one of the blacklist URI ones about 13 > meg in size (sorry I don't recall the name of it at the moment) may > have been causing our grief along with not being on Perl 5.8.7 which > we also did here. After cleaning up those 2 items it is looking good > so far.' > > IOW, George's case was just a load issue caused by the sa-blacklist and/or > sa-blacklist-uri rulesets, not a bug. So is there a solution to this problem?
Isn't this a duplicate of Bug #4563 ?
'So is there a solution to this problem?' George's bug -- yes, get rid of large rulesets. This, I think, is the problem for people who are seeing it happen at all times of day. Greg's bug -- not necessarily, since they haven't commented if this fixes their problems too, and also there's the thing that they're only seeing it at a certain time of day.
I have eliminated all occurances of this error by upgrading the memory on my mailserver. Is there any way that the occurance of this is directly related to the amount of available memory the system has? I upgraded the RAM (previous RAM all tests out as good RAM), and now I'm not having this error any more. I'm curious as to the amount of available memory that people have, who are also experiencing this problem.
'I have eliminated all occurances of this error by upgrading the memory on my mailserver.' note that it *is* load-related, and if your spamd processes are hitting swap, that'll greatly increase load and scan-times. so yes, it'd correlate with memory use.
I've run into this on my own server, btw, and it was *definitely* load-related there. It doesn't take a lot of load to cause it, either. What happens is, the select call uses a 2-second timeout. Now, this alarm in question is intended to "sanity-check" that select call and catch the case where it ignores that timeout. It does this by doubling the select's timeout, and if this second (backup) timeout is hit, it complains, because select() should have timed out itself, instead of the backup timeout being hit: # use alarm to back up select()'s built-in alarm, to debug theo's bug eval { Mail::SpamAssassin::Util::trap_sigalrm_fully(sub { die "pf select timeout"; }); alarm ($tout*2) if ($tout); ($nfound, $timeleft) = select($rout=$rin, undef, $eout=$rin, $tout); }; alarm 0; Unfortunately, double 2 is only 4, which really isn't very big -- it's pretty easy for a sudden load spike to stop a single process from executing two consecutive lines of a perl script within 2 seconds, especially since there's an eval {} context switch between the two.
Created attachment 3343 [details] possible fix here's a possible fix; I've applied it to SVN trunk anyway, as r372085, and will be running it on my server to see if it helps. (note to self: I applied this at : [no view] exit=0 Wed Jan 25 03:22:37 GMT 2006 last timeout log was at 'Jan 25 03:06:54 dogma spamd[27797]'.)
seems to have cleared it up here. I suggest anyone else seeing this give that patch a try....
I, too, had this happen this morning while moving 10 GB across IDE drives. Patch looks reasonable to me. Moving to 3.1.1 queue.
nah, this didn't fix it -- it showed up again today under *massive* load (like load average > 180, caused by other procs on the system). it's the right general approach though, so I'm tidying it up and looking into some other spurious "errors" that also showed up during that incident.
Created attachment 3348 [details] better fix ok, this is a better fix. That code: (a) could leak alarm die() noise (a la bug 4631); (b) the extra few seconds were still not enough; (c) it wasn't protecting itself against any other active timeouts, so a timeout-within-a-timeout wasn't going to work; (d) it should have been stopping the timeout timer before the eval { } scope was exited, instead of after; (e) it should have been treating an alarm timeout event as equivalent to the select() timeout, instead of issuing warnings, since under heavy load, the alarm() timeout may be noticed first. So, all in all, (b) and (e) were the big issues that causes this bug. All the others, (a), (c), and (d), are ones that we have fixed in other places (e.g. cf. bug 4631), but not yet here. BTW obviously that's not good. I plan to abstract all these timeout timers into a new class in svn trunk, I'm sick of fixing the same bugs again and again due to duplication :(
I was going to ask if 10 seconds was long enough. I bet it wouldn't have when cyan hit a load average of 980 two weeks ago. ;) Are the leaked messages errors or warns?
the leaked messages are those __alarm__ warns; ugly and misleading.
*** Bug 4563 has been marked as a duplicate of this bug. ***
the "prefork: select returned undef! recovering" messages are also spurious when they appear directly after "spamd: handled cleanup of child pid 13101 due to SIGCHLD". I have a fix for this, but it's tied up with a bigger change, so not so easy to patch now...
I'm coming over from bug 4563 Just wanted to report that since I have applied the first patch I haven't seen any timeouts in my log. I have since applied the second patch.
err, and nobody agreed with me on comment #3 ;) i've applied the patch and the issue has disappeared here.
oops, let me clarify. i havent had the problem outlined in bug 4693 since applying this patch.. FWIW.
Created attachment 3352 [details] further fixing ok, here's the big change. This one adds a new Timeout class to portably and cleanly encapsulate SIGALRM/POSIX messiness throughout the code; capture and inhibit a spurious "select returned -1" error when spamd children exit; fix an incorrect comment about write retries in SpamdPreforkScaling. it's in trunk as r373962.
Created attachment 3353 [details] silly further fix another one, on top of 3348 and 3352. This fixes a minor bug where alarms were being treated as normal die() failures instead of timeouts due to a typo. So far I haven't seen any "tcp timeout", "alarm failed", or "select failed" warnings; they've cleared right up on my dogfood server. Strongly recommend trying this out...
For anyone applying the last 2 patches to 3.1. You need to have the patches from bug 4651 installed (so that they can be removed by this patch :-) Hunk #2 of the MANIFEST patch will fail. Don't worry about that. I had to do a 'make clean', a 'perl Makefile.PL' & a 'make' to get the new Timeout.pm to be placed into the blib tree. The patch to Conf.pm seems to be a strange thing to include here.
> You need to have the patches from bug 4651 installed... Make that bug 4631.
A few things: 1) I'm pretty sure that this patch just masks any problem that is occurring in bug 4693. It doesn't actually fix it. 2) Justin's commits have said "workaround for spurious 'tcp timeout' warnings". This bug says (comment #0) that spamd actually stops responding to connections. It's not just spurious warnings like reported in bug 4563 (marked as duplicate). So I'd say this fixes bug 4563, which isn't actually a duplicate (if the warnings Justin, Tom and Chris are seeing are in fact just spurious) and that this bug (bug 4696) isn't yet fixed (unless anyone has seen otherwise -- GREG ARE YOU OUT THERE?!). 3) If this is all working for folks, is there a mininmal patch for 3.1.1? With the difficulty in getting votes for 10 line patches, I can just imagine the support for adding a new (Timeout) class to the 3.1 branch.
Any real problems are being obscured -- quite a lot -- by this error message. So far, this has been the main place people are talking about this error, so there probably are indeed multiple failure cases in this bug ID. This patch fixes *some* of them ;) also I'm not really expecting anyone to actually *vote* on the patch for 3.1.1, given recent voting rates. ;) But people who are running servers seem happy enough to have a patch that can be applied (with a little work) to clean up the errors.
Yeah, I'm not worried about getting a fix for "spurious warnings" into 3.1. On the other hand, fixes for actual spamd *failures* need to go into 3.1.1. After 4+ months of telling people to use --round-robin, it'd be nice to get a release out that fixes at least some of the problems. I'm just trying to identify what has been fixed by the patches in this bug, and what's necessary to get the needed fixes into 3.1 without too much voter apathy. GREG, SEB, TUOMAS: Do Justin's patches stop spamd from dying on you?
'After 4+ months of telling people to use --round-robin, it'd be nice to get a release out that fixes at least some of the problems. I'm just trying to identify what has been fixed by the patches in this bug, and what's necessary to get the needed fixes into 3.1 without too much voter apathy.' agreed. FWIW, I'm finding that much of the apparent failure cases are spurious, and caused by these warnings making noise about non-error conditions; getting those out of the way is helping narrow down the *real* bugs.
I wonder if the cause of spamd hanging is that when there is a timeout or communication is lost with a child, the problem child is not properly disposed of. It looks like patch 3352 does include a change to child cleanup. Also, making the timeout code more regular should help in analyzing the related child cleanup code. The patch is not really that big (part of it is the same change in 5 files). It should be considered for 3.1.1. It would also be good if people would try it out and see how many problems it fixes.
Created attachment 3365 [details] aggregated patch against b3_1_0 svn branch This new patch aggregates patches 3343, 3348, 3352, and 3353; it's a lot easier to apply, and is against the current code on SVN branch b3_1_0. there are no functional changes over those patches. Important: if you're rebuilding with this patch applied, you need to run "make clean", re-run "perl Makefile.PL", "make", after applying the patch, since it adds a new file.
Created attachment 3366 [details] aggregated patch against b3_1_0 svn branch take 2 Minor redo of patch 3365; that "TODO" comment in Conf.pm slipped in, removing.
Created attachment 3367 [details] aggregated patch, against the 3.1.0 release tarball This is the aggregated patch, against the 3.1.0 tarball, for people running into issues with 3.1.0 who want to try this out. (aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and b3_1_0 are major by now.)
damn diff/patch. in the agg patch against 3.1.0, Timeout.pm is left in the top-level directory, instead of its correct location. In addition, there was a bonus line of code that shouldn't have been there, in Pyzor.pm. new patch on the way...
Created attachment 3368 [details] aggregated patch, against the 3.1.0 release tarball, take 2 redo, again
The following mail to the dev list from Daryl did not make it in as a comment. >> ------- Additional Comments From jm@jmason.org 2006-02-10 20:30 ------- >> >> (aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and >> b3_1_0 are major by now.) > >++1 We're approaching 5 months since the 3.1.0 release. > >It'd be nice if we had some feedback on whether the patch in bug 4696 >was resolving people's problems with spamd. I'd imagine it is, but it'd >be nice to know for sure.
Tom, any problems since applying the latest patch? You seem to be the only one left alive...
(In reply to comment #46) > The following mail to the dev list from Daryl did not make it in as a comment. > > >> ------- Additional Comments From jm@jmason.org 2006-02-10 20:30 ------- > >> > >> (aside: we need to cut a 3.1.1 soon; the differences between 3.1.0 tarball and > >> b3_1_0 are major by now.) > > > >++1 We're approaching 5 months since the 3.1.0 release. > > > >It'd be nice if we had some feedback on whether the patch in bug 4696 > >was resolving people's problems with spamd. I'd imagine it is, but it'd > >be nice to know for sure. <delurk> I've been having this problem in 3.1.0 (tarball), starting with tcp timeouts in SpamdForkScaling.pm, then "prefork: select returned undef! recovering", and finally "prefork: syswrite(7) failed, retrying... at /usr /lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 554." which would totally paralyze spamd and require a restart. I wasn't able to get the first set of patches to work, but the aggregate against 3.1.0 (tarball) worked fine. spamd has been running clean all day with the new patch in place. Thanks for the patch!
I only have these problems when I go on vacation, so I won't be much help in evaluating how much this patch helps in the short term. I am hoping that there will be a final fix for this and the other similar problems before I go out of town again. A couple of things that I noticed. The patch from bug 4590 is not included here, but the fix for the bad comment from that patch is included here. The 'silly further fix' (attachment id-3353) was not included.
'A couple of things that I noticed. The patch from bug 4590 is not included here, but the fix for the bad comment from that patch is included here. The 'silly further fix' (attachment id-3353) was not included.' you're right on both counts. I need to combine the bug 4590 patch in here, and include the "silly further fix", in a new patch.
Is this the same problem I am having (see sample of log below) my sa is version 3.1.0. I have SA sat on a desktop PC and working in conjuction with kmail on e-mails from ISP's It has been setup like this for 18 months without a problem. Sorry if I am doing this wrong but it is the first time I have submitted a bug (or something that maybe a bug) Tim Feb 13 10:34:25 localhost spamd[3989]: Can't locate Mail/SPF/Query.pm in @INC (@INC contains: ../lib /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.7 /usr/local/share/perl/5.8.7 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl) at /usr/share/perl5/Mail/SpamAssassin/Plugin/SPF.pm line 272, <GEN304> line 63. Feb 13 10:34:25 localhost spamd[3989]: Can't locate Mail/SPF/Query.pm in @INC (@INC contains: ../lib /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.7 /usr/local/share/perl/5.8.7 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl) at /usr/share/perl5/Mail/SpamAssassin/Plugin/SPF.pm line 272, <GEN304> line 63. Feb 13 10:34:26 localhost spamd[3989]: spamd: identified spam (26.8/5.0) for mit:1000 in 1.6 seconds, 2886 bytes. Feb 13 10:34:26 localhost spamd[3989]: spamd: result: Y 26 - BAYES_99,DATE_IN_FUTURE_12_24,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_BOGUSMX,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,MIME_BASE64_BLANKS,MIME_BASE64_NO_NAME,MIME_BASE64_TEXT,RATWARE_NAME_ID,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_WHOIS_INVALID,RCVD_IN_XBL scantime=1.6,size=2886,user=mit,uid=1000,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=36224,mid=<EDBPOMNMAIMGDMGGNNKFDIBLCNAB.edmondfink_xo@ceril.cl>,bayes=1,autolearn=spam Feb 13 10:34:27 localhost spamd[3968]: prefork: child states: II Feb 13 13:55:02 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:55:16 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:55:44 localhost spamd[3968]: prefork: select returned undef! recovering Feb 13 13:56:39 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:56:43 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:57:50 localhost last message repeated 2 times Feb 13 13:57:57 localhost spamd[3968]: prefork: select returned undef! recovering Feb 13 13:58:38 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:59:14 localhost spamd[3968]: tcp timeout at /usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 195. Feb 13 13:59:38 localhost last message repeated 2 times Feb 13 13:59:42 localhost spamd[3968]: panic: leave_scope inconsistency. Feb 13 15:05:27 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused Feb 13 15:05:28 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#2 of 3): Connection refused Feb 13 15:05:29 localhost spamc[23543]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#3 of 3): Connection refused Feb 13 15:05:30 localhost spamc[23543]: connection attempt to spamd aborted after 3 retries
I manage quite a few RHES3 boxes, running SA. The load on the boxes are quite high. I was having problems described in bug 4696 and 4590. After turning on "round- robin" I was able to avoid the problems, but still every once in a while SA would crash and stop doing anything with logs like: @4000000043f03e18333f0e84 [12505] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000043f03e18333f6c44 [12505] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN841> line 204. @4000000043f03e18333fb67c [12505] warn: ) @4000000043f03e3305fb22b4 [12207] warn: plugin: eval failed: child processing timeout at /usr/bin/spamd line 1088. @4000000043f03e5113e187ec [12423] warn: rules: failed to run PYZOR_CHECK test, skipping: @4000000043f03e5113e1e994 [12423] warn: _(child processing timeout at /usr/bin/spamd line 1088. @4000000043f03e5113e233cc [12423] warn: ) @4000000043f03e5730b9abc4 [13178] warn: rules: failed to run DCC_CHECK test, skipping: @4000000043f03e5730ba1154 [13178] warn: _(child processing timeout at /usr/bin/spamd line 1088. @4000000043f03e5730ba57a4 [13178] warn: ) @4000000043f03e5818ba6be4 [12897] warn: rules: failed to run PYZOR_CHECK test, skipping: @4000000043f03e5818bacd8c [12897] warn: _(child processing timeout at /usr/bin/spamd line 1088. @4000000043f03e5818bb13dc [12897] warn: ) @4000000043f03f74331b2334 [30053] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000043f03f74331b271c [30053] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN598> line 68. @4000000043f03f74331b2eec [30053] warn: ) @4000000043f03f7826560894 [13980] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000043f03f7826561064 [13980] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN862> line 335. @4000000043f03f7826561c1c [13980] warn: ) After applying this patch, I got rid of "round-robin" option. Everything is running fine since then (would be almost 2 whole days now). I often see messages like below: @4000000043f532c62bad4064 [9205] warn: pyzor: check failed: __alarm___ignore__ @4000000043f532eb106eb7ec [19553] warn: razor2: razor2 check failed: __alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Ma il/SpamAssassin/Plugin/Razor2.pm line 316. @4000000043f533783b6905c4 [9205] warn: razor2: razor2 check failed: __alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Mai l/SpamAssassin/Plugin/Razor2.pm line 316. @4000000043f533a9325fce2c [9205] warn: pyzor: check failed: internal error @4000000043f533b50382cbcc [9205] warn: razor2: razor2 check failed: __alarm___ignore__ at /usr/lib/perl5/site_perl/5.8.0/Mai l/SpamAssassin/Plugin/Razor2.pm line 316. But SA keeps running and keeps doing its job.
Created attachment 3380 [details] aggregated patch against b3_1_0 svn branch take 3 here's a new take on patch 3366; it includes the "silly fix" and the fix for bug 4590.
Created attachment 3381 [details] aggregated patch against the 3.1.0 release tarball take 3 and here's the redo of patch 3368. Again, this is the same code with the fix for bug 4590 and the "silly fix" merged in. If you're seeing SpamdForkScaling warnings or errors, we STRONGLY recommend retrying with svn trunk, or one of these two patches -- probably this one.
ps: Gambit, that patch update would remove those noisy warnings. but they should be pretty harmless anyway, so it's really optional.
Couple of my boxes were still having problems reported in bug 4590. I have just rolled out the patch (id=3381) for the servers. Everything seems ok as of yet, the warnings have disappeared.
fwiw, with this and bug 4791, I see no warnings during daily spamd use with svn trunk. everything's working 100% on my colo, which is nice.
On the 7th day (after applying the patch id=3381), spamd on one of my servers just crashed again. The output from psaux | grep spamd after crash: root 2021 0.0 0.0 1444 284 ? S Feb24 0:00 supervise spamd qmaill 2029 0.0 0.0 1592 344 ? S Feb24 0:20 /usr/local/bin/multilog t s20000000 n30 /var/log/spamd root 2036 0.0 1.3 32096 28800 ? S Feb24 1:54 /usr/bin/perl - T -w /usr/bin/spamd -q -v -x -u vpopmail -s std err --max-children=16 --min-spare=4 --max-spare=8 --max-conn-per-child=100 -- timeout-child=28 vpopmail 22999 0.0 1.8 42320 37388 ? S 09:18 0:05 spamd child vpopmail 27055 0.0 1.4 34900 29540 ? S 09:25 0:00 spamd child vpopmail 19934 0.1 1.6 38780 33904 ? S 10:06 0:12 spamd child vpopmail 23915 0.0 1.4 35032 29756 ? S 10:12 0:00 spamd child vpopmail 25658 0.0 1.4 35308 29996 ? S 10:15 0:00 spamd child vpopmail 30469 0.0 1.4 34904 29544 ? S 10:24 0:00 spamd child vpopmail 8372 0.4 1.7 41616 36712 ? S 10:40 0:22 spamd child vpopmail 9460 0.1 1.6 38176 33240 ? S 10:42 0:05 spamd child root 11404 0.0 0.0 3936 572 pts/2 S 12:06 0:00 grep spamd Log says: 0000044038f8e0eddc74c [9460] info: spamd: result: . -2 - BAYES_00,FORGED_RCVD_HELO scantime=2.4,size=5393,user=andy@FFF.com,uid=89,required_score=7.5,rhost=localho st.localdomain,raddr=127.0.0.1,rport=20815,mid=<44038F5E.3020709@ld FGD.org>,bayes=0,autolearn=ham @4000000044038f8e1192aa34 [2036] info: prefork: child states: IIIIIIII @4000000044038f900ba38f94 [8372] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 20818 @4000000044038f900c8a9434 [8372] info: spamd: processing message <E3F38E13-C70A- 4805-AD03-334EE21F0CB8@VVV.com> for andrewg@BBBB.com:89 @4000000044038f9226cb01e4 [8372] info: spamd: clean message (-2.5/7.5) for andrewg@BBBB.com:89 in 2.5 seconds, 11207 by tes. @4000000044038f9226cb1d3c [8372] info: spamd: result: . -2 - AWL,BAYES_00,HTML_MESSAGE scantime=2.5,size=11207,user=andrewg@AAAA.com,uid=89,required_score=7.5,rhost=lo calhost.localdomain,raddr=127.0.0.1,rport=20818,mid=<E3F38E13-C70A-4805-AD03- 334EE21F0CB8@VVV.com>,bayes=0,autolearn=ham @4000000044038f922a12ca44 [2036] info: prefork: child states: IIIIIIII 0000044038fa53a2da764 [23915] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 20826 @4000000044038fa53afe64e4 [23915] info: spamd: processing message <20060227234741.HYGM15112.CFD.com@omta03ps> for lisa@YYYY.com:89 @4000000044038fa931c186a4 [25658] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 20828 @4000000044038fa932c9e654 [25658] info: spamd: processing message <002101c63bf8 $325c5d40$0300a8c0@KIDZBODZ1> for jackie@YYYY.com:89 @4000000044038fac026c819c [27055] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 20829 @4000000044038fac03211814 [27055] info: spamd: processing message (unknown) for wobbz@XXX.com:89 @4000000044038fac24a3e7dc [30469] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 20830 @4000000044038fac255ae39c [30469] info: spamd: processing message (unknown) for fritz@XXX.com:89 @4000000044038fb530b926f4 [8372] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3889> line 118. @4000000044038fb530bac8ec [8372] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3889> line 118. @4000000044038fb530bc91f4 [8372] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fb530bcf784 [8372] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3889> line 118. @4000000044038fb530bd41bc [8372] warn: ) @4000000044038fb81da5d9f4 [9460] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3692> line 130. @4000000044038fb81da76094 [9460] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3692> line 130. @4000000044038fb81da91de4 [9460] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fb81da97f8c [9460] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3692> line 130. @4000000044038fb81da9c9c4 [9460] warn: ) @4000000044038fba0cb83bb4 [19934] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3750> line 118. @4000000044038fba0cb9c63c [19934] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3750> line 118. @4000000044038fba0cbbcdc4 [19934] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fba0cbc2f6c [19934] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3750> line 118. @4000000044038fba0cbc79a4 [19934] warn: ) @4000000044038fbe11b94504 [22999] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3627> line 77. @4000000044038fbe11bacba4 [22999] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3627> line 77. @4000000044038fbe11bccb5c [22999] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fbe11bd30ec [22999] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3627> line 77. @4000000044038fbe11bd7b24 [22999] warn: ) @4000000044038fc13a7dfdcc [23915] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3625> line 67. @4000000044038fc13a7f6144 [23915] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3625> line 67. @4000000044038fc13a81592c [23915] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fc13a81bad4 [23915] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3625> line 67. @4000000044038fc13a82050c [23915] warn: ) @4000000044038fc531e8255c [25658] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3628> line 294. @4000000044038fc531e984ec [25658] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3628> line 294. @4000000044038fc531eb80bc [25658] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fc531ebea34 [25658] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3628> line 294. @4000000044038fc531ec346c [25658] warn: ) @4000000044038fc802b438fc [27055] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3598> line 36. @4000000044038fc802b5429c [27055] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3598> line 36. @4000000044038fc802b72ecc [27055] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fc802b77904 [27055] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3598> line 36. @4000000044038fc802b7bf54 [27055] warn: ) @4000000044038fc824caa9bc [30469] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3631> line 36. @4000000044038fc824cbedf4 [30469] error: child processing timeout at /usr/bin/spamd line 1088, <GEN3631> line 36. @4000000044038fc824cda75c [30469] warn: rules: failed to run RAZOR2_CF_RANGE_E8_51_100 test, skipping: @4000000044038fc824cdf964 [30469] warn: _(child processing timeout at /usr/bin/spamd line 1088, <GEN3631> line 36. @4000000044038fc824ce2844 [30469] warn: ) After a spamd restart, it came back on with following message: @400000004403a2410bb6023c [2036] info: spamd: server killed by SIGTERM, shutting down @400000004403a241118fd3a4 [2036] info: spamd: server killed by SIGTERM, shutting down @400000004403a241118fdb74 [2036] error: Can't call method "close" on an undefined value at /usr/bin/spamd line 1998 during gl obal destruction. @400000004403a241118fe344 [2036] error: Can't call method "close" on an undefined value at /usr/bin/spamd line 1998 during gl obal destruction. @400000004403a2432317b22c [12374] info: spamd: server started on port 783/tcp (running version 3.1.0) @400000004403a243231a55f4 [12374] info: spamd: server pid: 12374 @400000004403a243239e280c [12374] info: spamd: server successfully spawned child process, pid 12442 @400000004403a2432414ea64 [12374] info: spamd: server successfully spawned child process, pid 12446 @400000004403a243249ab84c [12374] info: spamd: server successfully spawned child process, pid 12447 @400000004403a243250e4e24 [12374] info: spamd: server successfully spawned child process, pid 12449 @400000004403a243258723bc [12374] info: spamd: server successfully spawned child process, pid 12450 @400000004403a2432606a844 [12374] info: spamd: server successfully spawned child process, pid 12453 @400000004403a24326801634 [12374] info: spamd: server successfully spawned child process, pid 12456 @400000004403a24326f3c37c [12374] info: spamd: server successfully spawned child process, pid 12457 @400000004403a2432744f4a4 [12442] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 48138 @400000004403a24327540fd4 [12374] info: prefork: child states: BIIIIIIS @400000004403a2432789930c [12446] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 48090 @400000004403a243287a9644 [12447] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 48139 This is a failry busy server, (4.5K messages per hour during a daily peak). But please let me know if any more info in needed for debugging.
+1, I've had all these patches in production for a while.
Gambit: looks a lot like Razor was timing out, there. Unfortunately the patch cannot fix Razor if its servers start timing out. ;) What were the effects of the crash? Spamd and the child processes were still running; were they refusing to process mail? or processing slowly? (the latter would be normal if Razor was taking 30 seconds before timing out on each mail.) Were all mails being marked as ham but still delivered? Can you decode the timestamps on those log lines? Also, it's better to attach long log extracts as text attachments, instead of posting them in-line.
Created attachment 3395 [details] log
Hi, During this time, spamd and all the children were all running (as the PS aux output from previous post), but they stopped processing mails. My spamc runs with a timeout of 30 seconds, so all the mails were passed on for further processing after that timeout. Thanks.
Gambit: that looks like some kind of separate bug, where a timeout is "bleeding over" into another alarm handler; not related to this bug (althought presumably you have this patch applied). interesting.
Created attachment 3396 [details] more logs
This problem again was shown on another server (even though I am restarting spamd twice daily from cron). I am using patch 3396 on top of fresh 3.1 source. The boxes are all RHES3.
Sorry, the last post was meant to state "patch 3381 on top of fresh 3.1 source" instead of "patch 3396 on top of fresh 3.1 source".
I'd appreciate if the "bleeding alarms" issue can be opened as a new BZ; this one's getting unwieldy and will be hard to follow if another issue is thrown in here.
I'd open another bug, but it'd just say that you forgot to run "svn add" for Timeout.pm for attachment 3380 [details]. I started with a clean copy of 3.1 branch and patched it with 3380, but didn't remove the existing installed copy. Since Timeout.pm is missing from 3380 I'm still using the inital Timeout.pm that is missing the triple underscore fix from r374162, which fixed the alarm logging issue. So... I revoke my +1 for 3380 and vote +1 for an identical patch with the addition of Timeout.pm as it is in r374162. :)
+1 I've only done a small amount of testing, but I don't see the problem as described so ... btw, I don't think we need a new patch, let's just agree to do: svn cp https://svn.apache.org/repos/asf/spamassassin/trunk/lib/Mail/SpamAssassin/Timeout.pm lib/ Mail/SpamAssassin/Timeout.pm after the patch is applied and before the commit.
+1 on patch 3380 plus svn add of lib/Mail/SpamAssassin/Timeout.pm from trunk And in the interest of getting things cleaned up for a 3.1.1 release: Committed to 3.1 branch revision 384590.