SA Bugzilla – Bug 3828
spamd parent stops accepting requests
Last modified: 2005-03-16 20:58:05 UTC
I'm currently trying to track down another spamd problem(s) that seems to be pretty random. Happens about once every day on a 4 box spamd cluster. One of the boxes will just drop to 0 load and spamd will apparently be hung. restarting spamd makes it take right off again. Its random in which box it occurs on. Also, over the weekend I caught another box on another network doing similar things. I have some ps and strace info from each, not sure how much it will help until i can debug this out further. There is nothing in the spamd log that indicates it has stopped. Weird thing I always see when it happens is the port 783/tcp shows as filtered until i restart spamd. Here is a little info. First, a process listing, 15 children. Parent is 17752. [root@spamd2 root]# ps auxwww | grep spamd root 1217 0.0 0.0 1340 60 ? S Jun23 0:00 supervise spamd root 17752 0.0 2.8 23652 21888 ? S Sep22 0:05 /usr/bin/perl - T -w /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 -q -x -m 15 --max-conn-per-child 25 root 28432 0.0 3.6 29472 27596 ? S Sep22 0:07 spamd child root 28511 0.0 3.2 26916 25136 ? S Sep22 0:08 spamd child root 28520 0.0 3.2 26252 24532 ? S Sep22 0:06 spamd child root 28702 0.0 3.6 29756 27884 ? S Sep22 0:07 spamd child root 28832 0.0 3.2 27016 25220 ? S Sep22 0:04 spamd child root 28842 0.0 3.2 26892 25100 ? S Sep22 0:02 spamd child root 28882 0.0 3.0 25116 23400 ? S Sep22 0:01 spamd child root 26097 0.0 3.2 26780 24916 ? S 04:11 0:06 spamd child root 26142 0.0 3.1 25848 24052 ? S 04:11 0:03 spamd child root 26147 0.0 3.5 28636 26840 ? S 04:11 0:07 spamd child root 26148 0.0 3.2 26416 24540 ? S 04:11 0:08 spamd child root 26213 0.0 3.3 27600 25804 ? S 04:12 0:04 spamd child root 26238 0.0 3.1 26132 24284 ? S 04:12 0:02 spamd child root 26283 0.0 3.0 25292 23580 ? S 04:12 0:02 spamd child root 26308 0.0 3.1 26032 24300 ? S 04:13 0:03 spamd child root 28492 0.0 0.0 1764 592 pts/4 R 08:12 0:00 grep spamd trying to echo|spamc never returns... stracing the parent shows it in pause() state. [root@spamd2 root]# strace -p 17752 pause( checking the tcp socket shows it is filtered? [root@spamd2 root]# nmap -sT -p783 localhost Interesting ports on localhost.localdomain (127.0.0.1): Port State Service 783/tcp filtered hp-alarm-mgr restarting spamd makes it available again... [root@spamd2 root]# qmailctl restart [root@spamd2 root]# nmap -sT -p783 localhost Interesting ports on localhost.localdomain (127.0.0.1): Port State Service 783/tcp open hp-alarm-mgr ------------------------------------------- and another... although this looks a little different. very long run times on spamd procs (looks like DoS again)? 11:55pm up 175 days, 3:50, 1 user, load average: 2.52, 2.34, 2.29 142 processes: 136 sleeping, 4 running, 2 zombie, 0 stopped CPU states: 99.0% user, 0.9% system, 0.0% nice, 0.0% idle Mem: 253876K av, 232444K used, 21432K free, 0K shrd, 40336K buff Swap: 522104K av, 177936K used, 344168K free 64336K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 31797 root 15 0 26480 196 52 R 46.9 0.0 16781m spamd 23746 root 14 0 26052 36 4 R 46.7 0.0 940:22 spamd echo|spamc results in nothing. [root@email root]# echo | spamc process listing running with 2 childs. [root@email root]# ps auxwww | grep spamd root 8153 0.0 0.3 25024 904 ? S Aug26 0:49 /usr/bin/perl - T -w /usr/bin/spamd --syslog-socket=none -q -x -m 4 --max-conn-per-child 25 root 10541 0.0 2.7 25384 6920 ? S 20:03 0:00 spamd child root 25227 16.1 8.1 27432 20664 ? S 23:56 0:05 spamd child strace the parent shows wait(), children show read(). [root@email root]# strace -p 8153 wait4(-1, [root@email root]# strace -p 25227 read(6, <unfinished ...> [root@email root]# strace -p 10541 read(6, nothing in the spamd debug log until spamd is restarted. -------------------------------------------- hopefully i'll discover more as it happens again.
Subject: Re: New: spamd parent stops accepting requests On Mon, Sep 27, 2004 at 06:46:11AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > trying to echo|spamc never returns... stracing the parent shows it in pause() > state. the parent sits in a sleep() loop waiting for signals, such as sigchld and such. the children are what do all the work in 3.0.0.
any dev on right now that wants to take a look at this? i have it pulled from the DNS for a bit to try and figure out whats up. it sure looks to be a problem with the parent... [root@spamd1 /]# telnet localhost 783 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. quit this never responds and nothing logs until i restart spamd.
hmm.. looks like this may be the problem. netstat reports 16 tcp sessions established to 783/tcp, which would indicate all the children are busy right? however, there is no load, and nothing in the log ever showed max children reached. [root@spamd1 log]# netstat -na | grep :783 | grep ESTABLISHED | wc -l 16 [root@spamd1 log]# netstat -na | grep :783 | grep CLOSE_WAIT | wc -l 123 so the parent wont accept any new connection because it thinks all the children are busy, when they really arent? ps shows they are all sleeping... 15515 ? S 0:03 /usr/bin/perl -T -w /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 -q -x -m 15 --max-conn-per-child 75 9076 ? S 0:10 spamd child 9127 ? S 0:09 spamd child 9152 ? S 0:15 spamd child 28765 ? S 0:20 spamd child 29097 ? S 0:15 spamd child 29139 ? S 0:12 spamd child 20022 ? S 0:26 spamd child 20177 ? S 0:23 spamd child 20191 ? S 0:23 spamd child 20444 ? S 0:17 spamd child 20611 ? S 0:10 spamd child 20682 ? S 0:08 spamd child 20696 ? S 0:18 spamd child 20698 ? S 0:07 spamd child 21294 ? S 0:04 spamd child strace shows they are in a read(). [root@spamd1 log]# strace -p 21294 read(6, is something just not cleaning up properly? here again is where a timeout on spamd children would be a good thing. the parent could sigterm the child if it hasnt finished in x seconds.
Subject: Re: spamd parent stops accepting requests On Tue, Sep 28, 2004 at 06:14:27AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > the DNS for a bit to try and figure out whats up. it sure looks to be a problem > with the parent... The parent has nothing to do with the connection, period. All the parent does is spawn children when necessary. The children are the ones which accept the connection, read in the request, do the processing, etc. > [root@spamd1 log]# strace -p 21294 > read(6, what is fd 6? do an lsof on the process.
err.. now i'm going to have to wait until it happens again. seems to be popping up about every 2 hours or so. i've come up with a lot more info in the meantime, maybe this will be more helpful? spamd1 has gone away 3 times this morning.. everytime it does (and comes back usually 30+ minutes later), you see the 'bad protocol' dbg message in the log. from 9:11 to 9:27 GMT it stopped. @4000000041592ae93729be04 2004-09-28 09:11:59 [20696] i: connection from p35n0.ruraltel.net [24.225.0.35] at port 60169 @4000000041592e7105aea894 2004-09-28 09:27:03 [9152] i: bad protocol: header error: (closed before headers) from 9:27 to 10:48 GMT it stopped. @4000000041592e7f0cdbaa04 2004-09-28 09:27:17 [9152] i: connection from p36n0.ruraltel.net [24.225.0.36] at port 50510 @40000000415941a135605c2c 2004-09-28 10:48:55 [9152] i: bad protocol: header error: (closed before headers) from 10:49 to 12:26 GMT it stopped. @40000000415941ac2dd64e6c 2004-09-28 10:49:06 [9152] i: connection from p35n0.ruraltel.net [24.225.0.35] at port 59890 @40000000415958941293ed94 2004-09-28 12:26:50 [20696] i: bad protocol: header error: (closed before headers) when it was down this last time, i decided to kill -9 15515 a.k.a the last 'spamd child' on my ps listing. once i did that, approx 130 email's were processed by the spamd children. The weird thing is, this box was completely out of the DNS and would not be accepting 'new' connections. all those message had to be sitting in the tcp sockets waiting to be processed, and once the childs became available, they flooded through? also saw a warn() and a dbg about a dead tcp socket, after i killed that one spamd child. @40000000415981e41bfd1f7c Use of uninitialized value in subroutine entry at /usr/lib/perl5/5.6.1/i386-linux/Socket.pm line 312. @40000000415981e41c00924c 2004-09-28 15:23:06 [24774] i: error: Bad arg length for Socket::unpack_sockaddr_in, length is 0, should be 16 at /usr/lib/perl5/5.6.1/i386-linux/Socket.pm line 312._ Transport endpoint is not connected, continuing so the 'bad protocol: header error' seems to be the constant factor here of when it comes back online by itself. question is what happens to the previous 'connection from' that causes everything else to stop... from spamd4 yesterday.. i: result: Y 5 - BAYES_20,DCC_CHECK,HTML_90_100,HTML_IMAGE_RATIO_04,HTML_MESSAGE,MIME_BOUND_NEXTP ART,OPTING_OUT,URIBL_SBL,URIBL_WS_SURBL scantime=0.3,size=8798,mid=<10962760108851042@mail2.ruraltel.net>,bayes=0.098527 1200051288,autolearn=no i: connection from p34n0.ruraltel.net [24.225.0.34] at port 47706 i: bad protocol: header error: (closed before headers) i: connection from p35n0.ruraltel.net [24.225.0.35] at port 38263 on spamd3 yesterday i: result: Y 11 - BAYES_50,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL scantime=0.2,size=2940,mid=<200409270903.i8R93xH6016221@vabeach.matchmaker.com>, bayes=0.573393109179598,autolearn=no i: connection from p35n0.ruraltel.net [24.225.0.35] at port 37916 i: bad protocol: header error: (closed before headers) i: connection from p35n0.ruraltel.net [24.225.0.35] at port 38257 from spamd4 on the 25th.. i: result: . 0 - AWL,BAYES_00,HTML_MESSAGE,HTML_NONELEMENT_00_10,HTML_TEXT_AFTER_BODY,HTML_TEXT_A FTER_HTML,MIME_HTML_ONLY,TO_ADDRESS_EQ_REAL,URIBL_SBL scantime=1.3,size=43426,mid=<24628953.1096103652348.JavaMail.root@straight-poop- mail.com>,bayes=0,autolearn=no i: connection from p36n0.ruraltel.net [24.225.0.36] at port 59415 i: bad protocol: header error: (closed before headers) i: connection from p35n0.ruraltel.net [24.225.0.35] at port 45541 is anyone else seeing this 'header error'?
here is what i have found. take a look at the ps listing below and you'll see the START time on the first 7 children is 03:XX and the last 8 children is 12:XX.. so i lsof a couple of the 03:XX children to see what the hell they are still doing. looks like FD6 is the tcp socket to the spamc client that initiated the request. gdb shows the stale children are in a read() state still 0x420dae14 in read () from /lib/i686/libc.so.6 there is no active spamc client on the other side of the TCP connection however, since they run with a 60 second timeout, so i assume that is why its stuck in read() forever? here is the ps and lsof's of a could old children. once all the children get old, spamd no longer processes requests, hence my problem. root 18297 0.0 0.3 27240 2940 ? S Sep27 0:02 /usr/bin/perl -T -w /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 -q -x -m 15 --max-conn-per-child 75 root 2015 0.0 3.1 31756 24420 ? S 03:07 0:22 spamd child root 2577 0.0 3.1 31080 23740 ? S 03:28 0:14 spamd child root 2586 0.0 3.3 33092 25668 ? S 03:29 0:14 spamd child root 3069 0.0 3.2 32356 24848 ? S 03:46 0:09 spamd child root 3210 0.0 3.0 30588 22964 ? S 03:53 0:05 spamd child root 3216 0.0 2.9 30260 22592 ? S 03:53 0:05 spamd child root 3342 0.0 2.9 30468 22692 ? S 03:59 0:04 spamd child root 30254 3.4 3.2 32376 24968 ? S 12:29 0:20 spamd child root 30308 3.5 3.4 33440 26084 ? S 12:30 0:21 spamd child root 30367 3.6 3.4 33444 26248 ? S 12:30 0:20 spamd child root 30379 3.2 3.3 32880 25488 ? S 12:31 0:17 spamd child root 30439 3.0 3.2 32020 24628 ? S 12:32 0:14 spamd child root 30544 2.8 3.0 30736 23264 ? S 12:33 0:11 spamd child root 30755 2.5 2.9 29956 22196 ? S 12:37 0:04 spamd child root 30771 2.1 2.8 29312 21460 ? S 12:37 0:03 spamd child root 30948 0.0 0.0 1764 588 pts/2 R 12:40 0:00 grep spamd [root@spamd3 spamd]# lsof -p 3216 .. spamd 3216 root 0r CHR 1,3 67004 /dev/null spamd 3216 root 1w FIFO 0,5 2768 pipe spamd 3216 root 2w FIFO 0,5 2768 pipe spamd 3216 root 3r REG 3,2 73077 328406 /usr/bin/spamd spamd 3216 root 4w FIFO 0,5 2768 pipe spamd 3216 root 5u IPv4 1362568002 TCP *:783 (LISTEN) spamd 3216 root 6u IPv4 1382114401 TCP 24.225.0.50:783->p35n0.ruraltel.net:60137 (ESTABLISHED) [root@spamd3 spamd]# lsof -p 3342 .. spamd 3342 root 0r CHR 1,3 67004 /dev/null spamd 3342 root 1w FIFO 0,5 2768 pipe spamd 3342 root 2w FIFO 0,5 2768 pipe spamd 3342 root 3r REG 3,2 73077 328406 /usr/bin/spamd spamd 3342 root 4w FIFO 0,5 2768 pipe spamd 3342 root 5u IPv4 1362568002 TCP *:783 (LISTEN) spamd 3342 root 6u IPv4 1382113269 TCP 24.225.0.50:783->p35n0.ruraltel.net:59749 (ESTABLISHED)
Subject: Re: spamd parent stops accepting requests On Tue, Sep 28, 2004 at 10:54:41AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > there is no active spamc client on the other side of the TCP connection however, > since they run with a 60 second timeout, so i assume that is why its stuck in > read() forever? in theory, the read() should return an EOF since the other end disconnected and then the spamd code would catch the failure and exit. it seems like what is happening is that all of the children get made busy with requests, and requests keep coming in and end up getting queued by the OS waiting for a child to do an accept(). At some point, the client decides to give up, and either the OS doesn't clear the connection out of the queue, and/or IO::Socket isn't doing the right thing. There's 2 getline() commands. The first gets the request and explicitly checks for an error. The other is a loop to read in the messge: while ( $_ = $client->getline() ) { which lets the while check for the error. Right after the loop, there's a call to parse(), without an error check to make sure that there was actually any data read for the message. I don't know what parse() will do at that point, but there should be a quick check in spamd to make sure a message was actually read in and return an error if there wasn't. BTW: This seems related to bug 3782.
> BTW: This seems related to bug 3782. "You are not authorized to access bug #3782."
strace output would be *really* useful here. (although possibly hard to gather :( )
I was in the FreeNode SpamAssassin IRC chat room and a guy in there told me to post some of this info. Apparently, we are experiencing the same issues. OS: RedHat Enterprise Linux ES release 3 (Taroon Update 2) Kernel: 2.4.21-20.ELsmp Processor: Intel(R) Pentium(R) 4 CPU 2.80GHz HT Perl: 5.8.1 We are running CPanel as our Control Panel on the server. We have been in talks with CPanel Technical Support and they seem to be at a loss as to why this is occuring. We are running Exim 4.42 with SpamAssassin 3.0.0. The ROOT SpamD daemon runs, spawns 5 SpamD child processes, and the SpamC processes run as users. For some reason, child SA processes will hang, causing SMTP 421 Lost Incoming Connection errors that pretty much stops the flow of all email on the server. The only way we have been able to kill SpamAssassin when this happens is to execute "kill -9 `ps -aef | grep spam | grep -v grep | awk '{print $2}'`" and then invoke "spamd -d" to restart it. Sometimes this works, sometimes not. I have grabbed some STRACE data from the server and the SpamC processes seem to be stuck in a RECV state, as indicated in the data below. # strace -f -p 13506 Process 13506 attached - interrupt to quit recv(3, # strace -f -p 14119 Process 14119 attached - interrupt to quit recv(4, # strace -f -p 14106 Process 14106 attached - interrupt to quit recv(3, # strace -f -p 14084 Process 14084 attached - interrupt to quit recv(3, Hopefully, this will be the ammo you need to start killing this bug.
well, i can only find this happening in 1 location right now, and really the only difference is this location runs a spamd cluster, and dns round robins the requests... spamd is invoked like this: /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 - q -x -m 15 --max-conn-per-child 75 and spamc clients connect from remote hosts like this... /usr/bin/spamc -dspamd -H -p783 -t30 -x -u pathall@ruraltel.net the differences between this setup and all other setups i have that run just fine are: 1) -dspamd vs -dlocalhost 2) -H flag on spamc call 3) -A flag is 127.0.0.0/8,24.225.0.0/24 vs just 127.0.0.0/8 i thought maybe it was the network infastructure causing it since i have only seen this happen at a single location, but there doesnt appear to be anything wrong here. The 3 MTA's are plugged into the same switch that the spamd cluster is on, so there is no routing or anything involved here. i'm still trying to get better debug of it, its very difficult to do. i have written some spamd child timeouts into the spamd daemon that appear to be working good also. prevents me from having to restart spamd now to regain mail flow. i'll try and clean it up and add a patch for review if anyone is interested... however, this is not the answer to fixing this problem, but may be the answer to help prevent runaway children that we have seen in the past. d
could everyone seeing this bug, post their spamd and spamc command line switches? perhaps it's something to do with -H...
I'm experiencing the same problem with spamc hanging on my mail server: bash-2.05b# ps aux | grep spamc qscand 23346 0.0 0.0 1904 4 ? S 19:40 0:00 /usr/bin/spamc -c -f -d 192.168.1.23 -u user@example.com bash-2.05b# strace -f -p 23346 recv(3, In the mean time a spamd child hangs on the spamd server: bash-2.05a# ps aux | grep spamd spamc 19519 104.1 2.7 22084 14128 ?? R 6:40PM 11:46.04 spamd child spamc 19507 0.0 1.8 19728 9608 ?? S 6:40PM 0:01.82 perl -T -w /usr/bin/spamd -i 192.168.1.23 -A 192.168.1.18 -x -u spamc spamc 19509 0.0 2.4 20676 12576 ?? S 6:40PM 0:00.82 spamd child spamc 19510 0.0 2.3 20260 12132 ?? S 6:40PM 0:00.40 spamd child Finally, spamc times out while spamd keeps eating CPU. This is the situation on the spamd server: bash-2.05a# lsof | grep 19519 perl 19519 spamc cwd VDIR 14,2 6528 2 / (/dev/disk0s2) perl 19519 spamc 0r VCHR 3,2 0t0 34932996 /dev/null perl 19519 spamc 1w VCHR 3,2 0t1012094 34932996 /dev/null perl 19519 spamc 2w VCHR 3,2 0t1012094 34932996 /dev/null perl 19519 spamc 4r VREG 14,2 73087 377323 / (/dev/disk0s2) perl 19519 spamc 5w VREG 14,2 502932 377858 / (/dev/disk0s2) perl 19519 spamc 6u inet 0x02f85f4c 0t0 TCP 192.168.1.23:783 (LISTEN) perl 19519 spamc 7u inet 0x02f874cc 0t0 TCP 192.168.1.23:783->192.168.1.18: 48499 (CLOSE_WAIT) When all the spamd childs are blocked like this one my spamd server just stops processing requests to port 783. I set up a cron every 30 minutes to run "kill `ps -aex | grep 'spamd child' | grep '?? R' | grep -v 'grep' | awk '{print $1}'`". This is a temporary solution of course.
Can someone experiencing this issue post the following: ps auwwx (or -ef) output strace the spamd pids lsof the spamd pids strace the spamc pids lsof the spamc pids I'm going to guess the lsof of spamc shows it waiting on results from spamd, and spamd is waiting for input from spamc. but no one's posted the above information all in the same comment yet. ;)
quick note. Alessandro uses -A flags on spamd, and -d flag on spamc just like i do. dunno if that plays any importance or not. >> Can someone experiencing this issue post the following: i think have provided most of what i can so far. stracing the spamd children are easy, but you will find there are no spamc clients to strace because they have all went away due to the spamc timeout. dunno why spamd children still hang onto a socket with no client on the other end?? my guess is since this is so random (2 hours to make 15 children stale doing 10k/hour), the only time it is really occuring is when the spamc timeout is encountered. i can probably set the timeout to 1 sec and try that theory myself. >> ps auwwx (or -ef) output see initial post.. note the start times. >> strace the spamd pids see initial post.. "strace the parent shows wait(), children show read()." >> lsof the spamd pids see comment #5 for lsof on 2 children. you want full lsof or just filehandles? >> strace the spamc pids impossible or very hard to match spamc requests with which spamd server they went to since it is round-robin. also, i never see stale spamc clients because of the spamc timeout. guess i could up the timeout limit to something really huge and try and catch it. >> lsof the spamc pids same thing as above.
By the way, we normally start SpamAssassin by running "spamd -d".
Also, we have put in this... /sbin/iptables -A INPUT -i lo -s 0.0.0.0/0 -d 0.0.0.0/0 -j ACCEPT I don't know if it helped or hurt the issue.
Created attachment 2420 [details] Spam message possibly causing the problem I think (but I'm not sure) that the problem is caused by something related to messages. This message caused spamc/spamd to hang. Does this help?
hmm.. i didnt have any problem scanning that message. 2004-10-06 13:12:22 [29542] i: connection from localhost.localdomain [127.0.0.1] at port 40076 2004-10-06 13:12:22 [29542] i: processing message <002c01c4ab50$dcb5ce3a$a9ca1cf0@ytp> for root:0. 2004-10-06 13:12:25 [29542] i: identified spam (8.3/4.0) for root:0 in 3.0 seconds, 50633 bytes. 2004-10-06 13:12:25 [29542] i: result: Y 8 - BODY_8BITS,EXTRA_MPART_TYPE,FORGED_OUTLOOK_TAGS,HTML_60_70,HTML_FONT_BIG,HTML_MESSAGE,HTML_TITLE_EMPTY,MIME_HTML_MOSTLY,MPART_ALT_DIFF,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,RCVD_BY_IP,UNWANTED_LANGUAGE_BODY scantime=3.0,size=50633,mid=<002c01c4ab50$dcb5ce3a$a9ca1cf0@ytp>,autolearn=no
Created attachment 2421 [details] Adds 2 new timeouts to spamd this patch will add a tcp connect timeout option (--timeout-tcp) which is the time between when the socket was made and then the first header. If a client makes a connection but never sends headers, their connection will close server side in --timeout-tcp seconds. after it sees the first header, the --timeout-child kicks in. if the time it takes from after it process the header till the time its ready to return exceeds this timeout, the connection will be closed. usage... -T or --timeout-tcp = tcp socket timeout before seeing headers. -t or --timeout-child = child processing timeout after seeing headers. testing.. just using low values here to demonstrate... [root@dev root]# spamd -d -T5 -t10 [root@dev root]# telnet localhost 783 Connected to localhost. Escape character is '^]'. SPAMD/1.0 79 Timeout: (5 second socket timeout reading input from client) Connection closed by foreign host. [root@dev root]# telnet localhost 783 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. PROCESS SPAMC/1.1 SPAMD/1.0 79 Timeout: (10 second timeout while trying to PROCESS) Connection closed by foreign host. logging timeouts through logmsg shows... 2004-10-04 21:04:24 [6406] i: connection from localhost.localdomain [127.0.0.1] at port 39424 2004-10-04 21:04:29 [6406] i: timeout: (5 second socket timeout reading input from client) 2004-10-04 21:04:31 [6407] i: connection from localhost.localdomain [127.0.0.1] at port 39425 2004-10-04 21:04:45 [6407] i: timeout: (10 second timeout while trying to PROCESS) for those of you experiencing this bug, this is a good patch for you until we can resolve the real issue.
I've applied that patch to an instance of ours that was quitting every day or two, so we'll see how it handles. Solaris 9 with Perl 5.8.3 on a V240 in case it helps. using --timeout-tcp=5 and --timeout-child=50
hrmm... i finally have a dead spamd on a new box. here are the details of what i found. as you can see, it was using 99% CPU for a long ass time before i noticed it. <insert bad sysadmin joke here>. PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 29579 root 15 0 24620 14M 8 R 99.0 2.9 76:06 spamd last 2 lines of a gdb spamd 29279 shows.. > Loaded symbols for /lib/libnss_dns.so.2 > 0x08099ddd in Perl_runops_standard () couple of my children are now zombies... the other one is using all the resources on the box. [root@mailgw root]# ps auxww | grep spamd root 8908 0.0 0.0 23972 184 ? S Sep24 0:01 /usr/bin/perl -T -w /usr/bin/spamd --syslog-socket=none -r /var/run/spamd.pid -A 127.0.0.0/8 -i 127.0.0.1 -q -D -x -m 3 --max-conn-per-child 10 root 24796 0.0 0.0 0 0 ? Z Sep30 0:03 [spamd <defunct>] root 24929 0.0 0.0 0 0 ? Z Sep30 0:03 [spamd <defunct>] root 29579 86.0 2.9 28764 15036 ? R 12:14 76:37 spamd child [root@mailgw root]# strace -p 29579 [root@mailgw root]# good thing is, i modified my spamd dbg to log the pid on all lines... so i can easily find what this one was doing..... [root@mailgw spamd]# grep "\[29579\]" * | tai64nlocal | tail -4 Oct 06 12:23:10.599032 check[29579]: [ 8] already have 1 discovery servers Oct 06 12:23:10.599094 check[29579]: [ 8] Checking with Razor Discovery Server 66.151.150.12 Oct 06 12:23:10.599172 check[29579]: [ 6] No port specified, using 2703 Oct 06 12:23:10.599208 check[29579]: [ 5] Connecting to 66.151.150.12 ... appears to have hung on razor2 :( [root@mailgw spamd]# razor-admin -v Razor Agents 2.40, protocol version 3 i'm gonna upgrade to 2.61 to see if this goes away. dallas
> appears to have hung on razor2 :( okay.. scratch the razor2 comment. i'm not 100% sure since i thought i had 'sub dbg' changed to include pid on every line and i didnt... damn! i was trying to trace it through my debug log and its just too tough to do without pids on every line. i have my dbg() changed so i can figure this out next time hopefully.
*** Bug 3782 has been marked as a duplicate of this bug. ***
Any more ideas regarding this bug? Are we closer to a fix?
I keep hearing about a patch, but when will it be released to the general public.
the patch is on comment 20: http://bugzilla.spamassassin.org/attachment.cgi?id=2421&action=view people who are seeing this issue -- does this patch fix it/work around it? we need feedback as to whether it works or not, before we can consider incorporating that fix into 3.0.1. BTW, a hang in Perl_runops_standard() generally would indicate a perl compiler bug, afaik... trying other versions of perl may help.
We haven't had any problems since applying this, and it's been a little over a week, with SA processing ~200k emails a day. Even if this patch doesn't solve everyone's issues, it's a good thing to have, as the more options the better.
Subject: Re: spamd parent stops accepting requests As I believe I mentioned, I made a similar patch with alarms, and it worked perfectly. I have tried patch you've sent and it also works good, so I guess it's ready for 3.0.1.
Yeah, but that is the thing. We install SpamAssassin directly for the CPAN Distributions, custom hacks are really something we want to play with. When this patch becomes part of the standard CPAN Distribution, we will install it.
Correction... "custom hacks AREN'T really something we want to play with"
I am having a similar problem to what other people have described. I am running FreeBSD and am installing SpamAssassin using the FreeBSD port p5-Mail-SpamAssassin. spamd runs fine for a few hours but then stops spawning new children and stops accepting connections (all spamc connections timeout until spamd is restarted). I've applied the patch and will see if this changes the behavior. I would say that if I can go 24 hours without a problem then the patch fixed it. /l
Dallas's patch seems to be providing some much needed relief for the systems that I've seen affected by this issue. Furthermore, I think its just good form to have the spamd children timeout after a period of time by default. I know this is kind of a kludge for the real issue, but it seems workable and reasonable. It's too bad that 3.0.1 is already out; I would've liked to see this patch go into that release.
Dallas Engelken, http://bugzilla.spamassassin.org/show_bug.cgi?id=3828#c20 Any plans on making that one SpamD/SpamC timeout patch of yours available for SpamAssassin 3.0.1? Or is the file that you patch the same in 3.0.1 that a new patch won't be needed for 3.0.1?
Sounds like we need to apply this patch for 3.0.2, since it's now widely reported to fix the bug
Subject: Re: spamd parent stops accepting requests On Sat, Oct 23, 2004 at 12:29:31PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > ------- Additional Comments From jm@jmason.org 2004-10-23 12:29 ------- > Sounds like we need to apply this patch for 3.0.2, since it's now widely > reported to fix the bug > You mean, widely reported to work around the bug. I don't believe it fixes anything does it? It just adds a timeout(s) so if things get wedged they will become unwedged. I would hope we could find the actual cause of the bug. I do however think the timeout patches are a good thing. Michael
I know. The patch has helped, but hasn't totally fixed the issue. I thank Dallas Engelken for engineering this patch, it has helped. Now, we need to get the developers of SpamAssassin to know about this issue. This issue has been an issue since the days of SpamAssassin 2.64. I don't know why they didn't program in a timeout from the beginning. Every good programmer knows to code in a way to escape potential unending loops, and in this case, an unending loop is occuring. A loop that is sitting, waiting for data from a TCP/IP link that has been severed in a way that SpamAssassin doesn't know that it has been severed.
quick note that i'm seeing on this patch. a low timeout (ie 1-90 seconds) will timeout a child running a bayes auto-expire almost every time since the auto- expire usually takes more than 90 seconds on a 5-10mb _toks. so if you run a low child-timeout, your bayes database may never auto expire... i run a manual sa-learn --force-exire via cron, as this off loads it from the child and lets the child process mail instead of expiring toks. FYI, if you run spamc with a timeout and use auto expiry, you are most likely seeing spamc exit 74 assuming you are running spamc with -x flag... so this will most likely cause a tempfail. hopefully auto-expiry through spamd will someday be the job of the parent and not the children, this would help a bunch in this situation.
Subject: Re: spamd parent stops accepting requests On Tue, Nov 02, 2004 at 12:33:57PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > hopefully auto-expiry through spamd will someday be the job of the parent and > not the children, this would help a bunch in this situation. I actually have some plans about spamd for 3.1, where the expires won't happen in the scan/process child, but in a secondary "helper" child which the parent spawns off. ie: the scan child passes the information to the parent than a sync/expire needs to occur, then the parent passes that to the helper child to actually do the work.
Setting the tcp and child timouts to 10 seems to have solved the problem (thanks to ben@cpanel.net for pointing this out), and that seems to have solved the problem. Since spamd is a critical process generally left running on a perpetual basis, would it make sense to create a guard process whose job it is to monitor the spamd process functioning, and restart it if there is a problem? Apache and MySql for example use this technique (though I am not certain they check for "unresponsive" processes - they may only look for "down" processes)
We upgraded Perl from version 5.8.1 to Perl version 5.8.4 and things seem to have ironed out. I am really really really crossing my fingers.
You know, just when I thought we killed it, it came back. "The Attack of the SpamAssassin Lockups, Part 2!"
well.. i've been hearing and seeing a few reports of spamd still getting looped up even when running with --timeout-tcp and --timeout-child. just today i caught one, 1042 minutes on one child, 11 minutes on another... 1:05pm up 30 days, 4:42, 1 user, load average: 4.28, 3.56, 3.35 134 processes: 126 sleeping, 6 running, 2 zombie, 0 stopped CPU states: 98.5% user, 1.4% system, 0.0% nice, 0.0% idle Mem: 253876K av, 206464K used, 47412K free, 0K shrd, 12868K buff Swap: 522104K av, 221484K used, 300620K free 33372K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 7730 root 19 0 27544 32 4 R 45.1 0.0 1042m spamd 8490 root 19 0 25612 24 4 R 44.4 0.0 11:22 spamd 19905 root 20 0 26352 19M 1480 R 4.4 7.8 0:05 spamd 20512 qmailq 20 0 1924 1924 1148 R 1.4 0.7 0:00 perl5.6.1 20454 root 10 0 1116 1116 848 R 0.2 0.4 0:00 top 1 root 8 0 112 64 48 S 0.0 0.0 0:05 init that tells me that it must be looping inside another eval{} somewhere... otherwise the --timeout-child would kick in. I'll ask, before I go hunting... is there any eval'd code that is not alarmed? razor, pyzor, dcc, rbl, uribl, etc? unfortunately i wasnt able to gather much information on this one since i didnt have debug turned on. this is the second time in as many days that this box has done this. i thought it might have been bayes expiry, so i disabled bayes yesterday and the problem came back... so i have ruled out bayes (specifically auto-expiry and learning) for now and will debug further and hope the problem comes back tommorrow (as bad as that sounds :) d
That is kinda' reassuring, at least the part about us not being the only one it hangs up on still. We run SpamAssassin with the following options... /usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed-ips=127.0.0.1 -- pidfile=/var/run/spamd.pid --max-children=10 Any clues? We don't use Bayes.
tom, what version of spamasassin are you running?? i had build 51813 running, and found that razor2 was not honoring the alaram $timeout because it was not being set correctly due to the conf being razor2_timeout instead of razor_timeout in Plugin/Razor2.pm. old code... $self->{razor2_available} = 0; $mailsaobject->{conf}->{razor2_timeout} = 10; if razor2 had problems, it never received a SIGALM... not sure if this could have caused the long processing times that i was seeing or not... but it could be. i saw a system that had razor2 outbound ports blocked, and spamd would not even start... it would hang trying to prepare_objects() because $timeout was null. couple debug lines i added.. debug: generic: create new razor2 client debug: generic: prepare objects timeout set to [] CTRL+C # once i hard-coded that timeout, it started up just fine. any ways, i just thought i'd share that. I'm going to make sure the boxes that i normally see spamd acting up on are correctly setting razor_timeout. d
I am not sure what build of SpamAssassin that we are running, all I know is that it is version 3.0.1 that was released on October 22, 2004. We haven't tried any of the CVS-downloadable versions.
I don't think we use Razor, we just use the standard DNS Blacklists as well as the internal heuristics/rules of SpamAssassin to detect if a message is spam.
here's something to try. folks, could you change your spamd startup scripts to set: PERL_SIGNALS=unsafe export PERL_SIGNALS before running spamd? (this only works if you're using perl >= 5.8.1. perl 5.6.x already uses this anyway.) also what version of perl is everyone running? and platforms?
> PERL_SIGNALS=unsafe > export PERL_SIGNALS Where would we put this? Along with this stuff? > /usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed- ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10 Last time I checked, SpamD doesn't accept those inputs. I checked the inputs with "SpamD --help" and they weren't in it. We are running Perl version 5.8.4.
Oh, one more thing... RedHat Enterprise 3 and an Intel Pentium 4 HT CPU.
you put that in your startup script for spamd. init script or daemontools run script. platform here is redhat 7.3 (fedoralegacy updates), perl 5.6.1
Like this... /usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed- ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10 PERL_SIGNALS=unsafe export PERL_SIGNALS
Just out of curiosity, what is this supposed to do?
it means that signals (like alarm()) can interrupt even inside a single perl OP.
uh, btw, more like this: PERL_SIGNALS=unsafe export PERL_SIGNALS /usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed- ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10
Ok, I had to do some hacking of the CPanel Service Restart Scripts.
Had to hack /scripts/restartsrv_exim to input the commands inside of SYSTEM() calls before executing the SpamD startup commands. The only thing is, when I execute ENV at the ROOT command line, it doesn't show as one of the entries in the environment.
I don't think that's going to work :( anyone know a way to do it on that env?
I also put it into "/etc/rc.local" to have those two commands load at boot.
Don't know how much help this will be...but here it goes! Nov 10 02:08:11 cleveland spamd[30332]: timeout: (120 second timeout while trying to PROCESS) Nov 10 03:59:14 cleveland spamd[30332]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN2408> line 692. Nov 10 03:59:47 cleveland spamd[30430]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024. Nov 10 03:59:51 cleveland spamd[30312]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN2069> line 562. Nov 10 04:01:38 cleveland spamd[31503]: Failed to run BAYES_40 SpamAssassin test, skipping:__( child processing timeout at /usr/bin/spamd line 1024, <GEN2248> line 732._) Nov 10 04:06:24 cleveland spamd[31503]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN2274> line 76. Nov 10 04:09:49 cleveland spamd[19107]: Failed to run DNS_FROM_AHBL_RHSBL RBL SpamAssassin test, skipping:__( child processing timeout at /usr/bin/spamd line 1024, <GEN3> line 124._) Nov 10 08:18:31 cleveland spamd[20770]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024. Nov 10 10:20:41 cleveland spamd[20765]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN4333> line 53. Nov 10 10:20:43 cleveland spamd[4809]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN82> line 176. Nov 10 10:21:48 cleveland spamd[6644]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN34> line 49. Nov 10 13:08:11 cleveland spamd[5713]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024. Nov 11 07:01:06 cleveland spamd[10433]: Failed to run BAYES_40 SpamAssassin test, skipping:__( child processing timeout at /usr/bin/spamd line 1024, <GEN377> line 1315._) Nov 12 04:55:36 cleveland spamd[6407]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN2823> line 143. Nov 12 04:56:36 cleveland spamd[28635]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN3531> line 37. Nov 12 04:59:04 cleveland spamd[30330]: Failed to run BAYES_40 SpamAssassin test, skipping:__( child processing timeout at /usr/bin/spamd line 1024, <GEN3157> line 59._) Nov 12 04:59:40 cleveland spamd[32083]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024. Nov 12 04:59:58 cleveland spamd[3089]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024. Nov 12 05:02:09 cleveland spamd[30046]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN3> line 50. Nov 12 05:02:20 cleveland spamd[28635]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN3546> line 120. Nov 12 05:03:22 cleveland spamd[3089]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN3370> line 85. Nov 12 05:03:23 cleveland spamd[32083]: syslog() failed: child processing timeout at /usr/bin/spamd line 1024, <GEN3543> line 972. Nov 12 05:04:07 cleveland spamd[30433]: Failed to run RCVD_IN_XBL RBL SpamAssassin test, skipping:__( child processing timeout at /usr/bin/spamd line 1024, <GEN3> line 143._)
hmm. syslog() and DNS lookup failures? are you logging to localhost? where's your DNS server? are you having slowness for DNS lookups or network accesses to the log server?
We use the Data Center's Name Servers as the two main lookup servers, we use localhost as the last resort.
ok, I'm +1 for applying Dallas' patch. possibly upping the 120 sec default, though, at checkin... I think we're in agreement that we need this.
Excuse me? Yes, Dallas' did help some, but it hasn't fixed everything. Are you saying that you (Justin Mason) have some more to contribute to the patch?
btw Tom -- I'd strongly suggest setting up a caching nameserver on localhost. it'll greatly increase scanning speed; a localhost DNS access is much faster than one that hits the wire, even with an unloaded server (which you're not guaranteed this one is).
Tom: Dallas' patch helps fix part of the problem. your bugs are possibly something else, but it's important to get one part out of the way first. I would definitely suggest getting a local caching NS setup asap.
We currently have a Name Server (Bind) setup on the box. That is used to server as DNS for the web sites residing on the box.
> ok, I'm +1 for applying Dallas' patch. possibly > upping the 120 sec default, though, at checkin... i agreee, the --timeout-tcp can be relatively low, but --timeout-child probably needs to be 300 seconds to prevent killing auto-expires of bayes... i'm still waiting for another spamd child to hang.. i have full debugging on and producing 450MB tarball logs each day. :) i will get to the bottom of this soon.
Keep working Dallas, keep your debugging on. Any clues for me? Should I be using the Bind Server that is on the server itself to be querying lookups?
tom, i think your problem is your setting for --timeout-child. move it from 10 seconds (i believe you said you were running with this in one post) to 30 seconds. SA can take 10 seconds to process a message easy under load or slow network tests. see if that makes your child timeouts go away. 30 second timeout without running bayes should just just fine.
Looks good, but before I give a +1, I'd really like to see some type of performance test under normal conditions to verify that there's no degradation. Very needed option! :-)
yeah, a 10-second timeout with network tests and bayes on, and a DNS server on another host, is definitely going to be too slow and produce timeouts. I agree that 30s is more likely to be acceptable, and Tom, you should use that. But I *really* recommend using the local BIND server, or some kind of caching nameserver, on the same box; it's not only better for speed, it reduces network traffic and is generally a very good idea for efficiency. I'm not sure what you might wnat to do there. is it necessary to run the spamds on the same machine as user DNS servers? It would be much more efficient if you can run a spamd on its own machine, with its own caching DNS server for DNS lookups.
Ok, I have changed that. But, it doesn't help the situation that we are in right now. It seems that when SpamAssassin does die, it screws everything up. Mail doesn't flow right, IMAP begins to malfunction (unable to write to mail files), mail messages with file attachments bounce back with "SMTP 421: Message Abandoned" error messages. At this point, the only way to fix this is to reboot the entire box, and this is getting irritating to say the least. Have any clues?
it *should* certainly be possible ofr the rest of the mail system to carry on resiliently even with spamd dead. But that's something you need to ask the CPanel folks -- it's an issue on their end. (we will try to fix the bit about spamd dying/hanging in the first place though. ;)
I am getting sick of having to reboot this damn box everytime SpamD takes a shit. You would think that this would force everything to die, right? kill -9 `ps -aef | grep spam | grep -v grep | awk '{print $2}'` kill -9 `ps -aef | grep imap | grep -v grep | awk '{print $2}'` kill -9 `ps -aef | grep exim | grep -v grep | awk '{print $2}'` Well, it doesn't. It seems like the processes die, but "goast" processes remain. So, even if you restart those services, the mail issues still remain. This is so frustrating. :(
Hi, the patch and the PERL_SIGNALS=unsafe workaround don't work for us, because we are using Red Hat Enterprise Linux 3. It comes with perl 5.8.0. The PERL_SIGNALS environment variable is only honored for perl >= 5.8.1. According to the perl documentation and my own tests it may be preferable to set up the signal handlers differently. Let me quote perlipc: "Note that some networking library functions like gethostbyname() are known to have their own implementations of timeouts which may conflict with your timeouts. If you are having problems with such functions, you can try using the POSIX sigaction() function, which bypasses the Perl safe signals (note that this means subjecting yourself to possible memory corruption, as described above). Instead of setting $SIG{ALRM} try something like the following: use POSIX; sigaction SIGALRM, new POSIX::SigAction sub { die "alarm\n" } or die "Error setting SIGALRM handler: $!\n"; " END OF QUOTE
I just noticed (while running a spamd --help) that in my spamd patch above, the part that adds those 2 new timeout options to the man and --help, it lists --timeout-child as --timeout-timeout - make sure whoever applies that patch changes that on top of the default child timeout. as a side note, i just got a spamd alert from one of my boxes... child pid 2351 running for 7 min, 57 sec -> ALERT: spamd pid 2351 ran for 7 min, 57 sec and was killed! child pid 9321 running for 0 min, 02 sec child pid 9490 running for 0 min, 02 sec child pid 9708 running for 0 min, 00 sec child pid 9711 running for 0 min, 00 sec i'm going through the debug logs right now to figure out what it was doing.
okay, the problem child was 2351. here is its life history. 12:28:54 - pid 2351 spawned. 12:28:58 - first message 12:31:31 - last log entry by pid 2351 12:45:01 - last log entry by spamd parent for pid 2351 cleanup Here are the respective log entries for the 4 lines above. 2004-11-15 12:28:54.966903500 logmsg: [17464] server successfully spawned child process, pid 2351 2004-11-15 12:31:31.173103500 logmsg: [2351] identified spam (10.3/7.0) for $GLOBAL:0 in 1.1 seconds, 1332 bytes. 2004-11-15 12:31:31.173105500 logmsg: [2351] result: Y 10 - DCC_CHECK,HELO_DYNAMIC_DHCP,HELO_DYNAMIC_IPADDR,HTML_50_60,HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,INFO_TLD,LONGWORDS,MIME_HTML_ONLY,MISSING_DATE scantime=1.1,size=1332,mid=<XH5T31I8796S357q6w5Y65CE298X0W8GHE@dce.com>,autolearn=disabled 2004-11-15 12:45:01.061316500 logmsg: [17464] server hit by SIGCHLD 2004-11-15 12:45:01.061320500 logmsg: [17464] handled cleanup of child pid 2351 12:45:01 coincides with when my script sent a kill -9 to it, as it runs on a 5 minute cron. also corresponds with the email received header (in GMT).. Received: (qmail 9714 invoked by uid 0); 15 Nov 2004 18:45:01 -0000 so, to sum up... 2351 processed 23 messsages, start to finish... 8 spam, 15 ham in almost a 3 minute span.. 12:28:58 to 12:31:31. spamd is running with --max-conn-per-child=25 so i dont think it has anything to do with that. BEGINS [root@email spamd]# cat spamd.log2 | grep "\[2351\] config" | grep retrieving |wc -l 23 ENDS [root@email spamd]# cat spamd.log2 | grep "\[2351\] result" | wc -l 23 so the log doesnt indicate any of them never finishing, unless it hung right after message 23 was fully processed? I guess i dont know what all goes on after it sends the 'result' logmsg, it should just go back to read() for input of #24 right? The total life of pid 2351 was 6 minutes and 7 seconds, but ps showed 7 minutes and 57 seconds of CPU usage... during normal operation, the 25 max-conn-per-child will prevent a spamd child cpu utilization to under 30 seconds... so 7 minutes is very high. here is the message i received from cron... child pid 2351 running for 7 min, 57 sec -> ALERT: spamd pid 2351 ran for 7 min, 57 sec and was killed! child pid 9321 running for 0 min, 02 sec child pid 9490 running for 0 min, 02 sec child pid 9708 running for 0 min, 00 sec child pid 9711 running for 0 min, 00 sec here are the stops and starts on the child processes seen above. spamd.log:2004-11-15 12:28:54.966903500 logmsg: [17464] server successfully spawned child process, pid 2351 spamd.log:2004-11-15 12:45:01.061320500 logmsg: [17464] handled cleanup of child pid 2351 spamd.log:2004-11-15 12:44:20.793041500 logmsg: [17464] server successfully spawned child process, pid 9321 spamd.log:2004-11-15 12:46:47.325710500 logmsg: [17464] handled cleanup of child pid 9321 spamd.log:2004-11-15 12:44:39.292863500 logmsg: [17464] server successfully spawned child process, pid 9490 spamd.log:2004-11-15 12:47:20.977168500 logmsg: [17464] handled cleanup of child pid 9490 a pid 9708 is no where to be found? not sure what the deal is here?? maybe this is part of the problem, but i dont know how. its the only thing i can find that is strange here because if ps says its a spamd child, there should be something in the logs about it. [root@email spamd]# grep "pid 9708" * [root@email spamd]# spamdlog:2004-11-15 04:00:39.199207500 logmsg: [17464] server successfully spawned child process, pid 9711 spamdlog:2004-11-15 04:06:08.322575500 logmsg: [17464] handled cleanup of child pid 9711 so now i dont know what to do!! i guess i could add some debugging after the result logmsg and see what its doing.. but now i have to wait for it to happen again. fun fun.
Dallas, have any ideas why when I try to kill those processes, they stick around as zombies?
are the zombies created only from ones that you kill when they hang, or is a zombie created from any spamd child you kill?
> are the zombies created only from ones that you kill when they hang, or is a zombie created from any spamd child you kill? That is the thing, I am not sure where these zombies are occuring. Wait a second, I think I stumbled upon something here. SpamD, at least on our servers, runs in a way that is linked to Exim. So, when Exim receives a mail message, it locks the MailBox File until SpamAssassin is done scanning the mail message and passes it back to Exim. So, when a SpamAssassin process takes a shit and dies improperly, Exim doesn't know about it, thus the MailBox files remain locked, thus unable to be read or written to until the zombie is killed, aka reboot.
'Wait a second, I think I stumbled upon something here. SpamD, at least on our servers, runs in a way that is linked to Exim. So, when Exim receives a mail message, it locks the MailBox File until SpamAssassin is done scanning the mail message and passes it back to Exim. So, when a SpamAssassin process takes a shit and dies improperly, Exim doesn't know about it, thus the MailBox files remain locked, thus unable to be read or written to until the zombie is killed, aka reboot.' I'm not sure what this means. does this setup require that SA do something in particular before Exim will unlock the mailbox? is there a way to deal with these possible failure cases more resiliently? There's a myriad of ways for the SA scanning step to not happen or not complete, some due to bugs in SA, some due to resource shortages, some due to timeouts, and some due to sunspots or similar wierdness. We (obviously) can try to avoid that, but this is computer science, and nothing's perfect. I would suggest using a more resilient system that can cope with that; for example, if the SA process returns without a result, the glue code should treat the message as non-spam, deliver it, and unlock the mailbox. (This is what spamc does, for example.)
This time, I executed SpamAssassin with the --debug flag. I GREPed out the November 16th entries out of the /var/log/maillog file and am downloading the log files now. Hopefully, I will spot something in it that will shed some light on this situation.
Dallas Engelken and Justin Mason, I didn't see much of anything in these log files that sparked my alarm. It is locate at http://www.toms-world.org/privatedownloads/NOV16_spamd.zip I will be sending you an email mesage with a username and password to download this file. If you want to take a look at the files, go ahead.
27144 just looped. here are the final log msgs from it. nothing special. @40000000419b80df2ecb8f94 logmsg: [27144] calling mail->finish @40000000419b80df2ecc27ec debug: [27144] finish: called for Message.pm @40000000419b80df2eebfc5c debug: [27144] finish: completed for Message.pm @40000000419b815d245907e4 logmsg: [10975] handled cleanup of child pid 27144 looks like the parent tried to clean it up, but the child never went away. do you think this could be something to do with signals and running spamd under daemontools? i started up spamd from the command line with an strace no it to see if i can capture yet even more data..
i got a copy of the email that hung on child pid 22626 at 11:39 am... thanks to full debug and 1gig log files :) [root@email old]# ls -la *22626* -rw------- 1 root root 1385 Nov 17 11:39 spamassassin.22626.IQWf2F.tmp but as you can see, it scans fine [root@email old]# cat spamassassin.22626.IQWf2F.tmp | spamc -x 2>&1 >/dev/null [root@email old]# STATUS=$?; echo $STATUS 0 as does the orig-* email that was dumped to disk from the incoming smtp connection with is identical to the spamassassin tmp file. so, we can rule out a specially crafted message for causing this. (progress?) :) here is the debug log, with strace enabled, of that scan.... ------------------- child created.. 2004-11-17 11:39:01.466228500 brk(0x8574000) = 0x8574000 2004-11-17 11:39:01.466231500 logmsg: [20563] server successfully spawned child process, pid 22626 2004-11-17 11:39:01.466289500 brk(0x8575000) = 0x8575000 it took in 5 check requests.. [root@email spamd]# grep 22626 outfile | grep retrieving 2004-11-17 11:39:04.487069500 debug: [22626] config: retrieving prefs for $global from SQL server 2004-11-17 11:39:08.696062500 stat64("/usr/lib/perl5/site_perl/5.6.1/i386-linux/auto/Digest/SHA1/SHA1.bs", debug: [22626] config: retrieving prefs for $global from SQL server 2004-11-17 11:39:12.844853500 debug: [22626] config: retrieving prefs for $global from SQL server 2004-11-17 11:39:16.594829500 debug: [22626] config: retrieving prefs for $global from SQL server 2004-11-17 11:39:21.764975500 brk(0x82cd000debug: [22626] config: retrieving prefs for $global from SQL server it processed the first 4, and never finished on #5. check # 5 started at 11:39:21.764975500 it waited on a bayes lock for 7 seconds... finally syncing at 2004-11-17 11:39:28.485283500 stat64("../lib/Pod/Text.pmc", debug: [22626] bayes: synced databases from journal in 7 seconds: 15500 unique entries (19526 total entries) pyzor timed out in 3 seconds, that takes us to 11:39:32 2004-11-17 11:39:32.422429500 debug: [22626] pyzor: check timed out after 3 secs. uribldns lookups complete successfully 2004-11-17 11:39:32.479788500 debug: [22626] uridnsbl: queries completed: 4 started: 2 last log entry from check #5 at 2004-11-17 11:39:32.503748500 debug: [22626] auto-whitelist: sql-based connected to DBI:mysql:logs:localhost:3306 then, a couple minutes later, child destroyed (by my kill -9) 2004-11-17 11:41:41.995309500 logmsg: [20563] received child_handler CHLD 2004-11-17 11:41:41.995360500 logmsg: [20563] server hit by SIGCHLD 2004-11-17 11:41:41.995423500 logmsg: [20563] handled cleanup of child pid 22626 2004-11-17 11:41:41.995444500 logmsg: [20563] set SIGCHLD 2004-11-17 11:41:41.995486500 logmsg: [20563] Need to Spawn a new CHILD!!! 2004-11-17 11:41:41.999028500 logmsg: [20563] server successfully spawned child process, pid 29029 ------------------- so not sure what happended, but this one hung on AWL lookup... i'll trace another here shortly as they seem to be hanging quite frequently on this box. fyi, AWL works fine on all other messages... and there is nothing wrong with the mysql end of it. mysql> select count(*) from awl; +----------+ | count(*) | +----------+ | 673013 | +----------+ 1 row in set (0.00 sec)
Subject: Re: [review] spamd parent stops accepting requests On Wed, Nov 17, 2004 at 10:34:49AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > last log entry from check #5 at > 2004-11-17 11:39:32.503748500 debug: [22626] auto-whitelist: sql-based > connected to DBI:mysql:logs:localhost:3306 > Interesting, can you tell me what version of MySQL, DBI and DBD::mysql you are running. Anything in the strace from mysql itself? > > then, a couple minutes later, child destroyed (by my kill -9) > Curious, is kill -9 the only way to kill the child? Michael
[root@email spamd]# rpm -qa | grep mysql-server mysql-server-3.23.58-1.73 [root@email spamd]# rpm -qa | grep DBI perl-DBI-1.21-1 [root@email spamd]# rpm -qa | grep DBD perl-DBD-MySQL-1.2219-6 i didnt see much else in the strace that was meaningful. i have not tried anything other than kill -9 on the child, what would you like me to send to it next time? weird thing is, i was having a child hang about every 10 minutes this morning and been sitting here since 11:30am CST waiting for another one to hang, but nothing. its very strange.
hi Dallas - those strace lines don't match the debugs.... are you stracing the children? use -f to follow through forks. also kill -15 is safer than kill -9 for clearing up locks...
i was just using # strace spamd <params> i will add the -f in and try a kill -15 on the next one.
Dallas btw -- what's the CPU on that box? is it hyperthreaded?
yes sir.. its a big one. 2 cpu, 2gig ram. 3:16pm up 22:03, 1 user, load average: 2.30, 1.43, 1.33 157 processes: 153 sleeping, 3 running, 1 zombie, 0 stopped CPU0 states: 33.1% user, 9.3% system, 0.0% nice, 56.4% idle CPU1 states: 14.4% user, 4.4% system, 0.0% nice, 80.0% idle CPU2 states: 18.2% user, 7.3% system, 0.0% nice, 73.3% idle CPU3 states: 34.2% user, 6.2% system, 0.0% nice, 58.4% idle Mem: 2064860K av, 894704K used, 1170156K free, 0K shrd, 121320K buff Swap: 522104K av, 228K used, 521876K free 382836K cached the zombie is my strace through daemontools.. kinda kludgy but it works :) with strace -f i have to cronjob a custom logrotate ever 15 minutes.. lol
Created attachment 2520 [details] 3.0 Patch File Here is a patch against 3.0, it fixes the help typo and ups the default child timeout to 300 secs.
Please review the patch for 3.0, we won't be closing this ticket, because it is just a work around (and seems a good idea) and doesn't fix the actual bug.
+1 Please fix weird spacing around parens. :-)
Let me repeat my statement from comment #76: this patch does *not* work in all instances because of perl 5.8's safe signal handling. According to the docs it would be better to use sigaction SIGALRM instead.
So Sebastian Hagedorn, would this work? Change... $SIG{'ALRM'} = sub { die (" tcp timeout ") }; To... $SIG{'SIGALRM'} = sub { die (" tcp timeout ") }; Change... $SIG{'ALRM'} = sub { die (" child processing timeout ") }; To... $SIG{'SIGALRM'} = sub { die (" child processing timeout ") };
No, I've actually created a patch using SigAction that I'm currently testing. I'm not yet sure if it's actually working ;-) I haven't yet had any timeout messages. I will atach the patch anyway, just to show what I mean.
Yeah, please do so Sebastian Hagedorn, I would like to test it.
Created attachment 2526 [details] Patch using POSIX::SigAction This patch is basically the same, but uses POSIX::SigAction instead ...
Will this patch of yours be able to be applied directly to the SpamAssassin 3.0.1 source, or will I have to apply the original patch by Dallas Engelken then yours?
My patch replaces the previous patches, i.e. it can be applied to a fresh 3.0.1
I can't find the Perl Modules known as POSIX::SigAction. CPAN doesn't seem to know about it.
Should we be using this module or do the equivalent in our code? http://search.cpan.org/~lbaxter/Sys-SigAction/lib/Sys/SigAction.pm From that page: [after some discussion of the different things implemented in different version of perl] "Since sigaction() is not fully functional in perl versions less than 5.8, this module implements equivalent behavior using the standard %SIG array. The version checking and implementation of the 'right' code is handled by this module, so the user does not have to write perl version dependant code. The attrs hashref argument to set_sig_handler() is silently ignored, in perl versions less than 5.8. This module has been tested with perls as old as 5.005 on solaris. It is hoped that with the use of this module, your signal handling behavior can be coded in a way that does not change from one perl version to the next, and that it makes using sigaction() a little easier."
Sebastian Hagedorn, I applied your patch. I am crossing my fingers
Good catch, Sebastian! POSIX::sigaction will, indeed, solve two problems: - 1. deal with internal timeouts using SIGALRM overwriting our one - 2. always use "unsafe" signal handlers, which are essential in this case to interrupt all possible hangs, including regexp complexity ones. however Sys::SigAction does also note that in perl versions prior to 5.8, POSIX::sigaction "does not work correctly". So I agree with Sidney, we need to implement code to do the same thing using %SIG, in our codebase. in the latter case, #1 will be unavoidable. but that seems to have worked for Dallas anyway -- I would surmise because the rules that set alarms (namely DCC/Pyzor/Razor) are happening late enough that the error condition this catches, has already happened by that stage if it was going to happen. I think we need a new patch that does the same thing as Dallas' patch, but using functions that will switch between use of %SIG and POSIX::sigaction depending on $^V. btw, any chance someone with a CLA on file could do this?
Well, my POSIX patch doesn't improve things for me. Even though the unsafe signal handling should be used, no alarms go off. I have now used Sys::SigAction in the hope that it sets up the signal handler better. I have a patch for that, but will wait for my test results before posting it ... if this doesn't work either, I don't know what else to try.
here is another one today... child pid 9638 running for 5 min, 29 sec -> ALERT: spamd pid 9638 ran for 5 min, 29 sec and was killed! i have that script running a kill -9 just in case i cant get to the system at the time. hopefully i can catch one today and kill -15 it instead. 2004-11-19 12:09:32.391200500 debug: [9638] rules: running header regexp tests; score so far=8.273 2004-11-19 12:09:32.391202500 debug: [9638] auto-whitelist: sql-based connected to DBI:mysql:logs:localhost:3306 2004-11-19 12:09:32.724190500 <unfinished ...> 2004-11-19 12:09:32.724220500 [pid 11363] <... read resumed> 0x401f8000, 4096) = ? ERESTARTSYS (To be restarted) 2004-11-19 12:09:32.724240500 [pid 11363] --- SIGALRM (Alarm clock) --- again here, the last line was the sql connection for AWL. half a second later it pulls a sucessful AWL, so who knows.... 2004-11-19 12:09:32.806266500 debug: [11363] auto-whitelist: sql-based connected to DBI:mysql:logs:localhost:3306 2004-11-19 12:09:32.807102500 debug: [11363] auto-whitelist: sql-based get_addr_entry: found existing entry for fw@westiva.com|ip=8.6 2004-11-19 12:09:32.807149500 debug: [11363] auto-whitelist: sql-based fw@westiva.com|ip=8.6 scores 33/240.266 2004-11-19 12:09:32.807316500 debug: [11363] auto-whitelist: AWL active, pre-score: 8.273, autolearn score: 8.273, mean: 7.28078787878788, IP: 8.6.241.121 2004-11-19 12:09:32.808017500 debug: [11363] auto-whitelist: sql-based add_score: new count: 34, new totscore: 248.539 for fw@westiva.com|ip=8.6 2004-11-19 12:09:32.808254500 debug: [11363] auto-whitelist: sql-based finish: disconnected from DBI:mysql:logs:localhost:3306 2004-11-19 12:09:32.808405500 debug: [11363] auto-whitelist: post auto-whitelist score: 7.77689393939394 this is twice i have caught a spamd child that has hung where the last debug line it sent was the auto-whitelist sql connection. hopefully i'll see another one soon.
Subject: Re: [review] spamd parent stops accepting requests On Fri, Nov 19, 2004 at 10:56:40AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > this is twice i have caught a spamd child that has hung where the last debug > line it sent was the auto-whitelist sql connection. hopefully i'll see another > one soon. > I'm trying to remember, was there any indiciation (besides these latest two) that AWL was involved? Could be some sort of deadlock in MySQL. Is there ever a case when one spamd has hung that another takes a slightly longer (but not long enough to trigger the timeout/killer script) amount of time that usual? Michael
the last 2 hangs were on awl.. the others i'm not exactly sure because i didnt have full debugging enabled. i've been running an strace -f with full debug for the last 3 days and have seen the 3 hangs. the one from yesterday i jacked up the log file... i think the last line on it was something about bayes, but since i lost some of that data, i didnt want to trust that read. the weird thing is awl processed fine half a second after that child hung. there is not much in common here... the time that they hang is around 12pm (11:39 and 12:09 so far), but nothing else apparent runs then. mysqld shows no errors or problems that i can see. if i get 1 more that indicated awl, i'll shut that down and see if the problems go away. personally i dont think its awl, and i've taken a look at the code around that dbg and dont see anything that could case it either.
Justin, I think you might have hit the nail right on the head with this statement: > Dallas btw -- what's the CPU on that box? is it hyperthreaded? Tom Parkinson and I were chatting last night about some possibilities for this problem and I suggested that it might help if I assembled a timeline of the problem since we were not experiencing it from day 1. The timeline showed the problem appearing a few days after the installation of an SMP enabled kernel (hyperthreading) on the box and stopped after they removed it for a more stable non-SMP kernel. This went on fine until there was a HD crash on this server. At this point, I believe the SMP enabled kernel may be have been reinstalled and no one caught it. Ever since the system was reloaded from that HD crash, the SpamAssassin issue has been hot and heavy. Patches have slowed it down and reduced the effects, but it's definitely still there. I have just posted a new message in the cPanel forum thread on this issue (http://forums.cpanel.net/showthread.php?p=153528#post153528) asking everyone that is affected by this issue to post their CPU type, whether it is multithreaded or if they have multiple CPU's in their system, and if they are running a multithreaded/multiple CPU aware kernel. I'll ask the same questions here: 1. What type of CPU are you running? 2. Is it multi-threaded or do you have multiple CPU's? 3. Are you running a kernel that is using that feature? My theory is that most everyone that is experiencing this problem will answer yes to at least 1 and 2, but we'll see....
Subject: Re: [review] spamd parent stops accepting requests On Sat, Nov 20, 2004 at 05:23:13AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > 1. What type of CPU are you running? P4 with Hyperthreading > 2. Is it multi-threaded or do you have multiple CPU's? Yes, hyperthreaded > 3. Are you running a kernel that is using that feature? Yes, Linux hoth 2.6.8-24-smp #1 SMP Wed Oct 6 09:16:23 UTC 2004 i686 i686 i386 GNU/Linux > My theory is that most everyone that is experiencing this problem will answer > yes to at least 1 and 2, but we'll see.... I don't fit in this group and I will say that I manage to push this machine VERY hard, 300 msgs/sec at times. Michael
I went off and read the CPanel forum discussion. The more I hear/read from the CPanel folks having a problem, the more I'm convinced that their problems are unrelated. There seems to be no common denominator amoungst those folks, different versions (including 2.6x versions), different setups, bayes off and on, awl off and on, different calling methods. I'm not saying there isn't a problem there, but I don't believe they are related. I suggest opening up a seperate bug to handle the issue CPanel folks are having.
+1 on the idea of splitting off the cpanel discussion -- I agree it sounds like different issues. we should at least keep this bug open, solely to work on the timeout patch.
I have seen a few instances where spamd spawned 100's of child processes until the system (SGI Irix 6.5.22) wasn't able to create new processes. This tends to happen when I run fetchmail to bring in a batch of messages from an address that is in the ghostscript distro (e.g., attracts lots of spam and virus mails). While this is happening DNS queries (made other machines) often fail, so my feeling is that a batch of spam/virus emails will cause a lot of queries for "new" sites (e.g., not in some cache) and bog down the DNS. I use procmail, and I'm not entirely sure where DNS queries occur. Maybe there should be separate queues for messages with headers that have addresses not in the cache so these can be fed thru at a rate that doesn't result in DOS.
another hung child this morning... first one in several days. child pid 29538 running for 6 min, 48 sec -> ALERT: spamd pid 29538 ran for 6 min, 48 sec and was killed! child pid 3662 running for 0 min, 01 sec child pid 3722 running for 0 min, 01 sec child pid 3732 running for 0 min, 01 sec child pid 3764 running for 0 min, 01 sec child pid 3878 running for 0 min, 02 sec child pid 4271 running for 0 min, 01 sec child pid 4435 running for 0 min, 00 sec last few lines of debug from that child... (mostly ones i added to figure this out) 2004-11-23 07:04:13.982902500 logmsg: [29538] identified spam (15.6/5.0) for global:0 in 66.5 seconds, 1398 bytes. 2004-11-23 07:04:13.983016500 logmsg: [29538] result: Y 15 - BAYES_50,HTML_MESSAGE,RCVD_NUMERIC_HELO,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL scantime=66.5,size=1398,mid=<110121498458931012@localhost>,bayes=0.5035849521974,autolearn=no 2004-11-23 07:04:13.983041500 logmsg: [29538] calling status->finish 2004-11-23 07:04:13.983060500 debug: [29538] generic: call_plugins per_msg_finish 2004-11-23 07:04:13.983094500 debug: [29538] call_plugins: run callback on per_msg_finish 2004-11-23 07:04:13.983118500 debug: [29538] generic: callback per_msg_finish started 2004-11-23 07:04:13.983139500 debug: [29538] generic: lets step thorugh the cbpairs 2004-11-23 07:04:13.983160500 debug: [29538] generic: returning 2004-11-23 07:04:13.983180500 debug: [29538] generic: call_plugins per_msg_finish complete 2004-11-23 07:04:13.984008500 debug: [29538] generic: done deleting self keys 2004-11-23 07:04:13.984049500 logmsg: [29538] calling mail->finish 2004-11-23 07:04:13.984082500 debug: [29538] finish: called for Message.pm 2004-11-23 07:04:13.984600500 debug: [29538] finish: completed for Message.pm 2004-11-23 07:04:13.984676500 logmsg: [29538] spamd: done with check, calling client->close 2004-11-23 07:04:13.984759500 logmsg: [29538] spamd: returning from accept_a_conn 2004-11-23 07:04:14.307196500 <unfinished ...> 2004-11-23 07:10:00.693383500 write(2, "logmsg: [19792] handled cleanup "..., 51logmsg: [19792] handled cleanup of child pid 29538 as you can see, it hung after completion of a scan, not awl this time. i think this is the 2nd or 3rd time i've seen this situation now. so maybe i need more debug after returning from accept_a_conn() to figure this out. really the only things that happen after accept_a_conn() are error checking, uid/gid switching, and config copying... right?
i just did some more work on my timeout patch and was wanting a little feedback. basically what i am doing is passing the timeout_child to Message->new() in the $opts and wrote a function in there to calculate run time. then, any time an function completes that had an alarm() on it, i replace the alarm(0) with alarm($self->{msg}->_get_runtime()); so instead of setting alarm(0) which will override the alarm that wraps the check() funciton in /usr/bin/spam, it will set the alarm to ($timeout_child - $runtime), hence keeping the alarm that wraps the check() functional and accurate. as far as i can tell, there are only a handle of alarmed functions, pyzor, razor, dcc, and a couple alarms in Dns.pm. does anyone see anything wrong with this idea before i clean this thing up and do some testing on it. without doing this, a check() is only alarmed to $timeout_child until it hits an alarmed function that sets alarm back to 0. this should prevent that. thanks
Subject: Re: [review] spamd parent stops accepting requests On Tue, Nov 23, 2004 at 02:00:52PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > basically what i am doing is passing the timeout_child to Message->new() in the > $opts and wrote a function in there to calculate run time. then, any time an > function completes that had an alarm() on it, i replace the alarm(0) with > alarm($self->{msg}->_get_runtime()); -1 Message() should have nothing to do with timeouts.
wow, my fingers are frozen from cleaning out the gutters and hanging xmas lights... funciton = function, handle = handful, and /usr/bin/spam = /usr/bin/spamd haha.. proofread? never :)
Subject: Re: [review] spamd parent stops accepting requests On Tue, Nov 23, 2004 at 05:03:09PM -0500, Theo Van Dinter wrote: > Message() should have nothing to do with timeouts. Err, that's M::SA::Message, of course. <G>
okay, help me understand the flow here in spamd, this is what i see.. 1) spamd M::SA->new 2) spamd spawn children 3) spamd accept_a_conn() 4) spamd check() 5) M::SA::parse() (new Message) 6) M::SA::check() (new PerMsgStatus) 7) M::SA::PMS::check() 8) do all the tests so, the optimal place to put a runtime calculator would be where? start_time on that particular scan needs to be passed to either Message.pm or PerMsgStatus.pm for storage to refer to it before the PMS::check() is ran. it cant be stored in $self->{main} unless it was something like $self->{main}->{pid_of_child}->{start_time} and let that child update the start_time everytime it starts a new message. However, i think it would be better if it were stored lower, in something that was destroyed after the check was complete. all placement considerations aside, do you think this is a reasonalbe idea to set the alarm time back to "timeout_child - runtime" instead of zeroing it out?
would this method be better.. in spamd my $status = $spamtest->check($mail); becomes my $status = $spamtest->check($mail,time()); and then in M::SA::check() actually pass the start_time into the PerMsgStatus $opts? PerMsgStatus::_get_runtime() could compare $self->{start_time} against $self->{main}->{timeout_child} which is set via M::SA->new $opts. if that was the case, how would i call _get_runtime() in M::SA::PMS? is there something similar like $self->{main}, $self->{msg}, or $self->{conf} to refer to PerMsgStatus? or do you just have to make a full call to it M::SA::PerMsgStatus::_get_runtime() from within Razor2.pm, SPF.pm, and other places that might need to use it?
'it cant be stored in $self->{main} unless it was something like $self->{main}->{pid_of_child}->{start_time}' actually, no, it's fine to do that -- each M:SA object is single-threaded. but, I've just noticed something in perldoc -f alarm: Only one timer may be counting at once. Each call disables the previous timer, and an argument of 0 may be supplied to cancel the previous timer without starting a new one. The returned value is the amount of time remaining on the previous timer. so IMO it should be simpler to just fix the helper-app timeout code to save any previous timers, and restore them afterwards instead of using "alarm 0", like so: my $oldalarm = alarm($timeout); ....timeout-sensitive code... alarm $oldalarm; #timeout set back to what it was
Alarm::Queued and Alarm::Concurrent are two classes to support multiple alarms, in case the code you would need to add is complex enough to justify another dependency.
justin okay, but we still run into a similar deal here. the alarm that encompasses check() in spamd is what we would want to store as $oldalarm. and in order to restore $oldalarm we would have to get it passed into PerMsgStatus, since $oldalarm would be set per-child, per-check. 'actually, no, it's fine to do that -- each M:SA object is single-threaded.' but the spamd daemon only creates one instance of M::SA->new(), so storing start_time or $oldalarm or anything would not be child (per-check) specific, since M::SA->new() is not called per accept_a_conn(). or maybe i'm missing something here? The only things i see called per-check is M::SA::Message->new() and M::SA::PerMsgStatus->new(), so start_time or $oldalarm would need to be passed in there somehow to have access to it. or i guess instead of alarming spamd with timeout_child, it could be moved down to M::SA::check() instead. This would help people that dont use spamd take advantage of running command line spamassassin with a --timeout-child also. d d
okay, after adding additional debug at the cleanup stages, it looks like it hangs here on copy_config(). and that makes sense because the alarm on check() is shut off at that point, so the child spins forever chewing up CPU time. i added the following logmsg's, so it should have said "lets undef current_user" immediately after it finished copy_config(). logmsg("calling copy_config"); $spamtest->copy_config(\%conf_backup, undef) || die "error returned from copy_config, no Storable module?\n"; } logmsg("lets undef current_user"); undef $current_user; here is the debug showing where it stopped. 2004-11-24 06:00:46.479088500 debug: [21383] generic: returning 2004-11-24 06:00:46.479090500 debug: [21383] generic: done with call_plugin check_end 2004-11-24 06:00:46.483242500 logmsg: [21383] clean message (-0.8/5.0) for $global:0 in 4.5 seconds, 1521 bytes. 2004-11-24 06:00:46.483341500 logmsg: [21383] result: . 0 - AWL,BAYES_05,HTML_10_20,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,NO_REAL_NAME scantime=4.5,size=1521,mid=<13200493.1101297640500.JavaMail.webspher@ccwest-obpapp0>,bayes=0.015070754421061,autolearn=no 2004-11-24 06:00:46.483364500 logmsg: [21383] calling status->finish 2004-11-24 06:00:46.483384500 debug: [21383] generic: call_plugins per_msg_finish 2004-11-24 06:00:46.483404500 debug: [21383] call_plugins: run callback on per_msg_finish 2004-11-24 06:00:46.483424500 debug: [21383] generic: callback per_msg_finish started 2004-11-24 06:00:46.483426500 debug: [21383] generic: lets step thorugh the cbpairs 2004-11-24 06:00:46.483445500 debug: [21383] generic: returning 2004-11-24 06:00:46.483464500 debug: [21383] generic: call_plugins per_msg_finish complete 2004-11-24 06:00:46.484231500 debug: [21383] generic: done deleting self keys 2004-11-24 06:00:46.484253500 logmsg: [21383] calling mail->finish 2004-11-24 06:00:46.484288500 debug: [21383] finish: called for Message.pm 2004-11-24 06:00:46.484602500 debug: [21383] finish: completed for Message.pm 2004-11-24 06:00:46.484708500 logmsg: [21383] spamd: done with check, calling client->close 2004-11-24 06:00:46.484827500 logmsg: [21383] spamd: returning from accept_a_conn 2004-11-24 06:00:46.484876500 logmsg: [21383] returned, evalret is 1 2004-11-24 06:00:46.484897500 logmsg: [21383] do we need to switch uid/gid back? 2004-11-24 06:00:46.484918500 logmsg: [21383] lets restore the config 2004-11-24 06:00:46.484959500 logmsg: [21383] while config in backup, username root 2004-11-24 06:00:46.485000500 logmsg: [21383] while config in backup, learn_to_journal 1 2004-11-24 06:00:46.485019500 logmsg: [21383] calling copy_config 2004-11-24 06:05:01.262876500 logmsg: [8814] received child_handler CHLD 2004-11-24 06:05:01.262949500 logmsg: [8814] server hit by SIGCHLD 2004-11-24 06:05:01.263009500 logmsg: [8814] handled cleanup of child pid 21383 2004-11-24 06:05:01.263035500 logmsg: [8814] set SIGCHLD 2004-11-24 06:05:01.263108500 logmsg: [8814] Need to Spawn a new CHILD!!! 2004-11-24 06:05:01.268627500 logmsg: [8814] server successfully spawned child process, pid 25546 so i guess now i get to dbg out sub copy_config. :)
Subject: Re: [review] spamd parent stops accepting requests > > so i guess now i get to dbg out sub copy_config. :) > What version of Storable are you running? Michael
[root@dev perlmod]# ls -la Storable-2.08.tar.gz -rw-r--r-- 1 dallase dallase 97219 Sep 5 2003 Storable-2.08.tar.gz [root@mailgw root]# grep VERSION /usr/lib/perl5/5.6.1/i386-linux/Storable.pm use vars qw($canonical $forgive_me $VERSION); $VERSION = '2.08'; d
Subject: Re: [review] spamd parent stops accepting requests -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > okay, but we still run into a similar deal here. the alarm that > encompasses check() in spamd is what we would want to store as > $oldalarm. and in order to restore $oldalarm we would have to get it > passed into PerMsgStatus, since $oldalarm would be set per-child, > per-check. No -- it doesn't need to be passed anywhere. The key point: the alarm doco indicates that when you call alarm() again, it'll tell you how much time is left -- so *that value* can be used to reinstate the alarm again. no other var is necessary. I'll go into more detail with some code snippets. in accept_a_conn(): eval { $SIG{'ALRM'} = sub { die (" child processing timeout ") }; alarm $timeout_child if ($timeout_child); check( $1, $2, $start, $remote_hostname, $remote_hostaddr ); }; alarm 0; (same as normal). in dcc_lookup(), and other fns that use alarm() themselves: + my $oldalarm; eval { local $SIG{ALRM} = sub { die "__alarm__\n" }; local $SIG{PIPE} = sub { die "__brokenpipe__\n" }; + $oldalarm = alarm($timeout); # Note: not really tainted, these both come from system conf file. my $path = Mail::SpamAssassin::Util::untaint_file_path ($self->{conf}->{dcc_path}); [... dcc checking] dbg("dcc: got response: $response"); + alarm $oldalarm; $self->cleanup_kids($pid); }; + alarm $oldalarm; so IOW the setting of $timeout_child doesn't have go beyond spamd; the alarm() API allows us to take care of it everywhere else, because alarm() itself will return the details we need. Note that if no alarm is already set, ie. if timeout-child wasn't used, "$oldalarm = alarm($timeout);" will still do the right thing as $oldalarm will be 0. anyway, just to deal with the other q's: > 'actually, no, it's fine to do that -- each M:SA object is > single-threaded.' > > but the spamd daemon only creates one instance of M::SA->new(), so > storing start_time or $oldalarm or anything would not be child > (per-check) specific, since M::SA->new() is not called per > accept_a_conn(). or maybe i'm missing something here? The only > things i see called per-check is M::SA::Message->new() and > M::SA::PerMsgStatus->new(), so start_time or $oldalarm would need to be > passed in there somehow to have access to it. you can just add a var on the M:Sa object before check() starts -- $spamtest->{some_var} = "blah"; M:Sa objects are just hashes. > or i guess instead of alarming spamd with timeout_child, it could be > moved down to M::SA::check() instead. This would help people that dont > use spamd take advantage of running command line spamassassin with a > --timeout-child also. I think the spamd alarm is cleaner. see above. - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFBpM0aMJF5cimLx9ARAsKHAJ0a6Ae7EtZmmseAXVsihztgCPOmxQCgrBuy va6EKlAixASEUXQ8/9IdpOQ= =qwXu -----END PGP SIGNATURE-----
Why hasn't there been any activity as of the last week?
Subject: Re: [review] spamd parent stops accepting requests On Fri, Dec 03, 2004 at 12:32:34PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > ------- Additional Comments From tom@rochen.com 2004-12-03 12:32 ------- > Why hasn't there been any activity as of the last week? Two things: 1) It looks like there might be a better code related solution to the whole timeout issue, so now we need a new patch. 2) I firmly believe that the problems you describe have nothing to do with this particular bug. I suggested that you open another bug to track that issue. Michael
> Why hasn't there been any activity as of the last week? Because nobody has worked on it in the last week.
alright -- I've put in a bit of legwork on this. here's a test script that exercises an evil regexp exponential-time attack, using a particularly awful regexp and piece of data, and then attempts to use various forms of signal handler to deal with it. as far as I know, it should now work on all versions of perl, with and without PERL_SIGNALS=unsafe -- in other words, it works around the issue entirely! but if you have really odd versions of perl (ie. != 5.6.1 or 5.8.4), please run it and, if the output differs from this: using sigaction [or "using SIG"] entering re match eval caught: got sigalrm successfully done re match in 2 secs then post the output. #!/usr/bin/perl -w use strict; trap_sigalrm (\&got_alrm); alarm 2; my $start = time; print "entering re match\n"; my $text = (("o" x 9999) . "x") x 999; eval { $text =~ /o*xo*y/; }; if ($@) { warn "eval caught: $@"; } alarm 0; my $end = time; print "done re match in ".($end - $start)." secs\n"; exit; sub got_alrm { die "got sigalrm successfully\n"; } sub trap_sigalrm { my ($handler) = @_; if ($^V lt v5.8.0) { print "using SIG\n"; $SIG{ALRM} = $handler; } else { print "using sigaction\n"; use POSIX qw(); POSIX::sigaction POSIX::SIGALRM(), new POSIX::SigAction $handler; } }
Created attachment 2553 [details] patch using sigaction and/or %SIG as appropriate ok, this patch: - uses POSIX::sigaction on perl > 5.8.0, and %SIG on earlier perls, so covers all versions - handles saving "higher-level" alarms in the lesser alarms, internally, so the higher-level alarm keeps running after they're completed - fixes some minor things; e.g. all the stuff that looks at $@ should really save a copy of that before calling other functions, in case future perl versions change the semantics of $@ to be more like errno
Subject: Re: [review] spamd parent stops accepting requests > ------- Additional Comments From jm@jmason.org 2004-12-07 18:03 > but if you have really odd versions of perl (ie. != 5.6.1 or 5.8.4), > please run it and, if the output differs from this: Not really odd, but 5.8.0, 5.8.3 and 5.8.5 (all on RedHat 9) work.
Subject: Re: [review] spamd parent stops accepting requests On Tue, Dec 07, 2004 at 06:03:35PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > http://bugzilla.spamassassin.org/show_bug.cgi?id=3828 perl 5.005_03, FreeBSD 4.6-STABLE, doesn't appear to have $^V defined. But other than that and some missing quotes to make strict happy, using SIG entering re match eval caught: got sigalrm successfully done re match in 2 secs perl 5.8.2, OpenBSD 3.5 using sigaction entering re match eval caught: got sigalrm successfully done re match in 2 secs
perl 5.6.1 on redhat 7.3 works. using SIG entering re match eval caught: got sigalrm successfully done re match in 2 secs
perl 5.8.5 on Cygwin works. Win32 perl doesn't do signals (and so doesn't do spamd anyway).
just had a child hang on another box.. took up 7+ min of cpu before it was killed. here is the last 2 dbg mesgs from the log for pid 29170. 2004-12-08 12:31:52.653883500 debug: [29170] check: subtests=__COMMENT_EXISTS,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__HAS_MSGID,__HAS_SUBJECT,__MIME_HTML,__MIME_QP,__MIME_VERSION,__MSGID_OK_DIGITS,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML 2004-12-08 12:31:52.653926500 debug: [29170] call_plugins: run callback on check_end i added the 'run callback' dbg message in SpamAssassin.pm when i was trying to get deeper into the code to find where it was hanging. the common thing i find among all children that hang like this is that it appears to happen in PluginHandler::callback()... i'm going to dbg that function out to see if that is where its hanging up. it has to be somewhere within the foreach loops or the call to &$methodref i would guess. foreach my $plugin (@{$self->{plugins}}) { foreach my $cbpair (@{$self->{cbs}->{$subname}}) { $ret = &$methodref ($plugin, @_); if it made it through those, it should have gave me a logmsg() from spamd. a normal message process currently debugs for me like this... 2004-12-08 12:29:37.523345500 debug: [29170] check: subtests=__COMMENT_EXISTS,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__HAS_MSGID,__HAS_SUBJECT,__MIME_BASE64,__MIME_HTML,__MIME_VERSION,__MSGID_OK_DIGITS,__MSGID_OK_HOST,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_CENTER,__TAG_EXISTS_HTML 2004-12-08 12:29:37.523390500 debug: [29170] call_plugins: run callback on check_end 2004-12-08 12:29:37.523392500 logmsg: [29170] clean message (1.7/7.0) for nuthut@localhost:0 in 3.0 seconds, 18052 bytes. 2004-12-08 12:29:37.523395500 logmsg: [29170] result: . 1 - AWL,DCC_CHECK,HTML_IMAGE_RATIO_04,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,MIME_BASE64_TEXT,NO_REAL_NAME scantime=3.0,size=18052,mid=<110253057458929194@localhost>,autolearn=disabled 2004-12-08 12:29:37.523438500 logmsg: [29170] calling status->finish 2004-12-08 12:29:37.523440500 debug: [29170] generic: call_plugins per_msg_finish 2004-12-08 12:29:37.523442500 debug: [29170] call_plugins: run callback on per_msg_finish 2004-12-08 12:29:37.523461500 debug: [29170] generic: call_plugins per_msg_finish complete 2004-12-08 12:29:37.523463500 debug: [29170] generic: done deleting self keys 2004-12-08 12:29:37.523465500 logmsg: [29170] calling mail->finish 2004-12-08 12:29:37.523467500 debug: [29170] finish: called for Message.pm 2004-12-08 12:29:37.523469500 debug: [29170] finish: completed for Message.pm so after a check_end, it should give me the result and start cleaning up. man i wish i could reproduce this and not wait for it to happen :)
Subject: Re: [review] spamd parent stops accepting requests On Wed, Dec 08, 2004 at 11:10:51AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > 2004-12-08 12:31:52.653926500 debug: [29170] call_plugins: run callback on check_end Hrm. Just to check, are you using any plugins and do check_end? None of the standard ones do.
nah, its pretty much stock, even the rules. init.pre has the basic ones loadplugin Mail::SpamAssassin::Plugin::URIDNSBL loadplugin Mail::SpamAssassin::Plugin::Hashcash loadplugin Mail::SpamAssassin::Plugin::SPF loadplugin Mail::SpamAssassin::Plugin::Razor2
PMS::check() calls a 'check_end', line 252 in v3.0.1 $self->{main}->call_plugins ("check_end", { permsgstatus => $self });
Subject: Re: [review] spamd parent stops accepting requests On Wed, Dec 08, 2004 at 11:27:57AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > PMS::check() calls a 'check_end', line 252 in v3.0.1 > $self->{main}->call_plugins ("check_end", { permsgstatus => $self }); Yeah, it gets called, I was just wondering if the problem was a plugin's check_end function was getting locked up for some reason. If there are no plugins with that function, in theory, the problem is actually in the SA call_plugins() area.
there's quite a bit of code between check_end and that next logmsg. Notably this: # Build the message to send back and measure it my $msg_resp = $status->rewrite_mail(); my $msg_resp_length = length($msg_resp); if ( $version >= 1.3 ) # Spamc protocol 1.3 means multi hdrs are OK { syswrite( $client, "SPAMD/1.1 $resphash{$resp} $resp\r\n" . "Content-length: $msg_resp_length\r\n" . $spamhdr . "\r\n\r\n" . $msg_resp ); } elsif ( $version >= 1.2 ) # Spamc protocol 1.2 means it accepts content-length { syswrite( $client, "SPAMD/1.1 $resphash{$resp} $resp\r\n" . "Content-length: $msg_resp_length\r\n\r\n" . $msg_resp ); } else # Earlier than 1.2 didn't accept content-length { syswrite( $client, "SPAMD/1.0 $resphash{$resp} $resp\r\n" . $msg_resp ); } IOW, syswrite() calls to send data to the client. I think that's more likely, since it involves socket I/O.
k.. so i'll add some more logmsg() calls in between check_end and the next logmsg so we can pinpoint this. next time i happens we should be really close.
Created attachment 2558 [details] new patch using $] instead of $^V a new version of patch 2553 using $] to support 5.005 (not that we do anyway, but it's easy enough ;)
/me smells code duplication here :~) But this can be refactored in trunk, the patch itself looks good from the first glance, I'll apply it to my local spamd and keep that one running with the patch.
root@edinburgh [~/spamassassin/Mail-SpamAssassin-3.0.1]# patch -p0<spamd.patch patching file lib/Mail/SpamAssassin/Dns.pm Hunk #1 succeeded at 668 with fuzz 1 (offset 188 lines). Hunk #2 FAILED at 711. Hunk #4 FAILED at 643. Hunk #5 succeeded at 929 (offset 188 lines). Hunk #6 FAILED at 965. 3 out of 6 hunks FAILED -- saving rejects to file lib/Mail/SpamAssassin/Dns.pm.rej patching file lib/Mail/SpamAssassin/Reporter.pm Hunk #1 succeeded at 283 (offset 133 lines). Hunk #2 FAILED at 305. Hunk #4 FAILED at 228. 2 out of 4 hunks FAILED -- saving rejects to file lib/Mail/SpamAssassin/Reporter.pm.rej patching file lib/Mail/SpamAssassin/Locker/Flock.pm Hunk #1 succeeded at 63 with fuzz 1 (offset -4 lines). Hunk #2 succeeded at 76 with fuzz 2. Hunk #3 FAILED at 97. 1 out of 3 hunks FAILED -- saving rejects to file lib/Mail/SpamAssassin/Locker/Flock.pm.rej patching file lib/Mail/SpamAssassin/Util.pm Hunk #1 succeeded at 1089 with fuzz 2 (offset -40 lines). patching file lib/Mail/SpamAssassin/Plugin/SPF.pm Hunk #1 FAILED at 222. 1 out of 1 hunk FAILED -- saving rejects to file lib/Mail/SpamAssassin/Plugin/SPF.pm.rej can't find file to patch at input line 362 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |Index: lib/Mail/SpamAssassin/Plugin/Razor2.pm |=================================================================== |--- lib/Mail/SpamAssassin/Plugin/Razor2.pm (revision 111184) |+++ lib/Mail/SpamAssassin/Plugin/Razor2.pm (working copy) -------------------------- File to patch: Skip this patch? [y] Skipping patch. 6 out of 6 hunks ignored patching file spamd/spamd.raw Hunk #1 succeeded at 79 with fuzz 1 (offset -3 lines). Hunk #2 succeeded at 180 (offset -10 lines). Hunk #3 succeeded at 508 (offset -10 lines). Hunk #4 succeeded at 527 with fuzz 2 (offset -10 lines). Hunk #5 FAILED at 559. Hunk #6 FAILED at 707. Hunk #7 succeeded at 983 (offset -81 lines). Hunk #8 succeeded at 1095 (offset -10 lines). Hunk #9 succeeded at 1392 (offset -84 lines). Hunk #10 succeeded at 2364 with fuzz 2 (offset -61 lines). 2 out of 10 hunks FAILED -- saving rejects to file spamd/spamd.raw.rej What is going on?
Subject: Re: [review] spamd parent stops accepting requests On Fri, Dec 10, 2004 at 01:11:19PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > |Index: lib/Mail/SpamAssassin/Plugin/Razor2.pm > > What is going on? You're applying a patch meant for trunk (3.1.0) against 3.0.1, which won't work in this case.
Which "trunk" version do you recommend?
Subject: Re: [review] spamd parent stops accepting requests On Mon, Dec 13, 2004 at 08:32:23AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Which "trunk" version do you recommend? There is only 1 trunk. :)
HUH! There are four trunk versions available for downloading... http://cvs.apache.org/snapshots/spamassassin/ "Nightly snapshots of the latest SVN trunk code can be downloaded from cvs.apache.org. These are built nightly from the SVN repository, and may be unstable!" There you go, "trunk" versions.
we really need user confirmation that the patch helps before we can mark this as [review], so taking it out of that state
well, i've finally debugged this thing completely down as far as i can go without jumping into Storable further. looks like it hangs on a Storable::dclone() call in SpamAssassin.pm.. not sure if it was the first dclone() or second in the elsif, i can debug that further if needed, but it'll be another few days before it happens again i believe. 2004-12-14 13:06:47.896133500 debug: [30830] generic: while k=bayes_sql_username and v= and i= in source 2004-12-14 13:06:47.896154500 debug: [30830] generic: while k=whitelist_to and v=HASH(0x937f5fc) and i=HASH in source 2004-12-14 13:06:47.896159500 debug: [30830] generic: calling dclone of v 2004-12-14 13:06:47.896248500 debug: [30830] generic: while k=hashcash_doublespend_path and v=__userstate__/hashcash_seen and i= in source 2004-12-14 13:06:47.896316500 debug: [30830] generic: while k=whitelist_from and v=HASH(0x9345588) and i=HASH in source 2004-12-14 13:06:47.896321500 debug: [30830] generic: calling dclone of v 2004-12-14 13:06:47.896424500 debug: [30830] generic: while k=pyzor_timeout and v=3 and i= in source 2004-12-14 13:06:47.896448500 debug: [30830] generic: while k=hashcash_doublespend_file_mode and v=0700 and i= in source 2004-12-14 13:06:47.896469500 debug: [30830] generic: while k=lock_method and v= and i= in source 2004-12-14 13:06:47.896490500 debug: [30830] generic: while k=bayes_file_mode and v=0700 and i= in source 2004-12-14 13:06:47.896511500 debug: [30830] generic: while k=user_scores_ldap_password and v= and i= in source 2004-12-14 13:06:47.896515500 debug: [30830] generic: while k=pyzor_max and v=5 and i= in source 2004-12-14 13:06:47.896535500 debug: [30830] generic: while k=tflags and v=HASH(0x9346be4) and i=HASH in source 2004-12-14 13:06:47.896558500 debug: [30830] generic: calling dclone 2004-12-14 13:15:00.400402500 logmsg: [1222] handled cleanup of child pid 30830
okay, exact same loop point again... its happening in the dclone call while k=tflags. 2004-12-14 18:54:08.281807500 debug: [2650] generic: while k=whitelist_from and v=HASH(0x934ce00) and i=HASH in source 2004-12-14 18:54:08.281875500 debug: [2650] generic: while k=pyzor_timeout and v=3 and i= in source 2004-12-14 18:54:08.281899500 debug: [2650] generic: while k=hashcash_doublespend_file_mode and v=0700 and i= in source 2004-12-14 18:54:08.281902500 debug: [2650] generic: while k=lock_method and v= and i= in source 2004-12-14 18:54:08.281926500 debug: [2650] generic: while k=bayes_file_mode and v=0700 and i= in source 2004-12-14 18:54:08.281948500 debug: [2650] generic: while k=user_scores_ldap_password and v= and i= in source 2004-12-14 18:54:08.281951500 debug: [2650] generic: while k=pyzor_max and v=5 and i= in source 2004-12-14 18:54:08.282002500 debug: [2650] generic: while k=tflags and v=HASH(0x934e074) and i=HASH in source 2004-12-14 18:54:08.282002700 debug: [2650] generic: calling dclone in elsif 2004-12-14 19:00:01.453042500 logmsg: [5095] handled cleanup of child pid 2650 what can i do from here?
ok, that makes sense -- Storable is a recent addition to spamd's operation, and it's mostly (if not all) XS code. Dallas -- I don't have tuits right now, but could you try wrapping the spamd call to copy_config (around line 1000) in an alarm? it *may* be able to catch that. if SIGALRM goes off after 10 secs without copy_config() having completed, it's in trouble -- and the child can just exit early, and the master spamd will respawn a new one. that should work around it until we can eventually ditch the use of Storable.
What version of Storable do you have? Most likely the latest one, but I thought I would ask anyway.
someone didnt read the whole bug :) see comment #128
I looked for that, but there are too many comments! I have Storable-2.13 installed here. It might be worthwhile to try a newer version.
Created attachment 2566 [details] wrap copy_config() call in an alarm ok, here's a patch that does what I suggested. if the copy_config() call takes over 10 seconds (ie. it's hung), the child logs a message and exits immediately, so the master can respawn a new one.
Moving to 3.0.3 milestone. The bug is no longer in review status and we should get 3.0.2 out soon.
hmm... not sure I like that. I think we may have a fix here, and I'd like to get it in if it works... (I'd be fine with leaving it to .3, though, if it doesn't.)
215k calls to copy_config without one loop... that tells me Storable 2.13 is most likely the fix, as the alarm surrounding copy_config() has never triggered since. fyi, i had that function writing a log file on success/fails... [root@email spamd]# grep exit copy_config.log [root@email spamd]# wc -l < copy_config.log 215706 can other that experience this please post your Storable version, and then try updating if you are not on 2.13? leaving copy_config() alarm'd is probably a good thing just in case people dont have the latest storable running. or i guess 2.13 could be required in the Makefile? d
ok, marking this closed. the resolution: ensure the version of Storable is > 2.13.
Subject: Re: spamd parent stops accepting requests > ok, marking this closed. the resolution: ensure the version of Storable is > 2.13. I hope that Storable was part of the correct-version-checking patch... :-)
good point Loren ;) r125194. btw, when I said versions > 2.13, I mean versions >= 2.13. oops.
Storable $VERSION = '2.13' here and I'm still experiencing this problem. :(
Ok, I have another problem. I am running a "snapshot" build of SpamAssassin and it has solved a few issues, specifically, the mailbox locks and the SMTP 421: Message Abandoned error. We have been able to go several days without the issue. Unfortunately, there is a new problem. After a few days, SpamAssassin forgets how to filter mail. It is as if SpamAssassin no longer knows how to kill spam. No tests are done, nothing but the SpamAssassin headers are added saying that no tests were done to the mail message. Killed one bug, another one surfaced.
Tom, please open a separate bug for that issue -- it's important that issues are kept separate. (also, when you do so, be sure to attach a sample of what a problematic message looks like after scanning, with all headers. and note how exactly you're running SA -- through amavisd? MailScanner? from procmail directly? etc.)
I think I'm seeing the same problems as other people in this report. Every once in a while I find my mail-system in a state where netstat show a lot lines like tcp 755 0 localhost:spamd localhost:36353 CLOSE_WAIT - Also spamd doesn't except anymore new connections. Note that this kind of behaviour also happens with sockets, also with only local tests (-L). I tried with Storable-2.12/2.13 that doesn't help. The timeout patch does help, I haven't seen any blocks like this anymore (in the limited testing period that is).