3828 – spamd parent stops accepting requests

Bug 3828 - spamd parent stops accepting requests

Summary: spamd parent stops accepting requests

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	spamc/spamd (show other bugs)
Version:	3.0.0
Hardware:	Other other

Importance:	P5 major
Target Milestone:	3.0.3
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Duplicates (1):	3782 (view as bug list)
Depends on:
Blocks:

Reported:	2004-09-27 06:46 UTC by Dallas Engelken
Modified:	2005-03-16 20:58 UTC (History)
CC List:	9 users (show)

Attachment	Type	Actions	Submitter/CLA Status
Spam message possibly causing the problem	text/plain	None	Alessandro Ranellucci
Adds 2 new timeouts to spamd	patch	None	Dallas Engelken
3.0 Patch File	patch	None	Michael Parker
Patch using POSIX::SigAction	patch	None	Sebastian Hagedorn
patch using sigaction and/or %SIG as appropriate	patch	None	Justin Mason
new patch using $] instead of $^V	patch	None	Justin Mason
wrap copy_config() call in an alarm	patch	None	Justin Mason
Show Obsolete (4) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dallas Engelken 2004-09-27 06:46:10 UTC

I'm currently trying to track down another spamd problem(s) that seems to be 
pretty random.  Happens about once every day on a 4 box spamd cluster.   One of 
the boxes will just drop to 0 load and spamd will apparently be hung.  
restarting spamd makes it take right off again.   Its random in which box it 
occurs on.  

Also, over the weekend I caught another box on another network doing similar 
things.  I have some ps and strace info from each, not sure how much it will 
help until i can debug this out further.  There is nothing in the spamd log 
that indicates it has stopped.  Weird thing I always see when it happens is the 
port 783/tcp shows as filtered until i restart spamd.

Here is a little info.  First, a process listing, 15 children.  Parent is 17752.


[root@spamd2 root]# ps auxwww  | grep spamd
root      1217  0.0  0.0  1340   60 ?        S    Jun23   0:00 supervise spamd
root     17752  0.0  2.8 23652 21888 ?       S    Sep22   0:05 /usr/bin/perl -
T -w /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 
0.0.0.0 -q -x -m 15 --max-conn-per-child 25
root     28432  0.0  3.6 29472 27596 ?       S    Sep22   0:07 spamd child
root     28511  0.0  3.2 26916 25136 ?       S    Sep22   0:08 spamd child
root     28520  0.0  3.2 26252 24532 ?       S    Sep22   0:06 spamd child
root     28702  0.0  3.6 29756 27884 ?       S    Sep22   0:07 spamd child
root     28832  0.0  3.2 27016 25220 ?       S    Sep22   0:04 spamd child
root     28842  0.0  3.2 26892 25100 ?       S    Sep22   0:02 spamd child
root     28882  0.0  3.0 25116 23400 ?       S    Sep22   0:01 spamd child
root     26097  0.0  3.2 26780 24916 ?       S    04:11   0:06 spamd child
root     26142  0.0  3.1 25848 24052 ?       S    04:11   0:03 spamd child
root     26147  0.0  3.5 28636 26840 ?       S    04:11   0:07 spamd child
root     26148  0.0  3.2 26416 24540 ?       S    04:11   0:08 spamd child
root     26213  0.0  3.3 27600 25804 ?       S    04:12   0:04 spamd child
root     26238  0.0  3.1 26132 24284 ?       S    04:12   0:02 spamd child
root     26283  0.0  3.0 25292 23580 ?       S    04:12   0:02 spamd child
root     26308  0.0  3.1 26032 24300 ?       S    04:13   0:03 spamd child
root     28492  0.0  0.0  1764  592 pts/4    R    08:12   0:00 grep spamd

trying to echo|spamc never returns... stracing the parent shows it in pause() 
state.

[root@spamd2 root]# strace -p 17752
pause(

checking the tcp socket shows it is filtered?

[root@spamd2 root]# nmap -sT -p783 localhost
Interesting ports on localhost.localdomain (127.0.0.1):
Port       State       Service
783/tcp    filtered    hp-alarm-mgr

restarting spamd makes it available again...

[root@spamd2 root]# qmailctl restart
[root@spamd2 root]# nmap -sT -p783 localhost
Interesting ports on localhost.localdomain (127.0.0.1):
Port       State       Service
783/tcp    open        hp-alarm-mgr


-------------------------------------------

and another... although this looks a little different.  very long run times on 
spamd procs (looks like DoS again)?  

 11:55pm  up 175 days,  3:50,  1 user,  load average: 2.52, 2.34, 2.29
142 processes: 136 sleeping, 4 running, 2 zombie, 0 stopped
CPU states: 99.0% user,  0.9% system,  0.0% nice,  0.0% idle
Mem:   253876K av,  232444K used,   21432K free,       0K shrd,   40336K buff
Swap:  522104K av,  177936K used,  344168K free                   64336K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
31797 root      15   0 26480  196    52 R    46.9  0.0 16781m spamd
23746 root      14   0 26052   36     4 R    46.7  0.0 940:22 spamd

echo|spamc results in nothing.

[root@email root]# echo | spamc

process listing running with 2 childs.

[root@email root]# ps auxwww | grep spamd
root      8153  0.0  0.3 25024  904 ?        S    Aug26   0:49 /usr/bin/perl -
T -w /usr/bin/spamd --syslog-socket=none -q -x -m 4 --max-conn-per-child 25
root     10541  0.0  2.7 25384 6920 ?        S    20:03   0:00 spamd child
root     25227 16.1  8.1 27432 20664 ?       S    23:56   0:05 spamd child

strace the parent shows wait(), children show read().

[root@email root]# strace -p 8153
wait4(-1,

[root@email root]# strace -p 25227
read(6,  <unfinished ...>

[root@email root]# strace -p 10541
read(6,

nothing in the spamd debug log until spamd is restarted.

--------------------------------------------

hopefully i'll discover more as it happens again.

Comment 1 Theo Van Dinter 2004-09-27 09:30:29 UTC

Subject: Re:  New: spamd parent stops accepting requests

On Mon, Sep 27, 2004 at 06:46:11AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> trying to echo|spamc never returns... stracing the parent shows it in pause() 
> state.

the parent sits in a sleep() loop waiting for signals, such as sigchld and
such.  the children are what do all the work in 3.0.0.

Comment 2 Dallas Engelken 2004-09-28 06:14:26 UTC

any dev on right now that wants to take a look at this?  i have it pulled from
the DNS for a bit to try and figure out whats up.  it sure looks to be a problem
with the parent...

[root@spamd1 /]# telnet localhost 783
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
quit

this never responds and nothing logs until i restart spamd.

Comment 3 Dallas Engelken 2004-09-28 06:41:59 UTC

hmm.. looks like this may be the problem.  netstat reports 16 tcp sessions
established to 783/tcp, which would indicate all the children are busy right?  
however, there is no load, and nothing in the log ever showed max children reached.

[root@spamd1 log]# netstat -na | grep :783 | grep ESTABLISHED | wc -l
     16
[root@spamd1 log]# netstat -na | grep :783 | grep CLOSE_WAIT | wc -l
    123

so the parent wont accept any new connection because it thinks all the children
are busy, when they really arent?  ps shows they are all sleeping...

15515 ?        S      0:03 /usr/bin/perl -T -w /usr/bin/spamd
--syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 -q -x -m 15
--max-conn-per-child 75
 9076 ?        S      0:10 spamd child
 9127 ?        S      0:09 spamd child
 9152 ?        S      0:15 spamd child
28765 ?        S      0:20 spamd child
29097 ?        S      0:15 spamd child
29139 ?        S      0:12 spamd child
20022 ?        S      0:26 spamd child
20177 ?        S      0:23 spamd child
20191 ?        S      0:23 spamd child
20444 ?        S      0:17 spamd child
20611 ?        S      0:10 spamd child
20682 ?        S      0:08 spamd child
20696 ?        S      0:18 spamd child
20698 ?        S      0:07 spamd child
21294 ?        S      0:04 spamd child

strace shows they are in a read().

[root@spamd1 log]# strace -p 21294
read(6,

is something just not cleaning up properly?  here again is where a timeout on
spamd children would be a good thing.  the parent could sigterm the child if it
hasnt finished in x seconds.

Comment 4 Theo Van Dinter 2004-09-28 09:05:57 UTC

Subject: Re:  spamd parent stops accepting requests

On Tue, Sep 28, 2004 at 06:14:27AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> the DNS for a bit to try and figure out whats up.  it sure looks to be a problem
> with the parent...

The parent has nothing to do with the connection, period.  All the parent does
is spawn children when necessary.  The children are the ones which accept the
connection, read in the request, do the processing, etc.

> [root@spamd1 log]# strace -p 21294
> read(6,

what is fd 6?  do an lsof on the process.

Comment 5 Dallas Engelken 2004-09-28 09:21:44 UTC

err.. now i'm going to have to wait until it happens again.  seems to be 
popping up about every 2 hours or so.   i've come up with a lot more info in 
the meantime, maybe this will be more helpful?

spamd1 has gone away 3 times this morning..  everytime it does (and comes back 
usually 30+ minutes later), you see the 'bad protocol' dbg message in the log.

from 9:11 to 9:27 GMT it stopped.

@4000000041592ae93729be04 2004-09-28 09:11:59 [20696] i: connection from 
p35n0.ruraltel.net [24.225.0.35] at port 60169
@4000000041592e7105aea894 2004-09-28 09:27:03 [9152] i: bad protocol: header 
error: (closed before headers)

from 9:27 to 10:48 GMT it stopped.

@4000000041592e7f0cdbaa04 2004-09-28 09:27:17 [9152] i: connection from 
p36n0.ruraltel.net [24.225.0.36] at port 50510
@40000000415941a135605c2c 2004-09-28 10:48:55 [9152] i: bad protocol: header 
error: (closed before headers)

from 10:49 to 12:26 GMT it stopped.

@40000000415941ac2dd64e6c 2004-09-28 10:49:06 [9152] i: connection from 
p35n0.ruraltel.net [24.225.0.35] at port 59890
@40000000415958941293ed94 2004-09-28 12:26:50 [20696] i: bad protocol: header 
error: (closed before headers)

when it was down this last time, i decided to kill -9 15515 a.k.a the 
last 'spamd child' on my ps listing.

once i did that, approx 130 email's were processed by the spamd children.  The 
weird thing is, this box was completely out of the DNS and would not be 
accepting 'new' connections.  all those message had to be sitting in the tcp 
sockets waiting to be processed, and once the childs became available, they 
flooded through?

also saw a warn() and a dbg about a dead tcp socket, after i killed that one 
spamd child.

@40000000415981e41bfd1f7c Use of uninitialized value in subroutine entry 
at /usr/lib/perl5/5.6.1/i386-linux/Socket.pm line 312.
@40000000415981e41c00924c 2004-09-28 15:23:06 [24774] i: error: Bad arg length 
for Socket::unpack_sockaddr_in, length is 0, should be 16 
at /usr/lib/perl5/5.6.1/i386-linux/Socket.pm line 312._ Transport endpoint is 
not connected, continuing

so the 'bad protocol: header error' seems to be the constant factor here of 
when it comes back online by itself.  question is what happens to the 
previous 'connection from' that causes everything else to stop...

from spamd4 yesterday..

i: result: Y  5 - 
BAYES_20,DCC_CHECK,HTML_90_100,HTML_IMAGE_RATIO_04,HTML_MESSAGE,MIME_BOUND_NEXTP
ART,OPTING_OUT,URIBL_SBL,URIBL_WS_SURBL 
scantime=0.3,size=8798,mid=<10962760108851042@mail2.ruraltel.net>,bayes=0.098527
1200051288,autolearn=no
i: connection from p34n0.ruraltel.net [24.225.0.34] at port 47706
i: bad protocol: header error: (closed before headers)
i: connection from p35n0.ruraltel.net [24.225.0.35] at port 38263

on spamd3 yesterday

i: result: Y 11 - 
BAYES_50,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL 
scantime=0.2,size=2940,mid=<200409270903.i8R93xH6016221@vabeach.matchmaker.com>,
bayes=0.573393109179598,autolearn=no
i: connection from p35n0.ruraltel.net [24.225.0.35] at port 37916
i: bad protocol: header error: (closed before headers)
i: connection from p35n0.ruraltel.net [24.225.0.35] at port 38257

from spamd4 on the 25th..

i: result: .  0 - 
AWL,BAYES_00,HTML_MESSAGE,HTML_NONELEMENT_00_10,HTML_TEXT_AFTER_BODY,HTML_TEXT_A
FTER_HTML,MIME_HTML_ONLY,TO_ADDRESS_EQ_REAL,URIBL_SBL 
scantime=1.3,size=43426,mid=<24628953.1096103652348.JavaMail.root@straight-poop-
mail.com>,bayes=0,autolearn=no
i: connection from p36n0.ruraltel.net [24.225.0.36] at port 59415
i: bad protocol: header error: (closed before headers)
i: connection from p35n0.ruraltel.net [24.225.0.35] at port 45541

is anyone else seeing this 'header error'?

Comment 6 Dallas Engelken 2004-09-28 10:54:40 UTC

here is what i have found.  take a look at the ps listing below and you'll see
the START time on the first 7 children is 03:XX and the last 8 children is
12:XX.. so i lsof a couple of the 03:XX children to see what the hell they are
still doing.   looks like FD6 is the tcp socket to the spamc client that
initiated the request.

gdb shows the stale children are in a read() state still

0x420dae14 in read () from /lib/i686/libc.so.6

there is no active spamc client on the other side of the TCP connection however,
since they run with a 60 second timeout, so i assume that is why its stuck in
read() forever?

here is the ps and lsof's of a could old children.   once all the children get
old, spamd no longer processes requests, hence my problem.


root     18297  0.0  0.3 27240 2940 ?        S    Sep27   0:02 /usr/bin/perl -T
-w /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0
-q -x -m 15 --max-conn-per-child 75
root      2015  0.0  3.1 31756 24420 ?       S    03:07   0:22 spamd child
root      2577  0.0  3.1 31080 23740 ?       S    03:28   0:14 spamd child
root      2586  0.0  3.3 33092 25668 ?       S    03:29   0:14 spamd child
root      3069  0.0  3.2 32356 24848 ?       S    03:46   0:09 spamd child
root      3210  0.0  3.0 30588 22964 ?       S    03:53   0:05 spamd child
root      3216  0.0  2.9 30260 22592 ?       S    03:53   0:05 spamd child
root      3342  0.0  2.9 30468 22692 ?       S    03:59   0:04 spamd child
root     30254  3.4  3.2 32376 24968 ?       S    12:29   0:20 spamd child
root     30308  3.5  3.4 33440 26084 ?       S    12:30   0:21 spamd child
root     30367  3.6  3.4 33444 26248 ?       S    12:30   0:20 spamd child
root     30379  3.2  3.3 32880 25488 ?       S    12:31   0:17 spamd child
root     30439  3.0  3.2 32020 24628 ?       S    12:32   0:14 spamd child
root     30544  2.8  3.0 30736 23264 ?       S    12:33   0:11 spamd child
root     30755  2.5  2.9 29956 22196 ?       S    12:37   0:04 spamd child
root     30771  2.1  2.8 29312 21460 ?       S    12:37   0:03 spamd child
root     30948  0.0  0.0  1764  588 pts/2    R    12:40   0:00 grep spamd

[root@spamd3 spamd]# lsof -p 3216
..
spamd   3216 root    0r   CHR        1,3           67004 /dev/null
spamd   3216 root    1w  FIFO        0,5            2768 pipe
spamd   3216 root    2w  FIFO        0,5            2768 pipe
spamd   3216 root    3r   REG        3,2   73077  328406 /usr/bin/spamd
spamd   3216 root    4w  FIFO        0,5            2768 pipe
spamd   3216 root    5u  IPv4 1362568002             TCP *:783 (LISTEN)
spamd   3216 root    6u  IPv4 1382114401             TCP
24.225.0.50:783->p35n0.ruraltel.net:60137 (ESTABLISHED)

[root@spamd3 spamd]# lsof -p 3342
..
spamd   3342 root    0r   CHR        1,3           67004 /dev/null
spamd   3342 root    1w  FIFO        0,5            2768 pipe
spamd   3342 root    2w  FIFO        0,5            2768 pipe
spamd   3342 root    3r   REG        3,2   73077  328406 /usr/bin/spamd
spamd   3342 root    4w  FIFO        0,5            2768 pipe
spamd   3342 root    5u  IPv4 1362568002             TCP *:783 (LISTEN)
spamd   3342 root    6u  IPv4 1382113269             TCP
24.225.0.50:783->p35n0.ruraltel.net:59749 (ESTABLISHED)

Comment 7 Theo Van Dinter 2004-09-28 11:26:27 UTC

Subject: Re:  spamd parent stops accepting requests

On Tue, Sep 28, 2004 at 10:54:41AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> there is no active spamc client on the other side of the TCP connection however,
> since they run with a 60 second timeout, so i assume that is why its stuck in
> read() forever?

in theory, the read() should return an EOF since the other end disconnected
and then the spamd code would catch the failure and exit.

it seems like what is happening is that all of the children get made busy with
requests, and requests keep coming in and end up getting queued by the OS
waiting for a child to do an accept().  At some point, the client decides to
give up, and either the OS doesn't clear the connection out of the queue, and/or
IO::Socket isn't doing the right thing.

There's 2 getline() commands.  The first gets the request and explicitly
checks for an error.  The other is a loop to read in the messge:

  while ( $_ = $client->getline() ) {

which lets the while check for the error.  Right after the loop, there's
a call to parse(), without an error check to make sure that there was
actually any data read for the message.  I don't know what parse() will
do at that point, but there should be a quick check in spamd to make
sure a message was actually read in and return an error if there wasn't.

BTW: This seems related to bug 3782.

Comment 8 Dallas Engelken 2004-09-28 11:32:01 UTC

> BTW: This seems related to bug 3782.

"You are not authorized to access bug #3782."

Comment 9 Justin Mason 2004-09-28 12:07:44 UTC

strace output would be *really* useful here.  (although possibly hard to gather :( )

Comment 10 Tom Parkison 2004-10-04 09:20:12 UTC

I was in the FreeNode SpamAssassin IRC chat room and a guy in there told me to 
post some of this info.

Apparently, we are experiencing the same issues.

OS: RedHat Enterprise Linux ES release 3 (Taroon Update 2)
Kernel: 2.4.21-20.ELsmp
Processor: Intel(R) Pentium(R) 4 CPU 2.80GHz HT
Perl: 5.8.1

We are running CPanel as our Control Panel on the server.  We have been in 
talks with CPanel Technical Support and they seem to be at a loss as to why 
this is occuring.

We are running Exim 4.42 with SpamAssassin 3.0.0.

The ROOT SpamD daemon runs, spawns 5 SpamD child processes, and the SpamC 
processes run as users.

For some reason, child SA processes will hang, causing SMTP 421 Lost Incoming 
Connection errors that pretty much stops the flow of all email on the server.  
The only way we have been able to kill SpamAssassin when this happens is to 
execute "kill -9 `ps -aef | grep spam | grep -v grep | awk '{print $2}'`" and 
then invoke "spamd -d" to restart it.  Sometimes this works, sometimes not.

I have grabbed some STRACE data from the server and the SpamC processes seem 
to be stuck in a RECV state, as indicated in the data below.

# strace -f -p 13506
Process 13506 attached - interrupt to quit
recv(3,
# strace -f -p 14119
Process 14119 attached - interrupt to quit
recv(4,
# strace -f -p 14106
Process 14106 attached - interrupt to quit
recv(3,
# strace -f -p 14084
Process 14084 attached - interrupt to quit
recv(3,

Hopefully, this will be the ammo you need to start killing this bug.

Comment 11 Dallas Engelken 2004-10-04 09:46:07 UTC

well, i can only find this happening in 1 location right now, and really the
only difference is this location runs a spamd cluster, and dns round robins the
requests...  spamd is invoked like this:

 /usr/bin/spamd --syslog-socket=none -A 127.0.0.0/8,24.225.0.0/24 -i 0.0.0.0 -
q -x -m 15 --max-conn-per-child 75

and spamc clients connect from remote hosts like this...

/usr/bin/spamc -dspamd -H -p783 -t30 -x -u pathall@ruraltel.net

the differences between this setup and all other setups i have that run just
fine are:

1) -dspamd vs -dlocalhost
2) -H flag on spamc call
3) -A flag is 127.0.0.0/8,24.225.0.0/24 vs just 127.0.0.0/8

i thought maybe it was the network infastructure causing it since i have only
seen this happen at a single location, but there doesnt appear to be anything
wrong here.  The 3 MTA's are plugged into the same switch that the spamd cluster
is on, so there is no routing or anything involved here.

i'm still trying to get better debug of it,  its very difficult to do.  i have
written some spamd child timeouts into the spamd daemon that appear to be
working good also.  prevents me from having to restart spamd now to regain mail
flow.   i'll try and clean it up and add a patch for review if anyone is
interested... however, this is not the answer to fixing this problem, but may be
the answer to help prevent runaway children that we have seen in the past.

d

Comment 12 Justin Mason 2004-10-04 10:35:06 UTC

could everyone seeing this bug, post their spamd and spamc command line
switches?  perhaps it's something to do with -H...

Comment 13 Alessandro Ranellucci 2004-10-04 11:09:41 UTC

I'm experiencing the same problem with spamc hanging on my mail server:
bash-2.05b# ps aux | grep spamc
qscand   23346  0.0  0.0  1904    4 ?        S    19:40   0:00 /usr/bin/spamc -c -f -d 192.168.1.23 -u 
user@example.com
bash-2.05b# strace -f -p 23346
recv(3, 

In the mean time a spamd child hangs on the spamd server:
bash-2.05a# ps aux | grep spamd
spamc  19519  104.1  2.7    22084  14128  ??  R     6:40PM  11:46.04 spamd child                                                                                                       
spamc  19507   0.0  1.8    19728   9608  ??  S     6:40PM   0:01.82 perl -T -w /usr/bin/spamd -i 
192.168.1.23 -A 192.168.1.18 -x -u spamc
spamc  19509   0.0  2.4    20676  12576  ??  S     6:40PM   0:00.82 spamd child                                                                                                       
spamc  19510   0.0  2.3    20260  12132  ??  S     6:40PM   0:00.40 spamd child                                                                                                       

Finally, spamc times out while spamd keeps eating CPU. This is the situation on the spamd server:
bash-2.05a# lsof | grep 19519
perl      19519  spamc  cwd    VDIR       14,2       6528        2 / (/dev/disk0s2)
perl      19519  spamc    0r   VCHR        3,2        0t0 34932996 /dev/null
perl      19519  spamc    1w   VCHR        3,2  0t1012094 34932996 /dev/null
perl      19519  spamc    2w   VCHR        3,2  0t1012094 34932996 /dev/null
perl      19519  spamc    4r   VREG       14,2      73087   377323 / (/dev/disk0s2)
perl      19519  spamc    5w   VREG       14,2     502932   377858 / (/dev/disk0s2)
perl      19519  spamc    6u   inet 0x02f85f4c        0t0      TCP 192.168.1.23:783 (LISTEN)
perl      19519  spamc    7u   inet 0x02f874cc        0t0      TCP 192.168.1.23:783->192.168.1.18:
48499 (CLOSE_WAIT)

When all the spamd childs are blocked like this one my spamd server just stops processing requests to 
port 783.

I set up a cron every 30 minutes to run "kill `ps -aex | grep 'spamd child' | grep '??  R' | grep -v 'grep' | 
awk '{print $1}'`". This is a temporary solution of course.

Comment 14 Theo Van Dinter 2004-10-04 12:10:12 UTC

Can someone experiencing this issue post the following:

ps auwwx (or -ef) output
strace the spamd pids
lsof the spamd pids
strace the spamc pids
lsof the spamc pids


I'm going to guess the lsof of spamc shows it waiting on results from spamd, and spamd is waiting for 
input from spamc.  but no one's posted the above information all in the same comment yet. ;)

Comment 15 Dallas Engelken 2004-10-04 13:59:08 UTC

quick note.  Alessandro uses -A flags on spamd, and -d flag on spamc just like i
do.  dunno if that plays any importance or not.

>> Can someone experiencing this issue post the following:

i think have provided most of what i can so far. 

stracing the spamd children are easy, but you will find there are no spamc
clients to strace because they have all went away due to the spamc timeout.  
dunno why spamd children still hang onto a socket with no client on the other
end??   my guess is since this is so random (2 hours to make 15 children stale
doing 10k/hour), the only time it is really occuring is when the spamc timeout
is encountered.   i can probably set the timeout to 1 sec and try that theory
myself.

>> ps auwwx (or -ef) output

see initial post.. note the start times.

>> strace the spamd pids

see initial post.. "strace the parent shows wait(), children show read()."

>> lsof the spamd pids

see comment #5 for lsof on 2 children.  you want full lsof or just filehandles?

>> strace the spamc pids

impossible or very hard to match spamc requests with which spamd server they
went to since it is round-robin.   also, i never see stale spamc clients because
of the spamc timeout.   guess i could up the timeout limit to something really
huge and try and catch it.

>> lsof the spamc pids

same thing as above.

Comment 16 Tom Parkison 2004-10-05 10:31:40 UTC

By the way, we normally start SpamAssassin by running "spamd -d".

Comment 17 Tom Parkison 2004-10-05 15:14:44 UTC

Also, we have put in this...

/sbin/iptables -A INPUT -i lo -s 0.0.0.0/0 -d 0.0.0.0/0 -j ACCEPT

I don't know if it helped or hurt the issue.

Comment 18 Alessandro Ranellucci 2004-10-06 02:17:13 UTC

Created attachment 2420 [details]
Spam message possibly causing the problem

I think (but I'm not sure) that the problem is caused by something related to
messages. This message caused spamc/spamd to hang. Does this help?

Comment 19 Dallas Engelken 2004-10-06 06:13:31 UTC

hmm.. i didnt have any problem scanning that message.

2004-10-06 13:12:22 [29542] i: connection from localhost.localdomain [127.0.0.1]
at port 40076
2004-10-06 13:12:22 [29542] i: processing message
<002c01c4ab50$dcb5ce3a$a9ca1cf0@ytp> for root:0.
2004-10-06 13:12:25 [29542] i: identified spam (8.3/4.0) for root:0 in 3.0
seconds, 50633 bytes.
2004-10-06 13:12:25 [29542] i: result: Y  8 -
BODY_8BITS,EXTRA_MPART_TYPE,FORGED_OUTLOOK_TAGS,HTML_60_70,HTML_FONT_BIG,HTML_MESSAGE,HTML_TITLE_EMPTY,MIME_HTML_MOSTLY,MPART_ALT_DIFF,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,RCVD_BY_IP,UNWANTED_LANGUAGE_BODY
scantime=3.0,size=50633,mid=<002c01c4ab50$dcb5ce3a$a9ca1cf0@ytp>,autolearn=no

Comment 20 Dallas Engelken 2004-10-06 08:32:07 UTC

Created attachment 2421 [details]
Adds 2 new timeouts to spamd

this patch will add a tcp connect timeout option (--timeout-tcp) which is the
time between when the socket was made and then the first header.   If a client
makes a connection but never sends headers, their connection will close server
side in --timeout-tcp seconds.

after it sees the first header, the --timeout-child kicks in.  if the time it
takes from after it process the header till the time its ready to return
exceeds this timeout, the connection will be closed.



usage... 
-T or --timeout-tcp   = tcp socket timeout before seeing headers.
-t or --timeout-child = child processing timeout after seeing headers.

testing..  just using low values here to demonstrate...

 
[root@dev root]# spamd -d -T5 -t10

[root@dev root]# telnet localhost 783
Connected to localhost.
Escape character is '^]'.
SPAMD/1.0 79 Timeout: (5 second socket timeout reading input from client)
Connection closed by foreign host.

[root@dev root]# telnet localhost 783
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
PROCESS SPAMC/1.1
SPAMD/1.0 79 Timeout: (10 second timeout while trying to PROCESS)
Connection closed by foreign host.

logging timeouts through logmsg shows...

2004-10-04 21:04:24 [6406] i: connection from localhost.localdomain [127.0.0.1]
at port 39424
2004-10-04 21:04:29 [6406] i: timeout: (5 second socket timeout reading input
from client)
2004-10-04 21:04:31 [6407] i: connection from localhost.localdomain [127.0.0.1]
at port 39425
2004-10-04 21:04:45 [6407] i: timeout: (10 second timeout while trying to
PROCESS)


for those of you experiencing this bug, this is a good patch for you until we
can resolve the real issue.

Comment 21 Michael Owen 2004-10-06 09:59:48 UTC

I've applied that patch to an instance of ours that was quitting every day or
two, so we'll see how it handles. 

Solaris 9 with Perl 5.8.3 on a V240 in case it helps. 

using --timeout-tcp=5 and --timeout-child=50

Comment 22 Dallas Engelken 2004-10-06 12:06:46 UTC

hrmm... i finally have a dead spamd on a new box.  here are the details of what
i found.  as you can see, it was using 99% CPU for a long ass time before i
noticed it.  <insert bad sysadmin joke here>.

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
29579 root      15   0 24620  14M     8 R    99.0  2.9  76:06 spamd

last 2 lines of a gdb spamd 29279 shows..
> Loaded symbols for /lib/libnss_dns.so.2
> 0x08099ddd in Perl_runops_standard ()

couple of my children are now zombies... the other one is using all the
resources on the box.   

[root@mailgw root]# ps auxww | grep spamd
root      8908  0.0  0.0 23972  184 ?        S    Sep24   0:01 /usr/bin/perl -T
-w /usr/bin/spamd --syslog-socket=none -r /var/run/spamd.pid -A 127.0.0.0/8 -i
127.0.0.1 -q -D -x -m 3 --max-conn-per-child 10
root     24796  0.0  0.0     0    0 ?        Z    Sep30   0:03 [spamd <defunct>]
root     24929  0.0  0.0     0    0 ?        Z    Sep30   0:03 [spamd <defunct>]
root     29579 86.0  2.9 28764 15036 ?       R    12:14  76:37 spamd child

[root@mailgw root]# strace -p 29579
[root@mailgw root]#

good thing is, i modified my spamd dbg to log the pid on all lines... so i can
easily find what this one was doing.....

[root@mailgw spamd]# grep "\[29579\]" * | tai64nlocal | tail -4
Oct 06 12:23:10.599032 check[29579]: [ 8] already have 1 discovery servers
Oct 06 12:23:10.599094 check[29579]: [ 8] Checking with Razor Discovery Server
66.151.150.12
Oct 06 12:23:10.599172 check[29579]: [ 6] No port specified, using 2703
Oct 06 12:23:10.599208 check[29579]: [ 5] Connecting to 66.151.150.12 ...

appears to have hung on razor2 :(

[root@mailgw spamd]# razor-admin -v
Razor Agents 2.40, protocol version 3

i'm gonna upgrade to 2.61 to see if this goes away.

dallas

Comment 23 Dallas Engelken 2004-10-06 12:25:31 UTC

> appears to have hung on razor2 :(

okay.. scratch the razor2 comment.  i'm not 100% sure since i thought i had 'sub
dbg' changed to include pid on every line and i didnt... damn!

i was trying to trace it through my debug log and its just too tough to do
without pids on every line.  i have my dbg() changed so i can figure this out
next time hopefully.

Comment 24 Justin Mason 2004-10-06 13:42:10 UTC

*** Bug 3782 has been marked as a duplicate of this bug. ***

Comment 25 Tom Parkison 2004-10-08 21:27:00 UTC

Any more ideas regarding this bug?  Are we closer to a fix?

Comment 26 Tom Parkison 2004-10-14 07:21:29 UTC

I keep hearing about a patch, but when will it be released to the general 
public.

Comment 27 Justin Mason 2004-10-14 10:05:01 UTC

the patch is on comment 20:
http://bugzilla.spamassassin.org/attachment.cgi?id=2421&action=view

people who are seeing this issue -- does this patch fix it/work around it?   we
need feedback as to whether it works or not, before we can consider
incorporating that fix into 3.0.1.

BTW, a hang in Perl_runops_standard() generally would indicate a perl compiler
bug, afaik... trying other versions of perl may help.

Comment 28 Michael Owen 2004-10-14 10:58:10 UTC

We haven't had any problems since applying this, and it's been a little over a
week, with SA processing ~200k emails a day. Even if this patch doesn't solve
everyone's issues, it's a good thing to have, as the more options the better.

Comment 29 Lazar Obradovic 2004-10-15 02:14:22 UTC

Subject: Re:  spamd parent stops accepting requests

As I believe I mentioned, I made a similar patch with alarms, and it
worked perfectly.
 
I have tried patch you've sent and it also works good, so I guess it's
ready for 3.0.1.

Comment 30 Tom Parkison 2004-10-17 20:15:08 UTC

Yeah, but that is the thing.  We install SpamAssassin directly for the CPAN 
Distributions, custom hacks are really something we want to play with.  When 
this patch becomes part of the standard CPAN Distribution, we will install it.

Comment 31 Tom Parkison 2004-10-17 20:16:00 UTC

Correction...
"custom hacks AREN'T really something we want to play with"

Comment 32 Lukas Karlsson 2004-10-18 07:24:09 UTC

I am having a similar problem to what other people have described.  I am running FreeBSD and am 
installing SpamAssassin using the FreeBSD port p5-Mail-SpamAssassin.  spamd runs fine for a few 
hours but then stops spawning new children and stops accepting connections (all spamc connections 
timeout until spamd is restarted).

I've applied the patch and will see if this changes the behavior.  I would say that if I can go 24 hours 
without a problem then the patch fixed it.

/l

Comment 33 Ben Thomas 2004-10-23 07:51:24 UTC

Dallas's patch seems to be providing some much needed relief for the systems
that I've seen affected by this issue. Furthermore, I think its just good form
to have the spamd children timeout after a period of time by default. I know
this is kind of a kludge for the real issue, but it seems workable and
reasonable. It's too bad that 3.0.1 is already out; I would've liked to see this
patch go into that release.

Comment 34 Tom Parkison 2004-10-23 11:31:21 UTC

Dallas Engelken,

http://bugzilla.spamassassin.org/show_bug.cgi?id=3828#c20

Any plans on making that one SpamD/SpamC timeout patch of yours available for 
SpamAssassin 3.0.1?  Or is the file that you patch the same in 3.0.1 that a 
new patch won't be needed for 3.0.1?

Comment 35 Justin Mason 2004-10-23 12:29:30 UTC

Sounds like we need to apply this patch for 3.0.2, since it's now widely
reported to fix the bug

Comment 36 Michael Parker 2004-10-23 13:20:04 UTC

Subject: Re:  spamd parent stops accepting requests

On Sat, Oct 23, 2004 at 12:29:31PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> ------- Additional Comments From jm@jmason.org  2004-10-23 12:29 -------
> Sounds like we need to apply this patch for 3.0.2, since it's now widely
> reported to fix the bug
> 

You mean, widely reported to work around the bug.  I don't believe it
fixes anything does it?  It just adds a timeout(s) so if things get
wedged they will become unwedged.

I would hope we could find the actual cause of the bug.

I do however think the timeout patches are a good thing.

Michael

Comment 37 Tom Parkison 2004-10-24 16:52:13 UTC

I know.  The patch has helped, but hasn't totally fixed the issue.  I thank 
Dallas Engelken for engineering this patch, it has helped.

Now, we need to get the developers of SpamAssassin to know about this issue.  
This issue has been an issue since the days of SpamAssassin 2.64.

I don't know why they didn't program in a timeout from the beginning.  Every 
good programmer knows to code in a way to escape potential unending loops, and 
in this case, an unending loop is occuring.  A loop that is sitting, waiting 
for data from a TCP/IP link that has been severed in a way that SpamAssassin 
doesn't know that it has been severed.

Comment 38 Dallas Engelken 2004-11-02 12:33:56 UTC

quick note that i'm seeing on this patch.  a low timeout (ie 1-90 seconds) will 
timeout a child running a bayes auto-expire almost every time since the auto-
expire usually takes more than 90 seconds on a 5-10mb _toks.

so if you run a low child-timeout, your bayes database may never auto expire... 
i run a manual sa-learn --force-exire via cron, as this off loads it from the 
child and lets the child process mail instead of expiring toks.  

FYI, if you run spamc with a timeout and use auto expiry, you are most likely 
seeing spamc exit 74 assuming you are running spamc with -x flag... so this 
will most likely cause a tempfail.

hopefully auto-expiry through spamd will someday be the job of the parent and 
not the children, this would help a bunch in this situation.

Comment 39 Theo Van Dinter 2004-11-02 12:47:10 UTC

Subject: Re:  spamd parent stops accepting requests

On Tue, Nov 02, 2004 at 12:33:57PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> hopefully auto-expiry through spamd will someday be the job of the parent and 
> not the children, this would help a bunch in this situation.

I actually have some plans about spamd for 3.1, where the expires won't
happen in the scan/process child, but in a secondary "helper" child which the
parent spawns off.  ie: the scan child passes the information to the parent
than a sync/expire needs to occur, then the parent passes that to the helper
child to actually do the work.

Comment 40 Rob Boudrie 2004-11-03 05:17:54 UTC

Setting the tcp and child timouts to 10 seems to have solved the problem 
(thanks to ben@cpanel.net for pointing this out), and that seems to have solved 
the problem.

Since spamd is a critical process generally left running on a perpetual basis, 
would it make sense to create a guard process whose job it is to monitor the 
spamd process functioning, and restart it if there is a problem?  Apache and 
MySql for example use this technique (though I am not certain they check 
for "unresponsive" processes - they may only look for "down" processes)

Comment 41 Tom Parkison 2004-11-09 14:20:15 UTC

We upgraded Perl from version 5.8.1 to Perl version 5.8.4 and things seem to 
have ironed out.

I am really really really crossing my fingers.

Comment 42 Tom Parkison 2004-11-09 14:44:24 UTC

You know, just when I thought we killed it, it came back.  "The Attack of the 
SpamAssassin Lockups, Part 2!"

Comment 43 Dallas Engelken 2004-11-10 11:26:09 UTC

well.. i've been hearing and seeing a few reports of spamd still getting looped
up even when running with  --timeout-tcp and --timeout-child.  just today i
caught one,

1042 minutes on one child,  11 minutes on another... 

  1:05pm  up 30 days,  4:42,  1 user,  load average: 4.28, 3.56, 3.35
134 processes: 126 sleeping, 6 running, 2 zombie, 0 stopped
CPU states: 98.5% user,  1.4% system,  0.0% nice,  0.0% idle
Mem:   253876K av,  206464K used,   47412K free,       0K shrd,   12868K buff
Swap:  522104K av,  221484K used,  300620K free                   33372K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 7730 root      19   0 27544   32     4 R    45.1  0.0  1042m spamd
 8490 root      19   0 25612   24     4 R    44.4  0.0  11:22 spamd
19905 root      20   0 26352  19M  1480 R     4.4  7.8   0:05 spamd
20512 qmailq    20   0  1924 1924  1148 R     1.4  0.7   0:00 perl5.6.1
20454 root      10   0  1116 1116   848 R     0.2  0.4   0:00 top
    1 root       8   0   112   64    48 S     0.0  0.0   0:05 init

that tells me that it must be looping inside another eval{} somewhere...
otherwise the --timeout-child would kick in.  I'll ask, before I go hunting...
is there any eval'd code that is not alarmed?  razor, pyzor, dcc, rbl, uribl, etc?

unfortunately i wasnt able to gather much information on this one since i didnt
have debug turned on.  this is the second time in as many days that this box has
done this.  i thought it might have been bayes expiry, so i disabled bayes
yesterday and the problem came back...  so i have ruled out bayes (specifically
auto-expiry and learning) for now and will debug further and hope the problem
comes back tommorrow (as bad as that sounds :)

d

Comment 44 Tom Parkison 2004-11-11 09:49:55 UTC

That is kinda' reassuring, at least the part about us not being the only one 
it hangs up on still.

We run SpamAssassin with the following options...
/usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed-ips=127.0.0.1 --
pidfile=/var/run/spamd.pid --max-children=10

Any clues?  We don't use Bayes.

Comment 45 Dallas Engelken 2004-11-12 07:27:03 UTC

tom, what version of spamasassin are you running??

i had build 51813 running, and found that razor2 was not honoring the alaram
$timeout because it was not being set correctly due to the conf being
razor2_timeout instead of razor_timeout in Plugin/Razor2.pm.

old code...

  $self->{razor2_available} = 0;
  $mailsaobject->{conf}->{razor2_timeout} = 10;

if razor2 had problems, it never received a SIGALM... not sure if this could
have caused the long processing times that i was seeing or not... but it could
be.  i saw a system that had razor2 outbound ports blocked, and spamd would not
even start... it would hang trying to prepare_objects() because $timeout was
null.  couple debug lines i added..

debug: generic: create new razor2 client
debug: generic: prepare objects timeout set to []
CTRL+C
# 

once i hard-coded that timeout, it started up just fine.  any ways, i just
thought i'd share that.   I'm going to make sure the boxes that i normally see
spamd acting up on are correctly setting razor_timeout.

d

Comment 46 Tom Parkison 2004-11-12 08:38:08 UTC

I am not sure what build of SpamAssassin that we are running, all I know is that
it is version 3.0.1 that was released on October 22, 2004.

We haven't tried any of the CVS-downloadable versions.

Comment 47 Tom Parkison 2004-11-12 08:41:56 UTC

I don't think we use Razor, we just use the standard DNS Blacklists as well as
the internal heuristics/rules of SpamAssassin to detect if a message is spam.

Comment 48 Justin Mason 2004-11-12 11:00:58 UTC

here's something to try.  folks, could you change your spamd startup scripts to
set:

   PERL_SIGNALS=unsafe
   export PERL_SIGNALS

before running spamd?  (this only works if you're using perl >= 5.8.1.  perl
5.6.x already uses this anyway.)
also what version of perl is everyone running? and platforms?

Comment 49 Tom Parkison 2004-11-12 13:25:34 UTC

> PERL_SIGNALS=unsafe
> export PERL_SIGNALS
Where would we put this?  Along with this stuff?

> /usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed-
ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10

Last time I checked, SpamD doesn't accept those inputs.  I checked the inputs 
with "SpamD --help" and they weren't in it.

We are running Perl version 5.8.4.

Comment 50 Tom Parkison 2004-11-12 13:27:51 UTC

Oh, one more thing...

RedHat Enterprise 3 and an Intel Pentium 4 HT CPU.

Comment 51 Dallas Engelken 2004-11-12 13:35:25 UTC

you put that in your startup script for spamd.  init script or daemontools run
script.

platform here is redhat 7.3 (fedoralegacy updates), perl 5.6.1

Comment 52 Tom Parkison 2004-11-12 14:01:07 UTC

Like this...
/usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed-
ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10 PERL_SIGNALS=unsafe
export PERL_SIGNALS

Comment 53 Tom Parkison 2004-11-12 14:02:13 UTC

Just out of curiosity, what is this supposed to do?

Comment 54 Justin Mason 2004-11-12 14:45:53 UTC

it means that signals (like alarm()) can interrupt even inside a single perl OP.

Comment 55 Justin Mason 2004-11-12 14:46:46 UTC

uh, btw, more like this:

PERL_SIGNALS=unsafe
export PERL_SIGNALS
/usr/bin/spamd -d --timeout-child=10 --timeout-tcp=5 --allowed-
ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=10

Comment 56 Tom Parkison 2004-11-12 15:10:56 UTC

Ok, I had to do some hacking of the CPanel Service Restart Scripts.

Comment 57 Tom Parkison 2004-11-12 15:14:57 UTC

Had to hack /scripts/restartsrv_exim to input the commands inside of SYSTEM()
calls before executing the SpamD startup commands.

The only thing is, when I execute ENV at the ROOT command line, it doesn't show
as one of the entries in the environment.

Comment 58 Justin Mason 2004-11-12 15:19:19 UTC

I don't think that's going to work :( anyone know a way to do it on that env?

Comment 59 Tom Parkison 2004-11-12 15:20:58 UTC

I also put it into "/etc/rc.local" to have those two commands load at boot.

Comment 60 Tom Parkison 2004-11-12 15:31:09 UTC

Don't know how much help this will be...but here it goes!

Nov 10 02:08:11 cleveland spamd[30332]: timeout: (120 second timeout while
trying to PROCESS)
Nov 10 03:59:14 cleveland spamd[30332]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN2408> line 692.
Nov 10 03:59:47 cleveland spamd[30430]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024.
Nov 10 03:59:51 cleveland spamd[30312]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN2069> line 562.
Nov 10 04:01:38 cleveland spamd[31503]: Failed to run BAYES_40 SpamAssassin
test, skipping:__( child processing timeout  at /usr/bin/spamd line 1024,
<GEN2248> line 732._)
Nov 10 04:06:24 cleveland spamd[31503]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN2274> line 76.
Nov 10 04:09:49 cleveland spamd[19107]: Failed to run DNS_FROM_AHBL_RHSBL RBL
SpamAssassin test, skipping:__( child processing timeout  at /usr/bin/spamd line
1024, <GEN3> line 124._)
Nov 10 08:18:31 cleveland spamd[20770]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024.
Nov 10 10:20:41 cleveland spamd[20765]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN4333> line 53.
Nov 10 10:20:43 cleveland spamd[4809]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN82> line 176.
Nov 10 10:21:48 cleveland spamd[6644]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN34> line 49.
Nov 10 13:08:11 cleveland spamd[5713]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024.
Nov 11 07:01:06 cleveland spamd[10433]: Failed to run BAYES_40 SpamAssassin
test, skipping:__( child processing timeout  at /usr/bin/spamd line 1024,
<GEN377> line 1315._)
Nov 12 04:55:36 cleveland spamd[6407]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN2823> line 143.
Nov 12 04:56:36 cleveland spamd[28635]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN3531> line 37.
Nov 12 04:59:04 cleveland spamd[30330]: Failed to run BAYES_40 SpamAssassin
test, skipping:__( child processing timeout  at /usr/bin/spamd line 1024,
<GEN3157> line 59._)
Nov 12 04:59:40 cleveland spamd[32083]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024.
Nov 12 04:59:58 cleveland spamd[3089]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024.
Nov 12 05:02:09 cleveland spamd[30046]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN3> line 50.
Nov 12 05:02:20 cleveland spamd[28635]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN3546> line 120.
Nov 12 05:03:22 cleveland spamd[3089]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN3370> line 85.
Nov 12 05:03:23 cleveland spamd[32083]: syslog() failed:  child processing
timeout  at /usr/bin/spamd line 1024, <GEN3543> line 972.
Nov 12 05:04:07 cleveland spamd[30433]: Failed to run RCVD_IN_XBL RBL
SpamAssassin test, skipping:__( child processing timeout  at /usr/bin/spamd line
1024, <GEN3> line 143._)

Comment 61 Justin Mason 2004-11-13 17:00:38 UTC

hmm.  syslog() and DNS lookup failures?  are you logging to localhost? where's
your DNS server? are you having slowness for DNS lookups or network accesses to
the log server?

Comment 62 Tom Parkison 2004-11-13 17:06:47 UTC

We use the Data Center's Name Servers as the two main lookup servers, we use 
localhost as the last resort.

Comment 63 Justin Mason 2004-11-13 18:08:55 UTC

ok, I'm +1 for applying Dallas' patch.  possibly upping the 120 sec default,
though, at checkin...

I think we're in agreement that we need this.

Comment 64 Tom Parkison 2004-11-13 18:11:20 UTC

Excuse me?  Yes, Dallas' did help some, but it hasn't fixed everything.  Are 
you saying that you (Justin Mason) have some more to contribute to the patch?

Comment 65 Justin Mason 2004-11-13 18:18:40 UTC

btw Tom -- I'd strongly suggest setting up a caching nameserver on localhost. 
it'll greatly increase scanning speed; a localhost DNS access is much faster
than one that hits the wire, even with an unloaded server (which you're not
guaranteed this one is).

Comment 66 Justin Mason 2004-11-13 18:25:50 UTC

Tom: Dallas' patch helps fix part of the problem.

your bugs are possibly something else, but it's important to get one part out of
the way first.

I would definitely suggest getting a local caching NS setup asap.

Comment 67 Tom Parkison 2004-11-13 19:12:30 UTC

We currently have a Name Server (Bind) setup on the box.  That is used to 
server as DNS for the web sites residing on the box.

Comment 68 Dallas Engelken 2004-11-14 07:55:46 UTC

> ok, I'm +1 for applying Dallas' patch.  possibly 
> upping the 120 sec default, though, at checkin...

i agreee, the --timeout-tcp can be relatively low, but --timeout-child probably
needs to be 300 seconds to prevent killing auto-expires of bayes... 

i'm still waiting for another spamd child to hang.. i have full debugging on and
producing 450MB tarball logs each day.  :)

i will get to the bottom of this soon.

Comment 69 Tom Parkison 2004-11-14 08:14:43 UTC

Keep working Dallas, keep your debugging on.

Any clues for me?  Should I be using the Bind Server that is on the server 
itself to be querying lookups?

Comment 70 Dallas Engelken 2004-11-14 08:30:27 UTC

tom, i think your problem is your setting for --timeout-child.  move it from 10
seconds (i believe you said you were running with this in one post) to 30
seconds.    SA can take 10 seconds to process a message easy under load or slow
network tests.  

see if that makes your child timeouts go away.  30 second timeout without
running bayes should just just fine.

Comment 71 Daniel Quinlan 2004-11-14 16:58:26 UTC

Looks good, but before I give a +1, I'd really like to see some type of
performance test under normal conditions to verify that there's no degradation.

Very needed option!  :-)

Comment 72 Justin Mason 2004-11-14 17:00:35 UTC

yeah, a 10-second timeout with network tests and bayes on, and a DNS server on
another host, is definitely going to be too slow and produce timeouts.
I agree that 30s is more likely to be acceptable, and Tom, you should use that.

But I *really* recommend using the local BIND server, or some kind of caching
nameserver, on the same box; it's not only better for speed, it reduces network
traffic and is generally a very good idea for efficiency.

I'm not sure what you might wnat to do there.  is it necessary to run the spamds
on the same machine as user DNS servers?  It would be much more efficient if you
can run a spamd on its own machine, with its own caching DNS server for DNS lookups.

Comment 73 Tom Parkison 2004-11-14 17:07:04 UTC

Ok, I have changed that.

But, it doesn't help the situation that we are in right now.  It seems that 
when SpamAssassin does die, it screws everything up.  Mail doesn't flow right, 
IMAP begins to malfunction (unable to write to mail files), mail messages with 
file attachments bounce back with "SMTP 421: Message Abandoned" error messages.

At this point, the only way to fix this is to reboot the entire box, and this 
is getting irritating to say the least.

Have any clues?

Comment 74 Justin Mason 2004-11-14 17:12:26 UTC

it *should* certainly be possible ofr the rest of the mail system to carry
on resiliently even with spamd dead.   But that's something you need to ask the
CPanel folks -- it's an issue on their end.  (we will try to fix the bit about
spamd dying/hanging in the first place though. ;)

Comment 75 Tom Parkison 2004-11-14 17:39:07 UTC

I am getting sick of having to reboot this damn box everytime SpamD takes a 
shit.

You would think that this would force everything to die, right?
kill -9 `ps -aef | grep spam | grep -v grep | awk '{print $2}'`
kill -9 `ps -aef | grep imap | grep -v grep | awk '{print $2}'`
kill -9 `ps -aef | grep exim | grep -v grep | awk '{print $2}'`

Well, it doesn't.  It seems like the processes die, but "goast" processes 
remain.  So, even if you restart those services, the mail issues still remain.

This is so frustrating. :(

Comment 76 Sebastian Hagedorn 2004-11-15 04:32:36 UTC

Hi,

the patch and the PERL_SIGNALS=unsafe workaround don't work for us, because we are using Red Hat 
Enterprise Linux 3. It comes with perl 5.8.0. The PERL_SIGNALS environment variable is only honored 
for perl >= 5.8.1. According to the perl documentation and my own tests it may be preferable to set up 
the signal handlers differently. Let me quote perlipc:

"Note that some networking library functions like gethostbyname() are known to have their own 
implementations of timeouts which may conflict with your timeouts. If you are having problems with 
such functions, you can try using the POSIX sigaction() function, which bypasses the Perl safe signals 
(note that this means subjecting yourself to possible memory corruption, as described above). Instead 
of setting $SIG{ALRM} try something like the following:

    use POSIX;
    sigaction SIGALRM, new POSIX::SigAction sub { die "alarm\n" }
        or die "Error setting SIGALRM handler: $!\n";

" END OF QUOTE

Comment 77 Dallas Engelken 2004-11-15 11:28:53 UTC

I just noticed (while running a spamd --help) that in my spamd patch above, the
part that adds those 2 new timeout options to the man and --help, it lists
--timeout-child as  --timeout-timeout  - make sure whoever applies that patch
changes that on top of the default child timeout.

as a side note, i just got a spamd alert from one of my boxes... 

child pid 2351 running for  7 min, 57 sec
 -> ALERT: spamd pid 2351 ran for 7 min, 57 sec and was killed!
child pid 9321 running for  0 min, 02 sec
child pid 9490 running for  0 min, 02 sec
child pid 9708 running for  0 min, 00 sec
child pid 9711 running for  0 min, 00 sec

i'm going through the debug logs right now to figure out what it was doing.

Comment 78 Dallas Engelken 2004-11-15 11:43:24 UTC

okay, the problem child was 2351.  here is its life history.

12:28:54 - pid 2351 spawned.
12:28:58 - first message
12:31:31 - last log entry by pid 2351
12:45:01 - last log entry by spamd parent for pid 2351 cleanup

Here are the respective log entries for the 4 lines above.

2004-11-15 12:28:54.966903500 logmsg: [17464] server successfully spawned child
process, pid 2351
2004-11-15 12:31:31.173103500 logmsg: [2351] identified spam (10.3/7.0) for
$GLOBAL:0 in 1.1 seconds, 1332 bytes.
2004-11-15 12:31:31.173105500 logmsg: [2351] result: Y 10 -
DCC_CHECK,HELO_DYNAMIC_DHCP,HELO_DYNAMIC_IPADDR,HTML_50_60,HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,INFO_TLD,LONGWORDS,MIME_HTML_ONLY,MISSING_DATE
scantime=1.1,size=1332,mid=<XH5T31I8796S357q6w5Y65CE298X0W8GHE@dce.com>,autolearn=disabled
2004-11-15 12:45:01.061316500 logmsg: [17464] server hit by SIGCHLD
2004-11-15 12:45:01.061320500 logmsg: [17464] handled cleanup of child pid 2351

12:45:01 coincides with when my script sent a kill -9 to it, as it runs on a 5
minute cron.  also corresponds with the email received header  (in GMT)..
Received: (qmail 9714 invoked by uid 0); 15 Nov 2004 18:45:01 -0000

so, to sum up...

2351 processed 23 messsages, start to finish... 8 spam, 15 ham in almost a 3
minute span.. 12:28:58 to 12:31:31.  spamd is running with
--max-conn-per-child=25 so i dont think it has anything to do with that.

BEGINS
[root@email spamd]# cat spamd.log2 | grep "\[2351\] config"  | grep retrieving
|wc -l
     23

ENDS
[root@email spamd]# cat spamd.log2 | grep "\[2351\] result"  | wc -l
     23

so the log doesnt indicate any of them never finishing, unless it hung right
after message 23 was fully processed?  I guess i dont know what all goes on
after it sends the 'result' logmsg, it should just go back to read() for input
of #24 right?   The total life of pid 2351 was 6 minutes and 7 seconds, but ps
showed 7 minutes and 57 seconds of CPU usage...  during normal operation, the 25
max-conn-per-child will prevent a spamd child cpu utilization to under 30
seconds... so 7 minutes is very high.

here is the message i received from cron...

child pid 2351 running for  7 min, 57 sec
 -> ALERT: spamd pid 2351 ran for 7 min, 57 sec and was killed!
child pid 9321 running for  0 min, 02 sec
child pid 9490 running for  0 min, 02 sec
child pid 9708 running for  0 min, 00 sec
child pid 9711 running for  0 min, 00 sec

here are the stops and starts on the child processes seen above.

spamd.log:2004-11-15 12:28:54.966903500 logmsg: [17464] server successfully
spawned child process, pid 2351
spamd.log:2004-11-15 12:45:01.061320500 logmsg: [17464] handled cleanup of child
pid 2351

spamd.log:2004-11-15 12:44:20.793041500 logmsg: [17464] server successfully
spawned child process, pid 9321
spamd.log:2004-11-15 12:46:47.325710500 logmsg: [17464] handled cleanup of child
pid 9321

spamd.log:2004-11-15 12:44:39.292863500 logmsg: [17464] server successfully
spawned child process, pid 9490
spamd.log:2004-11-15 12:47:20.977168500 logmsg: [17464] handled cleanup of child
pid 9490
a
pid 9708 is no where to be found?  not sure what the deal is here??  maybe this
is part of the problem, but i dont know how.   its the only thing i can find
that is strange here because if ps says its a spamd child, there should be
something in the logs about it.

[root@email spamd]# grep "pid 9708" *
[root@email spamd]#

spamdlog:2004-11-15 04:00:39.199207500 logmsg: [17464] server successfully
spawned child process, pid 9711
spamdlog:2004-11-15 04:06:08.322575500 logmsg: [17464] handled cleanup of child
pid 9711

so now i dont know what to do!!  i guess i could add some debugging after the
result logmsg and see what its doing.. but now i have to wait for it to happen
again.  fun fun.

Comment 79 Tom Parkison 2004-11-15 11:54:38 UTC

Dallas, have any ideas why when I try to kill those processes, they stick 
around as zombies?

Comment 80 Dallas Engelken 2004-11-15 12:15:42 UTC

are the zombies created only from ones that you kill when they hang, or is a
zombie created from any spamd child you kill?

Comment 81 Tom Parkison 2004-11-15 12:25:48 UTC

> are the zombies created only from ones that you kill when they hang, or is a
zombie created from any spamd child you kill? 

That is the thing, I am not sure where these zombies are occuring.

Wait a second, I think I stumbled upon something here.  SpamD, at least on our 
servers, runs in a way that is linked to Exim.  So, when Exim receives a mail 
message, it locks the MailBox File until SpamAssassin is done scanning the 
mail message and passes it back to Exim.  So, when a SpamAssassin process 
takes a shit and dies improperly, Exim doesn't know about it, thus the MailBox 
files remain locked, thus unable to be read or written to until the zombie is 
killed, aka reboot.

Comment 82 Justin Mason 2004-11-15 14:46:01 UTC

   'Wait a second, I think I stumbled upon something here.  SpamD, at least on our 
servers, runs in a way that is linked to Exim.  So, when Exim receives a mail 
message, it locks the MailBox File until SpamAssassin is done scanning the 
mail message and passes it back to Exim.  So, when a SpamAssassin process 
takes a shit and dies improperly, Exim doesn't know about it, thus the MailBox 
files remain locked, thus unable to be read or written to until the zombie is 
killed, aka reboot.'

I'm not sure what this means.  does this setup require that SA do something in
particular before Exim will unlock the mailbox?  is there a way to deal with
these possible failure cases more resiliently? 

There's a myriad of ways for the SA scanning step to not happen or not complete,
some due to bugs in SA, some due to resource shortages, some due to timeouts,
and some due to sunspots or similar wierdness.  We (obviously) can try to avoid
that, but this is computer science, and nothing's perfect.  I would suggest
using a more resilient system that can cope with that; for example, if the SA
process returns without a result, the glue code should treat the message as
non-spam, deliver it, and unlock the mailbox.  (This is what spamc does, for
example.)

Comment 83 Tom Parkison 2004-11-16 16:01:59 UTC

This time, I executed SpamAssassin with the --debug flag.  I GREPed out the 
November 16th entries out of the /var/log/maillog file and am downloading the 
log files now.

Hopefully, I will spot something in it that will shed some light on this 
situation.

Comment 84 Tom Parkison 2004-11-16 16:25:39 UTC

Dallas Engelken and Justin Mason,

I didn't see much of anything in these log files that sparked my alarm.

It is locate at http://www.toms-world.org/privatedownloads/NOV16_spamd.zip

I will be sending you an email mesage with a username and password to download 
this file.

If you want to take a look at the files, go ahead.

Comment 85 Dallas Engelken 2004-11-17 09:18:14 UTC

27144 just looped.  here are the final log msgs from it.  nothing special.

@40000000419b80df2ecb8f94 logmsg: [27144] calling mail->finish
@40000000419b80df2ecc27ec debug: [27144] finish: called for Message.pm
@40000000419b80df2eebfc5c debug: [27144] finish: completed for Message.pm
@40000000419b815d245907e4 logmsg: [10975] handled cleanup of child pid 27144

looks like the parent tried to clean it up, but the child never went away. 

do you think this could be something to do with signals and running spamd under
daemontools?     i started up spamd from the command line with an strace no it
to see if i can capture yet even more data..

Comment 86 Dallas Engelken 2004-11-17 10:34:48 UTC

i got a copy of the email that hung on child pid 22626 at 11:39 am...  thanks to
full debug and 1gig log files :)

[root@email old]# ls -la *22626*
-rw-------    1 root     root         1385 Nov 17 11:39
spamassassin.22626.IQWf2F.tmp

but as you can see, it scans fine

[root@email old]# cat spamassassin.22626.IQWf2F.tmp | spamc -x 2>&1 >/dev/null
[root@email old]# STATUS=$?; echo $STATUS
0

as does the orig-* email that was dumped to disk from the incoming smtp
connection with is identical to the spamassassin tmp file.

so, we can rule out a specially crafted message for causing this.  (progress?)  :)

here is the debug log, with strace enabled, of that scan....

-------------------
child created..

2004-11-17 11:39:01.466228500 brk(0x8574000)                          = 0x8574000
2004-11-17 11:39:01.466231500 logmsg: [20563] server successfully spawned child
process, pid 22626
2004-11-17 11:39:01.466289500 brk(0x8575000)                          = 0x8575000

it took in 5 check requests..

[root@email spamd]# grep 22626 outfile  | grep retrieving
2004-11-17 11:39:04.487069500 debug: [22626] config: retrieving prefs for
$global from SQL server
2004-11-17 11:39:08.696062500
stat64("/usr/lib/perl5/site_perl/5.6.1/i386-linux/auto/Digest/SHA1/SHA1.bs",
debug: [22626] config: retrieving prefs for $global from SQL server
2004-11-17 11:39:12.844853500 debug: [22626] config: retrieving prefs for
$global from SQL server
2004-11-17 11:39:16.594829500 debug: [22626] config: retrieving prefs for
$global from SQL server
2004-11-17 11:39:21.764975500 brk(0x82cd000debug: [22626] config: retrieving
prefs for $global from SQL server


it processed the first 4, and never finished on #5.

check # 5 started at  
 11:39:21.764975500

it waited on a bayes lock for 7 seconds... finally syncing at 
 2004-11-17 11:39:28.485283500 stat64("../lib/Pod/Text.pmc", debug: [22626]
bayes: synced databases from journal in 7 seconds: 15500 unique entries (19526
total entries)

pyzor timed out in 3 seconds, that takes us to 11:39:32
 2004-11-17 11:39:32.422429500 debug: [22626] pyzor: check timed out after 3 secs.

uribldns lookups complete successfully 
 2004-11-17 11:39:32.479788500 debug: [22626] uridnsbl: queries completed: 4
started: 2

last log entry from check #5 at 
 2004-11-17 11:39:32.503748500 debug: [22626] auto-whitelist: sql-based
connected to DBI:mysql:logs:localhost:3306


then, a couple minutes later, child destroyed (by my kill -9)

2004-11-17 11:41:41.995309500 logmsg: [20563] received child_handler CHLD
2004-11-17 11:41:41.995360500 logmsg: [20563] server hit by SIGCHLD
2004-11-17 11:41:41.995423500 logmsg: [20563] handled cleanup of child pid 22626
2004-11-17 11:41:41.995444500 logmsg: [20563] set SIGCHLD
2004-11-17 11:41:41.995486500 logmsg: [20563] Need to Spawn a new CHILD!!!
2004-11-17 11:41:41.999028500 logmsg: [20563] server successfully spawned child
process, pid 29029
-------------------


so not sure what happended, but this one hung on AWL lookup...   i'll trace
another here shortly as they seem to be hanging quite frequently on this box.

fyi, AWL works fine on all other messages... and there is nothing wrong with the
mysql end of it.

mysql> select count(*) from awl;
+----------+
| count(*) |
+----------+
|   673013 |
+----------+
1 row in set (0.00 sec)

Comment 87 Michael Parker 2004-11-17 12:29:25 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Wed, Nov 17, 2004 at 10:34:49AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> last log entry from check #5 at 
>  2004-11-17 11:39:32.503748500 debug: [22626] auto-whitelist: sql-based
> connected to DBI:mysql:logs:localhost:3306
> 

Interesting,  can you tell me what version of MySQL, DBI and
DBD::mysql you are running.  Anything in the strace from mysql itself?

> 
> then, a couple minutes later, child destroyed (by my kill -9)
> 

Curious, is kill -9 the only way to kill the child?

Michael

Comment 88 Dallas Engelken 2004-11-17 12:37:03 UTC

[root@email spamd]# rpm -qa | grep mysql-server
mysql-server-3.23.58-1.73
[root@email spamd]# rpm -qa | grep DBI
perl-DBI-1.21-1
[root@email spamd]# rpm -qa | grep DBD
perl-DBD-MySQL-1.2219-6

i didnt see much else in the strace that was meaningful.

i have not tried anything other than kill -9 on the child, what would you like
me to send to it next time?

weird thing is, i was having a child hang about every 10 minutes this morning
and been sitting here since 11:30am CST waiting for another one to hang, but
nothing.

its very strange.

Comment 89 Justin Mason 2004-11-17 12:50:52 UTC

hi Dallas - those strace lines don't match the debugs.... are you stracing the
children?  use -f to follow through forks.

also kill -15 is safer than kill -9 for clearing up locks...

Comment 90 Dallas Engelken 2004-11-17 12:59:41 UTC

i was just using
# strace spamd <params>

i will add the -f in and try a kill -15 on the next one.

Comment 91 Justin Mason 2004-11-17 13:07:58 UTC

Dallas btw -- what's the CPU on that box?  is it hyperthreaded?

Comment 92 Dallas Engelken 2004-11-17 13:18:05 UTC

yes sir.. its a big one.  2 cpu, 2gig ram.

  3:16pm  up 22:03,  1 user,  load average: 2.30, 1.43, 1.33
157 processes: 153 sleeping, 3 running, 1 zombie, 0 stopped
CPU0 states: 33.1% user,  9.3% system,  0.0% nice, 56.4% idle
CPU1 states: 14.4% user,  4.4% system,  0.0% nice, 80.0% idle
CPU2 states: 18.2% user,  7.3% system,  0.0% nice, 73.3% idle
CPU3 states: 34.2% user,  6.2% system,  0.0% nice, 58.4% idle
Mem:  2064860K av,  894704K used, 1170156K free,       0K shrd,  121320K buff
Swap:  522104K av,     228K used,  521876K free                  382836K cached

the zombie is my strace through daemontools.. kinda kludgy but it works :)

with strace -f i have to cronjob a custom logrotate ever 15 minutes.. lol

Comment 93 Michael Parker 2004-11-17 14:49:00 UTC

Created attachment 2520 [details]
3.0 Patch File

Here is a patch against 3.0, it fixes the help typo and ups the default child
timeout to 300 secs.

Comment 94 Michael Parker 2004-11-17 14:49:52 UTC

Please review the patch for 3.0, we won't be closing this ticket, because it is
just a work around (and seems a good idea) and doesn't fix the actual bug.

Comment 95 Daniel Quinlan 2004-11-17 15:23:24 UTC

+1

Please fix weird spacing around parens.  :-)

Comment 96 Sebastian Hagedorn 2004-11-18 01:21:11 UTC

Let me repeat my statement from comment #76: this patch does *not* work in all instances because of 
perl 5.8's safe signal handling. According to the docs it would be better to use sigaction SIGALRM 
instead.

Comment 97 Tom Parkison 2004-11-18 07:29:10 UTC

So Sebastian Hagedorn, would this work?

Change...
$SIG{'ALRM'} = sub { die (" tcp timeout ") };
To...
$SIG{'SIGALRM'} = sub { die (" tcp timeout ") };

Change...
$SIG{'ALRM'} = sub { die (" child processing timeout ") };
To...
$SIG{'SIGALRM'} = sub { die (" child processing timeout ") };

Comment 98 Sebastian Hagedorn 2004-11-18 07:31:32 UTC

No, I've actually created a patch using SigAction that I'm currently testing. I'm not yet sure if it's actually 
working ;-) I haven't yet had any timeout messages. I will atach the patch anyway, just to show what I 
mean.

Comment 99 Tom Parkison 2004-11-18 07:36:41 UTC

Yeah, please do so Sebastian Hagedorn, I would like to test it.

Comment 100 Sebastian Hagedorn 2004-11-18 07:39:49 UTC

Created attachment 2526 [details]
Patch using POSIX::SigAction

This patch is basically the same, but uses POSIX::SigAction instead ...

Comment 101 Tom Parkison 2004-11-18 07:44:21 UTC

Will this patch of yours be able to be applied directly to the SpamAssassin
3.0.1 source, or will I have to apply the original patch by Dallas Engelken then
yours?

Comment 102 Sebastian Hagedorn 2004-11-18 07:52:07 UTC

My patch replaces the previous patches, i.e. it can be applied to a fresh 3.0.1

Comment 103 Tom Parkison 2004-11-18 09:26:00 UTC

I can't find the Perl Modules known as POSIX::SigAction.  CPAN doesn't seem to 
know about it.

Comment 104 Sidney Markowitz 2004-11-18 09:41:53 UTC

Should we be using this module or do the equivalent in our code?

http://search.cpan.org/~lbaxter/Sys-SigAction/lib/Sys/SigAction.pm

From that page:

[after some discussion of the different things implemented in different version
of perl]

"Since sigaction() is not fully functional in perl versions less than 5.8, this
module implements equivalent behavior using the standard %SIG array. The version
checking and implementation of the 'right' code is handled by this module, so
the user does not have to write perl version dependant code. The attrs hashref
argument to set_sig_handler() is silently ignored, in perl versions less than
5.8. This module has been tested with perls as old as 5.005 on solaris.

It is hoped that with the use of this module, your signal handling behavior can
be coded in a way that does not change from one perl version to the next, and
that it makes using sigaction() a little easier."

Comment 105 Tom Parkison 2004-11-18 11:24:03 UTC

Sebastian Hagedorn,

I applied your patch.  I am crossing my fingers

Comment 106 Justin Mason 2004-11-18 16:23:37 UTC

Good catch, Sebastian!  POSIX::sigaction will, indeed, solve two problems:

- 1. deal with internal timeouts using SIGALRM overwriting our one

- 2. always use "unsafe" signal handlers, which are essential in this case to
  interrupt all possible hangs, including regexp complexity ones.

however Sys::SigAction does also note that in perl versions prior to 5.8,
POSIX::sigaction "does not work correctly".  So I agree with Sidney, we need to
implement code to do the same thing using %SIG, in our codebase.

in the latter case, #1 will be unavoidable.  but that seems to have worked for
Dallas anyway -- I would surmise because the rules that set alarms (namely
DCC/Pyzor/Razor) are happening late enough that the error condition this
catches, has already happened by that stage if it was going to happen.

I think we need a new patch that does the same thing as Dallas' patch, but using
functions that will switch between use of %SIG and POSIX::sigaction depending on
$^V.  btw, any chance someone with a CLA on file could do this?

Comment 107 Sebastian Hagedorn 2004-11-19 01:20:33 UTC

Well, my POSIX patch doesn't improve things for me. Even though the unsafe signal handling should be 
used, no alarms go off. I have now used Sys::SigAction in the hope that it sets up the signal handler 
better. I have a patch for that, but will wait for my test results before posting it ... if this doesn't work 
either, I don't know what else to try.

Comment 108 Dallas Engelken 2004-11-19 10:56:39 UTC

here is another one today...

child pid 9638 running for  5 min, 29 sec
 -> ALERT: spamd pid 9638 ran for 5 min, 29 sec and was killed!

i have that script running a kill -9 just in case i cant get to the system at
the time.   hopefully i can catch one today and kill -15 it instead.

2004-11-19 12:09:32.391200500 debug: [9638] rules: running header regexp tests;
score so far=8.273
2004-11-19 12:09:32.391202500 debug: [9638] auto-whitelist: sql-based connected
to DBI:mysql:logs:localhost:3306
2004-11-19 12:09:32.724190500  <unfinished ...>
2004-11-19 12:09:32.724220500 [pid 11363] <... read resumed> 0x401f8000, 4096) =
? ERESTARTSYS (To be restarted)
2004-11-19 12:09:32.724240500 [pid 11363] --- SIGALRM (Alarm clock) ---

again here, the last line was the sql connection for AWL.  half a second later
it pulls a sucessful AWL, so who knows....

2004-11-19 12:09:32.806266500 debug: [11363] auto-whitelist: sql-based connected
to DBI:mysql:logs:localhost:3306
2004-11-19 12:09:32.807102500 debug: [11363] auto-whitelist: sql-based
get_addr_entry: found existing entry for fw@westiva.com|ip=8.6
2004-11-19 12:09:32.807149500 debug: [11363] auto-whitelist: sql-based
fw@westiva.com|ip=8.6 scores 33/240.266
2004-11-19 12:09:32.807316500 debug: [11363] auto-whitelist: AWL active,
pre-score: 8.273, autolearn score: 8.273, mean: 7.28078787878788, IP: 8.6.241.121
2004-11-19 12:09:32.808017500 debug: [11363] auto-whitelist: sql-based
add_score: new count: 34, new totscore: 248.539 for fw@westiva.com|ip=8.6
2004-11-19 12:09:32.808254500 debug: [11363] auto-whitelist: sql-based finish:
disconnected from DBI:mysql:logs:localhost:3306
2004-11-19 12:09:32.808405500 debug: [11363] auto-whitelist: post auto-whitelist
score: 7.77689393939394

this is twice i have caught a spamd child that has hung where the last debug
line it sent was the auto-whitelist sql connection.   hopefully i'll see another
one soon.

Comment 109 Michael Parker 2004-11-19 11:31:17 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Fri, Nov 19, 2004 at 10:56:40AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> this is twice i have caught a spamd child that has hung where the last debug
> line it sent was the auto-whitelist sql connection.   hopefully i'll see another
> one soon.
> 

I'm trying to remember, was there any indiciation (besides these
latest two) that AWL was involved?

Could be some sort of deadlock in MySQL.  Is there ever a case when
one spamd has hung that another takes a slightly longer (but not long
enough to trigger the timeout/killer script) amount of time that
usual?

Michael

Comment 110 Dallas Engelken 2004-11-19 11:53:07 UTC

the last 2 hangs were on awl.. the others i'm not exactly sure because i didnt
have full debugging enabled.   i've been running an strace -f with full debug
for the last 3 days and have seen the 3 hangs.  the one from yesterday i jacked
up the log file... i think the last line on it was something about bayes, but
since i lost some of that data, i didnt want to trust that read.

the weird thing is awl processed fine half a second after that child hung. 
there is not much in common here... the time that they hang is around 12pm
(11:39 and 12:09 so far), but nothing else apparent runs then.   mysqld shows no
errors or problems that i can see.   

if i get 1 more that indicated awl, i'll shut that down and see if the problems
go away.  personally i dont think its awl, and i've taken a look at the code
around that dbg and dont see anything that could case it either.

Comment 111 Trent Muldrow 2004-11-20 05:23:11 UTC

Justin, I think you might have hit the nail right on the head with this 
statement:
 
> Dallas btw -- what's the CPU on that box?  is it hyperthreaded?
 
Tom Parkinson and I were chatting last night about some possibilities for this 
problem and I suggested that it might help if I assembled a timeline of the 
problem since we were not experiencing it from day 1.

The timeline showed the problem appearing a few days after the installation of 
an SMP enabled kernel (hyperthreading) on the box and stopped after they 
removed it for a more stable non-SMP kernel.  This went on fine until there was 
a HD crash on this server.  At this point, I believe the SMP enabled kernel may 
be have been reinstalled and no one caught it.  Ever since the system was 
reloaded from that HD crash, the SpamAssassin issue has been hot and heavy.  
Patches have slowed it down and reduced the effects, but it's definitely still 
there.
 
I have just posted a new message in the cPanel forum thread on this issue 
(http://forums.cpanel.net/showthread.php?p=153528#post153528) asking everyone 
that is affected by this issue to post their CPU type, whether it is 
multithreaded or if they have multiple CPU's in their system, and if they are 
running a multithreaded/multiple CPU aware kernel.  I'll ask the same questions 
here:

1. What type of CPU are you running?

2. Is it multi-threaded or do you have multiple CPU's?

3. Are you running a kernel that is using that feature?

My theory is that most everyone that is experiencing this problem will answer 
yes to at least 1 and 2, but we'll see....

Comment 112 Michael Parker 2004-11-20 10:18:42 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Sat, Nov 20, 2004 at 05:23:13AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> 1. What type of CPU are you running?

P4 with Hyperthreading

> 2. Is it multi-threaded or do you have multiple CPU's?

Yes, hyperthreaded

> 3. Are you running a kernel that is using that feature?

Yes,
Linux hoth 2.6.8-24-smp #1 SMP Wed Oct 6 09:16:23 UTC 2004 i686 i686 i386 GNU/Linux

> My theory is that most everyone that is experiencing this problem will answer 
> yes to at least 1 and 2, but we'll see....

I don't fit in this group and I will say that I manage to push this
machine VERY hard, 300 msgs/sec at times.

Michael

Comment 113 Michael Parker 2004-11-20 12:28:48 UTC

I went off and read the CPanel forum discussion.  The more I hear/read from the
CPanel folks having a problem, the more I'm convinced that their problems are
unrelated.

There seems to be no common denominator amoungst those folks, different versions
(including 2.6x versions), different setups, bayes off and on, awl off and on,
different calling methods.

I'm not saying there isn't a problem there, but I don't believe they are
related.  I suggest opening up a seperate bug to handle the issue CPanel folks
are having.

Comment 114 Justin Mason 2004-11-20 15:05:50 UTC

+1 on the idea of splitting off the cpanel discussion -- I agree it sounds like
different issues.  we should at least keep this bug open, solely to work on the
timeout patch.

Comment 115 George N. White III 2004-11-22 09:04:58 UTC

I have seen a few instances where spamd spawned 100's of child processes until
the system (SGI Irix 6.5.22) wasn't able to create new processes.  This tends
to happen when I run fetchmail to bring in a batch of messages from an address
that is in the ghostscript distro (e.g., attracts lots of spam and virus mails).
While this is happening DNS queries (made other machines) often fail, so my 
feeling is that a batch of spam/virus emails will cause a lot of queries for
"new" sites (e.g., not in some cache) and bog down the DNS.  I use procmail,
and I'm not entirely sure where DNS queries occur.  Maybe there should
be separate queues for messages with headers that have addresses not in the cache
so these can be fed thru at a rate that doesn't result in DOS.

Comment 116 Dallas Engelken 2004-11-23 06:13:18 UTC

another hung child this morning... first one in several days.

child pid 29538 running for  6 min, 48 sec
 -> ALERT: spamd pid 29538 ran for 6 min, 48 sec and was killed!
child pid 3662 running for  0 min, 01 sec
child pid 3722 running for  0 min, 01 sec
child pid 3732 running for  0 min, 01 sec
child pid 3764 running for  0 min, 01 sec
child pid 3878 running for  0 min, 02 sec
child pid 4271 running for  0 min, 01 sec
child pid 4435 running for  0 min, 00 sec

last few lines of debug from that child... (mostly ones i added to figure this out)

2004-11-23 07:04:13.982902500 logmsg: [29538] identified spam (15.6/5.0) for
global:0 in 66.5 seconds, 1398 bytes.
2004-11-23 07:04:13.983016500 logmsg: [29538] result: Y 15 -
BAYES_50,HTML_MESSAGE,RCVD_NUMERIC_HELO,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL
scantime=66.5,size=1398,mid=<110121498458931012@localhost>,bayes=0.5035849521974,autolearn=no
2004-11-23 07:04:13.983041500 logmsg: [29538] calling status->finish
2004-11-23 07:04:13.983060500 debug: [29538] generic: call_plugins per_msg_finish
2004-11-23 07:04:13.983094500 debug: [29538] call_plugins: run callback on
per_msg_finish
2004-11-23 07:04:13.983118500 debug: [29538] generic: callback per_msg_finish
started
2004-11-23 07:04:13.983139500 debug: [29538] generic: lets step thorugh the cbpairs
2004-11-23 07:04:13.983160500 debug: [29538] generic: returning
2004-11-23 07:04:13.983180500 debug: [29538] generic: call_plugins
per_msg_finish complete
2004-11-23 07:04:13.984008500 debug: [29538] generic: done deleting self keys
2004-11-23 07:04:13.984049500 logmsg: [29538] calling mail->finish
2004-11-23 07:04:13.984082500 debug: [29538] finish: called for Message.pm
2004-11-23 07:04:13.984600500 debug: [29538] finish: completed for Message.pm
2004-11-23 07:04:13.984676500 logmsg: [29538] spamd: done with check, calling
client->close
2004-11-23 07:04:13.984759500 logmsg: [29538] spamd: returning from accept_a_conn
2004-11-23 07:04:14.307196500  <unfinished ...>
2004-11-23 07:10:00.693383500 write(2, "logmsg: [19792] handled cleanup "...,
51logmsg: [19792] handled cleanup of child pid 29538

as you can see, it hung after completion of a scan, not awl this time.  i think
this is the 2nd or 3rd time i've seen this situation now.

so maybe i need more debug after returning from accept_a_conn() to figure this
out.  really the only things that happen after accept_a_conn() are error
checking, uid/gid switching, and config copying... right?

Comment 117 Dallas Engelken 2004-11-23 14:00:51 UTC

i just did some more work on my timeout patch and was wanting a little feedback.

basically what i am doing is passing the timeout_child to Message->new() in the
$opts and wrote a function in there to calculate run time.   then, any time an
function completes that had an alarm() on it, i replace the alarm(0) with
alarm($self->{msg}->_get_runtime());   

so instead of setting alarm(0) which will override the alarm that wraps the
check() funciton in /usr/bin/spam, it will set the alarm to ($timeout_child -
$runtime), hence keeping the alarm that wraps the check() functional and accurate.

as far as i can tell, there are only a handle of alarmed functions, pyzor,
razor, dcc, and a couple alarms in Dns.pm.

does anyone see anything wrong with this idea before i clean this thing up and
do some testing on it.  without doing this, a check() is only alarmed to
$timeout_child until it hits an alarmed function that sets alarm back to 0. 
this should prevent that.

thanks

Comment 118 Theo Van Dinter 2004-11-23 14:03:12 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Tue, Nov 23, 2004 at 02:00:52PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> basically what i am doing is passing the timeout_child to Message->new() in the
> $opts and wrote a function in there to calculate run time.   then, any time an
> function completes that had an alarm() on it, i replace the alarm(0) with
> alarm($self->{msg}->_get_runtime());   

-1

Message() should have nothing to do with timeouts.

Comment 119 Dallas Engelken 2004-11-23 14:03:35 UTC

wow, my fingers are frozen from cleaning out the gutters and hanging xmas
lights...   funciton = function, handle = handful, and  /usr/bin/spam =
/usr/bin/spamd 

haha.. proofread? never :)

Comment 120 Theo Van Dinter 2004-11-23 14:06:19 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Tue, Nov 23, 2004 at 05:03:09PM -0500, Theo Van Dinter wrote:
> Message() should have nothing to do with timeouts.

Err, that's M::SA::Message, of course. <G>

Comment 121 Dallas Engelken 2004-11-23 15:01:54 UTC

okay, help me understand the flow here in spamd, this is what i see..

1) spamd M::SA->new 
2) spamd spawn children
3) spamd accept_a_conn()
4) spamd check()
5) M::SA::parse() (new Message)
6) M::SA::check() (new PerMsgStatus)
7) M::SA::PMS::check()
8) do all the tests

so, the optimal place to put a runtime calculator would be where?  start_time on
that particular scan needs to be passed to either Message.pm or PerMsgStatus.pm
for storage to refer to it  before the PMS::check() is ran.   it cant be stored
in $self->{main} unless it was something like
$self->{main}->{pid_of_child}->{start_time} and let that child update the
start_time everytime it starts a new message.  However, i think it would be
better if it were stored lower, in something that was destroyed after the check
was complete.

all placement considerations aside, do you think this is a reasonalbe idea to
set the alarm time back to "timeout_child - runtime" instead of zeroing it out?

Comment 122 Dallas Engelken 2004-11-23 15:26:47 UTC

would this method be better..  in spamd
  my $status = $spamtest->check($mail);
becomes
  my $status = $spamtest->check($mail,time());

and then in M::SA::check() actually pass the start_time into the PerMsgStatus
$opts?  PerMsgStatus::_get_runtime() could compare $self->{start_time} against
$self->{main}->{timeout_child} which is set via M::SA->new $opts.

if that was the case, how would i call _get_runtime() in M::SA::PMS?  is there 
something similar like $self->{main}, $self->{msg}, or $self->{conf} to refer to
PerMsgStatus?  or do you just have to make a full call to it
M::SA::PerMsgStatus::_get_runtime() from within Razor2.pm, SPF.pm, and other
places that might need to use it?

Comment 123 Justin Mason 2004-11-23 15:37:46 UTC

'it cant be stored
in $self->{main} unless it was something like
$self->{main}->{pid_of_child}->{start_time}'

actually, no, it's fine to do that -- each M:SA object is single-threaded.

but, I've just noticed something in perldoc -f alarm:

               Only one timer may be counting at once.  Each call disables the
               previous timer, and an argument of 0 may be supplied to cancel
               the previous timer without starting a new one.  The returned
               value is the amount of time remaining on the previous timer.

so IMO it should be simpler to just fix the helper-app timeout code to save any
previous timers, and restore them afterwards instead of using "alarm 0", like so:

   my $oldalarm = alarm($timeout);
   ....timeout-sensitive code...
   alarm $oldalarm;  #timeout set back to what it was

Comment 124 Sidney Markowitz 2004-11-23 16:11:29 UTC

Alarm::Queued and Alarm::Concurrent are two classes to support multiple alarms,
in case the code you would need to add is complex enough to justify another
dependency.

Comment 125 Dallas Engelken 2004-11-24 06:47:55 UTC

justin

okay, but we still run into a similar deal here.  the alarm that encompasses
check() in spamd is what we would want to store as $oldalarm.   and in order to
restore $oldalarm we would have to get it passed into PerMsgStatus, since
$oldalarm would be set per-child, per-check.  

'actually, no, it's fine to do that -- each M:SA object is single-threaded.' 

but the spamd daemon only creates one instance of M::SA->new(), so storing
start_time or $oldalarm or anything would not be child (per-check) specific,
since M::SA->new() is not called per accept_a_conn().   or maybe i'm missing
something here?   The only things i see called per-check is
M::SA::Message->new() and M::SA::PerMsgStatus->new(), so start_time or $oldalarm
would need to be passed in there somehow to have access to it.

or i guess instead of alarming spamd with timeout_child, it could be moved down
to M::SA::check() instead.   This would help people that dont use spamd take
advantage of running command line spamassassin with a --timeout-child also.

d

d

Comment 126 Dallas Engelken 2004-11-24 07:00:49 UTC

okay, after adding additional debug at the cleanup stages, it looks like it
hangs here on copy_config().  and that makes sense because the alarm on check()
is shut off at that point, so the child spins forever chewing up CPU time.

i added the following logmsg's, so it should have said "lets undef current_user"
immediately after it finished copy_config().

        logmsg("calling copy_config");
        $spamtest->copy_config(\%conf_backup, undef) ||
          die "error returned from copy_config, no Storable module?\n";
      }
      logmsg("lets undef current_user");
      undef $current_user;


here is the debug showing where it stopped.


2004-11-24 06:00:46.479088500 debug: [21383] generic: returning
2004-11-24 06:00:46.479090500 debug: [21383] generic: done with call_plugin
check_end
2004-11-24 06:00:46.483242500 logmsg: [21383] clean message (-0.8/5.0) for
$global:0 in 4.5 seconds, 1521 bytes.
2004-11-24 06:00:46.483341500 logmsg: [21383] result: .  0 -
AWL,BAYES_05,HTML_10_20,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,NO_REAL_NAME
scantime=4.5,size=1521,mid=<13200493.1101297640500.JavaMail.webspher@ccwest-obpapp0>,bayes=0.015070754421061,autolearn=no
2004-11-24 06:00:46.483364500 logmsg: [21383] calling status->finish
2004-11-24 06:00:46.483384500 debug: [21383] generic: call_plugins per_msg_finish
2004-11-24 06:00:46.483404500 debug: [21383] call_plugins: run callback on
per_msg_finish
2004-11-24 06:00:46.483424500 debug: [21383] generic: callback per_msg_finish
started
2004-11-24 06:00:46.483426500 debug: [21383] generic: lets step thorugh the cbpairs
2004-11-24 06:00:46.483445500 debug: [21383] generic: returning
2004-11-24 06:00:46.483464500 debug: [21383] generic: call_plugins
per_msg_finish complete
2004-11-24 06:00:46.484231500 debug: [21383] generic: done deleting self keys
2004-11-24 06:00:46.484253500 logmsg: [21383] calling mail->finish
2004-11-24 06:00:46.484288500 debug: [21383] finish: called for Message.pm
2004-11-24 06:00:46.484602500 debug: [21383] finish: completed for Message.pm
2004-11-24 06:00:46.484708500 logmsg: [21383] spamd: done with check, calling
client->close
2004-11-24 06:00:46.484827500 logmsg: [21383] spamd: returning from accept_a_conn
2004-11-24 06:00:46.484876500 logmsg: [21383] returned, evalret is 1
2004-11-24 06:00:46.484897500 logmsg: [21383] do we need to switch uid/gid back?
2004-11-24 06:00:46.484918500 logmsg: [21383] lets restore the config
2004-11-24 06:00:46.484959500 logmsg: [21383] while config in backup, username root
2004-11-24 06:00:46.485000500 logmsg: [21383] while config in backup,
learn_to_journal 1
2004-11-24 06:00:46.485019500 logmsg: [21383] calling copy_config


2004-11-24 06:05:01.262876500 logmsg: [8814] received child_handler CHLD
2004-11-24 06:05:01.262949500 logmsg: [8814] server hit by SIGCHLD
2004-11-24 06:05:01.263009500 logmsg: [8814] handled cleanup of child pid 21383
2004-11-24 06:05:01.263035500 logmsg: [8814] set SIGCHLD
2004-11-24 06:05:01.263108500 logmsg: [8814] Need to Spawn a new CHILD!!!
2004-11-24 06:05:01.268627500 logmsg: [8814] server successfully spawned child
process, pid 25546

so i guess now i get to dbg out sub copy_config.  :)

Comment 127 Michael Parker 2004-11-24 07:38:42 UTC

Subject: Re:  [review] spamd parent stops accepting requests

> 
> so i guess now i get to dbg out sub copy_config.  :)
> 

What version of Storable are you running?

Michael

Comment 128 Dallas Engelken 2004-11-24 07:59:23 UTC

[root@dev perlmod]# ls -la Storable-2.08.tar.gz
-rw-r--r--    1 dallase  dallase     97219 Sep  5  2003 Storable-2.08.tar.gz

[root@mailgw root]# grep VERSION /usr/lib/perl5/5.6.1/i386-linux/Storable.pm
use vars qw($canonical $forgive_me $VERSION);
$VERSION = '2.08';

d

Comment 129 Justin Mason 2004-11-24 10:05:01 UTC

Subject: Re:  [review] spamd parent stops accepting requests 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> okay, but we still run into a similar deal here.  the alarm that
> encompasses check() in spamd is what we would want to store as
> $oldalarm.   and in order to restore $oldalarm we would have to get it
> passed into PerMsgStatus, since $oldalarm would be set per-child,
> per-check.  

No -- it doesn't need to be passed anywhere.   The key point: the alarm
doco indicates that when you call alarm() again, it'll tell you how much
time is left -- so *that value* can be used to reinstate the alarm again.
no other var is necessary.

I'll go into more detail with some code snippets.

in accept_a_conn():

     eval {
       $SIG{'ALRM'} = sub { die (" child processing timeout ") };
       alarm $timeout_child if ($timeout_child);
       check( $1, $2, $start, $remote_hostname, $remote_hostaddr );
     };
     alarm 0;

(same as normal).
in dcc_lookup(), and other fns that use alarm() themselves:

+ my $oldalarm;
  eval {
    local $SIG{ALRM} = sub { die "__alarm__\n" };
    local $SIG{PIPE} = sub { die "__brokenpipe__\n" };

+   $oldalarm = alarm($timeout);

    # Note: not really tainted, these both come from system conf file.
    my $path = Mail::SpamAssassin::Util::untaint_file_path ($self->{conf}->{dcc_path});
    [... dcc checking]

    dbg("dcc: got response: $response");

+   alarm $oldalarm;
    $self->cleanup_kids($pid);
  };

+ alarm $oldalarm;


so IOW the setting of $timeout_child doesn't have go beyond spamd; the
alarm() API allows us to take care of it everywhere else, because alarm()
itself will return the details we need.

Note that if no alarm is already set, ie. if timeout-child wasn't used,
"$oldalarm = alarm($timeout);" will still do the right thing as $oldalarm
will be 0.



anyway, just to deal with the other q's:

> 'actually, no, it's fine to do that -- each M:SA object is
> single-threaded.' 
> 
> but the spamd daemon only creates one instance of M::SA->new(), so
> storing start_time or $oldalarm or anything would not be child
> (per-check) specific, since M::SA->new() is not called per
> accept_a_conn().   or maybe i'm missing something here?   The only
> things i see called per-check is M::SA::Message->new() and
> M::SA::PerMsgStatus->new(), so start_time or $oldalarm would need to be
> passed in there somehow to have access to it.

you can just add a var on the M:Sa object before
check() starts --

    $spamtest->{some_var} = "blah";

M:Sa objects are just hashes.

> or i guess instead of alarming spamd with timeout_child, it could be
> moved down to M::SA::check() instead.   This would help people that dont
> use spamd take advantage of running command line spamassassin with a
> --timeout-child also.

I think the spamd alarm is cleaner. see above.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBpM0aMJF5cimLx9ARAsKHAJ0a6Ae7EtZmmseAXVsihztgCPOmxQCgrBuy
va6EKlAixASEUXQ8/9IdpOQ=
=qwXu
-----END PGP SIGNATURE-----

Comment 130 Tom Parkison 2004-12-03 12:32:33 UTC

Why hasn't there been any activity as of the last week?

Comment 131 Michael Parker 2004-12-03 12:40:06 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Fri, Dec 03, 2004 at 12:32:34PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From tom@rochen.com  2004-12-03 12:32 -------
> Why hasn't there been any activity as of the last week?

Two things:

1) It looks like there might be a better code related solution to the
   whole timeout issue, so now we need a new patch.

2) I firmly believe that the problems you describe have nothing to do
   with this particular bug.  I suggested that you open another bug to
   track that issue.

Michael

Comment 132 Daniel Quinlan 2004-12-03 12:41:37 UTC

> Why hasn't there been any activity as of the last week?

Because nobody has worked on it in the last week.

Comment 133 Justin Mason 2004-12-07 18:03:34 UTC

alright -- I've put in a bit of legwork on this.   here's a test
script that exercises an evil regexp exponential-time attack, using
a particularly awful regexp and piece of data, and then attempts to
use various forms of signal handler to deal with it.

as far as I know, it should now work on all versions of perl, with
and without PERL_SIGNALS=unsafe -- in other words, it works around
the issue entirely!

but if you have really odd versions of perl (ie. != 5.6.1 or 5.8.4),
please run it and, if the output differs from this:

  using sigaction         [or "using SIG"]
  entering re match
  eval caught: got sigalrm successfully
  done re match in 2 secs

then post the output.

  #!/usr/bin/perl -w

  use strict; trap_sigalrm (\&got_alrm); alarm 2;
  my $start = time; print "entering re match\n";

  my $text = (("o" x 9999) . "x") x 999;
  eval { $text =~ /o*xo*y/; };

  if ($@) { warn "eval caught: $@"; }
  alarm 0;

  my $end = time; print "done re match in ".($end - $start)." secs\n"; exit;

  sub got_alrm { die "got sigalrm successfully\n"; }

  sub trap_sigalrm {
    my ($handler) = @_;

    if ($^V lt v5.8.0) {
      print "using SIG\n";
      $SIG{ALRM} = $handler;
    } else {
      print "using sigaction\n";
      use POSIX qw();
      POSIX::sigaction POSIX::SIGALRM(), new POSIX::SigAction $handler;
    }
  }

Comment 134 Justin Mason 2004-12-07 19:21:46 UTC

Created attachment 2553 [details]
patch using sigaction and/or %SIG as appropriate

ok, this patch:

- uses POSIX::sigaction on perl > 5.8.0, and %SIG on earlier perls, so covers
all versions

- handles saving "higher-level" alarms in the lesser alarms, internally, so the
higher-level alarm keeps running after they're completed

- fixes some minor things; e.g. all the stuff that looks at $@ should really
save a copy of that before calling other functions, in case future perl
versions change the semantics of $@ to be more like errno

Comment 135 Daryl C. W. O'Shea 2004-12-07 20:40:35 UTC

Subject: Re:  [review] spamd parent stops accepting requests

> ------- Additional Comments From jm@jmason.org  2004-12-07 18:03
> but if you have really odd versions of perl (ie. != 5.6.1 or 5.8.4),
> please run it and, if the output differs from this:

Not really odd, but 5.8.0, 5.8.3 and 5.8.5 (all on RedHat 9) work.

Comment 136 Kelsey Cummings 2004-12-07 22:07:31 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Tue, Dec 07, 2004 at 06:03:35PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> http://bugzilla.spamassassin.org/show_bug.cgi?id=3828

perl 5.005_03, FreeBSD 4.6-STABLE, doesn't appear to have $^V defined.

But other than that and some missing quotes to make strict happy,

using SIG
entering re match
eval caught: got sigalrm successfully
done re match in 2 secs

perl 5.8.2, OpenBSD 3.5
using sigaction
entering re match
eval caught: got sigalrm successfully
done re match in 2 secs

Comment 137 Dallas Engelken 2004-12-08 06:49:53 UTC

perl 5.6.1 on redhat 7.3 works.

using SIG
entering re match
eval caught: got sigalrm successfully
done re match in 2 secs

Comment 138 Sidney Markowitz 2004-12-08 06:59:59 UTC

perl 5.8.5 on Cygwin works.

Win32 perl doesn't do signals (and so doesn't do spamd anyway).

Comment 139 Dallas Engelken 2004-12-08 11:10:50 UTC

just had a child hang on another box.. took up 7+ min of cpu before it was
killed.  here is the last 2 dbg mesgs from the log for pid 29170.

2004-12-08 12:31:52.653883500 debug: [29170] check:
subtests=__COMMENT_EXISTS,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__HAS_MSGID,__HAS_SUBJECT,__MIME_HTML,__MIME_QP,__MIME_VERSION,__MSGID_OK_DIGITS,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML
2004-12-08 12:31:52.653926500 debug: [29170] call_plugins: run callback on check_end

i added the 'run callback' dbg message in SpamAssassin.pm when i was trying to
get deeper into the code to find where it was hanging.   the common thing i find
among all children that hang like this is that it appears to happen in
PluginHandler::callback()...   i'm going to dbg that function out to see if that
is where its hanging up.   it has to be somewhere within the foreach loops or
the call to &$methodref i would guess.

    foreach my $plugin (@{$self->{plugins}}) {

    foreach my $cbpair (@{$self->{cbs}->{$subname}}) {
 
    $ret = &$methodref ($plugin, @_);

if it made it through those, it should have gave me a logmsg() from spamd.  a
normal message process currently debugs for me like this...

2004-12-08 12:29:37.523345500 debug: [29170] check:
subtests=__COMMENT_EXISTS,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__HAS_MSGID,__HAS_SUBJECT,__MIME_BASE64,__MIME_HTML,__MIME_VERSION,__MSGID_OK_DIGITS,__MSGID_OK_HOST,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_CENTER,__TAG_EXISTS_HTML
2004-12-08 12:29:37.523390500 debug: [29170] call_plugins: run callback on check_end
2004-12-08 12:29:37.523392500 logmsg: [29170] clean message (1.7/7.0) for
nuthut@localhost:0 in 3.0 seconds, 18052 bytes.
2004-12-08 12:29:37.523395500 logmsg: [29170] result: .  1 -
AWL,DCC_CHECK,HTML_IMAGE_RATIO_04,HTML_MESSAGE,HTML_TAG_EXIST_TBODY,MIME_BASE64_TEXT,NO_REAL_NAME
scantime=3.0,size=18052,mid=<110253057458929194@localhost>,autolearn=disabled
2004-12-08 12:29:37.523438500 logmsg: [29170] calling status->finish
2004-12-08 12:29:37.523440500 debug: [29170] generic: call_plugins per_msg_finish
2004-12-08 12:29:37.523442500 debug: [29170] call_plugins: run callback on
per_msg_finish
2004-12-08 12:29:37.523461500 debug: [29170] generic: call_plugins
per_msg_finish complete
2004-12-08 12:29:37.523463500 debug: [29170] generic: done deleting self keys
2004-12-08 12:29:37.523465500 logmsg: [29170] calling mail->finish
2004-12-08 12:29:37.523467500 debug: [29170] finish: called for Message.pm
2004-12-08 12:29:37.523469500 debug: [29170] finish: completed for Message.pm

so after a check_end, it should give me the result and start cleaning up.  man i
wish i could reproduce this and not wait for it to happen :)

Comment 140 Theo Van Dinter 2004-12-08 11:13:36 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Wed, Dec 08, 2004 at 11:10:51AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 2004-12-08 12:31:52.653926500 debug: [29170] call_plugins: run callback on check_end

Hrm.  Just to check, are you using any plugins and do check_end?  None of the
standard ones do.

Comment 141 Dallas Engelken 2004-12-08 11:25:25 UTC

nah, its pretty much stock, even the rules.  init.pre has the basic ones

loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
loadplugin Mail::SpamAssassin::Plugin::Hashcash
loadplugin Mail::SpamAssassin::Plugin::SPF
loadplugin Mail::SpamAssassin::Plugin::Razor2

Comment 142 Dallas Engelken 2004-12-08 11:27:56 UTC

PMS::check() calls a 'check_end', line 252 in v3.0.1

  $self->{main}->call_plugins ("check_end", { permsgstatus => $self });

Comment 143 Theo Van Dinter 2004-12-08 11:56:13 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Wed, Dec 08, 2004 at 11:27:57AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> PMS::check() calls a 'check_end', line 252 in v3.0.1
>   $self->{main}->call_plugins ("check_end", { permsgstatus => $self });

Yeah, it gets called, I was just wondering if the problem was a plugin's
check_end function was getting locked up for some reason.

If there are no plugins with that function, in theory, the problem is actually
in the SA call_plugins() area.

Comment 144 Justin Mason 2004-12-08 13:03:39 UTC

there's quite a bit of code between check_end and that next logmsg.  Notably this:

    # Build the message to send back and measure it
    my $msg_resp        = $status->rewrite_mail();
    my $msg_resp_length = length($msg_resp);
    if ( $version >= 1.3 )    # Spamc protocol 1.3 means multi hdrs are OK
    {
      syswrite( $client, "SPAMD/1.1 $resphash{$resp} $resp\r\n" .
        "Content-length: $msg_resp_length\r\n" . $spamhdr . "\r\n\r\n" .
        $msg_resp );
    }
    elsif (
      $version >= 1.2 )    # Spamc protocol 1.2 means it accepts content-length
    {
      syswrite( $client, "SPAMD/1.1 $resphash{$resp} $resp\r\n" .
        "Content-length: $msg_resp_length\r\n\r\n" . $msg_resp );
    }
    else                   # Earlier than 1.2 didn't accept content-length
    {
      syswrite( $client, "SPAMD/1.0 $resphash{$resp} $resp\r\n" . $msg_resp );
    }


IOW, syswrite() calls to send data to the client.  I think that's more likely,
since it involves socket I/O.

Comment 145 Dallas Engelken 2004-12-08 13:15:38 UTC

k.. so i'll add some more logmsg() calls in between check_end and the next
logmsg so we can pinpoint this.   next time i happens we should be really close.

Comment 146 Justin Mason 2004-12-08 14:35:17 UTC

Created attachment 2558 [details]
new patch using $] instead of $^V

a new version of patch 2553 using $] to support 5.005 (not that we do anyway,
but it's easy enough ;)

Comment 147 Malte S. Stretz 2004-12-09 08:39:51 UTC

/me smells code duplication here :~)  But this can be refactored in trunk, the 
patch itself looks good from the first glance, I'll apply it to my local spamd 
and keep that one running with the patch.

Comment 148 Tom Parkison 2004-12-10 13:11:18 UTC

root@edinburgh [~/spamassassin/Mail-SpamAssassin-3.0.1]# patch -p0<spamd.patch
patching file lib/Mail/SpamAssassin/Dns.pm
Hunk #1 succeeded at 668 with fuzz 1 (offset 188 lines).
Hunk #2 FAILED at 711.
Hunk #4 FAILED at 643.
Hunk #5 succeeded at 929 (offset 188 lines).
Hunk #6 FAILED at 965.
3 out of 6 hunks FAILED -- saving rejects to file 
lib/Mail/SpamAssassin/Dns.pm.rej
patching file lib/Mail/SpamAssassin/Reporter.pm
Hunk #1 succeeded at 283 (offset 133 lines).
Hunk #2 FAILED at 305.
Hunk #4 FAILED at 228.
2 out of 4 hunks FAILED -- saving rejects to file 
lib/Mail/SpamAssassin/Reporter.pm.rej
patching file lib/Mail/SpamAssassin/Locker/Flock.pm
Hunk #1 succeeded at 63 with fuzz 1 (offset -4 lines).
Hunk #2 succeeded at 76 with fuzz 2.
Hunk #3 FAILED at 97.
1 out of 3 hunks FAILED -- saving rejects to file 
lib/Mail/SpamAssassin/Locker/Flock.pm.rej
patching file lib/Mail/SpamAssassin/Util.pm
Hunk #1 succeeded at 1089 with fuzz 2 (offset -40 lines).
patching file lib/Mail/SpamAssassin/Plugin/SPF.pm
Hunk #1 FAILED at 222.
1 out of 1 hunk FAILED -- saving rejects to file 
lib/Mail/SpamAssassin/Plugin/SPF.pm.rej
can't find file to patch at input line 362
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: lib/Mail/SpamAssassin/Plugin/Razor2.pm
|===================================================================
|--- lib/Mail/SpamAssassin/Plugin/Razor2.pm     (revision 111184)
|+++ lib/Mail/SpamAssassin/Plugin/Razor2.pm     (working copy)
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
6 out of 6 hunks ignored
patching file spamd/spamd.raw
Hunk #1 succeeded at 79 with fuzz 1 (offset -3 lines).
Hunk #2 succeeded at 180 (offset -10 lines).
Hunk #3 succeeded at 508 (offset -10 lines).
Hunk #4 succeeded at 527 with fuzz 2 (offset -10 lines).
Hunk #5 FAILED at 559.
Hunk #6 FAILED at 707.
Hunk #7 succeeded at 983 (offset -81 lines).
Hunk #8 succeeded at 1095 (offset -10 lines).
Hunk #9 succeeded at 1392 (offset -84 lines).
Hunk #10 succeeded at 2364 with fuzz 2 (offset -61 lines).
2 out of 10 hunks FAILED -- saving rejects to file spamd/spamd.raw.rej

What is going on?

Comment 149 Theo Van Dinter 2004-12-10 13:16:17 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Fri, Dec 10, 2004 at 01:11:19PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> |Index: lib/Mail/SpamAssassin/Plugin/Razor2.pm
> 
> What is going on?

You're applying a patch meant for trunk (3.1.0) against 3.0.1, which won't
work in this case.

Comment 150 Tom Parkison 2004-12-13 08:32:20 UTC

Which "trunk" version do you recommend?

Comment 151 Theo Van Dinter 2004-12-13 09:01:43 UTC

Subject: Re:  [review] spamd parent stops accepting requests

On Mon, Dec 13, 2004 at 08:32:23AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Which "trunk" version do you recommend?

There is only 1 trunk. :)

Comment 152 Tom Parkison 2004-12-13 13:21:02 UTC

HUH!  There are four trunk versions available for downloading...
http://cvs.apache.org/snapshots/spamassassin/

"Nightly snapshots of the latest SVN trunk code can be downloaded from 
cvs.apache.org. These are built nightly from the SVN repository, and may be 
unstable!"

There you go, "trunk" versions.

Comment 153 Justin Mason 2004-12-13 19:08:07 UTC

we really need user confirmation that the patch helps before we can mark this as
[review], so taking it out of that state

Comment 154 Dallas Engelken 2004-12-14 12:56:16 UTC

well, i've finally debugged this thing completely down as far as i can go
without jumping into Storable further.

looks like it hangs on a Storable::dclone() call in SpamAssassin.pm..  not sure
if it was the first dclone() or second in the elsif, i can debug that further if
needed, but it'll be another few days before it happens again i believe.

2004-12-14 13:06:47.896133500 debug: [30830] generic: while k=bayes_sql_username
and v= and i= in source
2004-12-14 13:06:47.896154500 debug: [30830] generic: while k=whitelist_to and
v=HASH(0x937f5fc) and i=HASH in source
2004-12-14 13:06:47.896159500 debug: [30830] generic: calling dclone of v
2004-12-14 13:06:47.896248500 debug: [30830] generic: while
k=hashcash_doublespend_path and v=__userstate__/hashcash_seen and i= in source
2004-12-14 13:06:47.896316500 debug: [30830] generic: while k=whitelist_from and
v=HASH(0x9345588) and i=HASH in source
2004-12-14 13:06:47.896321500 debug: [30830] generic: calling dclone of v
2004-12-14 13:06:47.896424500 debug: [30830] generic: while k=pyzor_timeout and
v=3 and i= in source
2004-12-14 13:06:47.896448500 debug: [30830] generic: while
k=hashcash_doublespend_file_mode and v=0700 and i= in source
2004-12-14 13:06:47.896469500 debug: [30830] generic: while k=lock_method and v=
and i= in source
2004-12-14 13:06:47.896490500 debug: [30830] generic: while k=bayes_file_mode
and v=0700 and i= in source
2004-12-14 13:06:47.896511500 debug: [30830] generic: while
k=user_scores_ldap_password and v= and i= in source
2004-12-14 13:06:47.896515500 debug: [30830] generic: while k=pyzor_max and v=5
and i= in source
2004-12-14 13:06:47.896535500 debug: [30830] generic: while k=tflags and
v=HASH(0x9346be4) and i=HASH in source
2004-12-14 13:06:47.896558500 debug: [30830] generic: calling dclone
2004-12-14 13:15:00.400402500 logmsg: [1222] handled cleanup of child pid 30830

Comment 155 Dallas Engelken 2004-12-14 20:41:59 UTC

okay, exact same loop point again...  its happening in the dclone call while
k=tflags.


2004-12-14 18:54:08.281807500 debug: [2650] generic: while k=whitelist_from and
v=HASH(0x934ce00) and i=HASH in source
2004-12-14 18:54:08.281875500 debug: [2650] generic: while k=pyzor_timeout and
v=3 and i= in source
2004-12-14 18:54:08.281899500 debug: [2650] generic: while
k=hashcash_doublespend_file_mode and v=0700 and i= in source
2004-12-14 18:54:08.281902500 debug: [2650] generic: while k=lock_method and v=
and i= in source
2004-12-14 18:54:08.281926500 debug: [2650] generic: while k=bayes_file_mode and
v=0700 and i= in source
2004-12-14 18:54:08.281948500 debug: [2650] generic: while
k=user_scores_ldap_password and v= and i= in source
2004-12-14 18:54:08.281951500 debug: [2650] generic: while k=pyzor_max and v=5
and i= in source
2004-12-14 18:54:08.282002500 debug: [2650] generic: while k=tflags and
v=HASH(0x934e074) and i=HASH in source
2004-12-14 18:54:08.282002700 debug: [2650] generic: calling dclone in elsif
2004-12-14 19:00:01.453042500 logmsg: [5095] handled cleanup of child pid 2650


what can i do from here?

Comment 156 Justin Mason 2004-12-14 22:04:13 UTC

ok, that makes sense -- Storable is a recent addition to spamd's operation, and
it's mostly (if not all) XS code.

Dallas -- I don't have tuits right now, but could you try wrapping the spamd
call to copy_config (around line 1000) in an alarm?  it *may* be able to catch
that.  if SIGALRM goes off after 10 secs without copy_config() having completed,
it's in trouble -- and the child can just exit early, and the master spamd will
respawn a new one.  that should work around it until we can eventually ditch the
use of Storable.

Comment 157 Tom Schulz 2004-12-15 08:11:51 UTC

What version of Storable do you have?  Most likely the latest one, but I
thought I would ask anyway.

Comment 158 Dallas Engelken 2004-12-15 08:16:04 UTC

someone didnt read the whole bug :)  see comment #128

Comment 159 Tom Schulz 2004-12-15 10:20:55 UTC

I looked for that, but there are too many comments!  I have Storable-2.13
installed here.  It might be worthwhile to try a newer version.

Comment 160 Justin Mason 2004-12-15 14:31:49 UTC

Created attachment 2566 [details]
wrap copy_config() call in an alarm

ok, here's a patch that does what I suggested.	if the copy_config() call takes
over 10 seconds (ie. it's hung), the child logs a message and exits
immediately, so the master can respawn a new one.

Comment 161 Daniel Quinlan 2004-12-15 16:06:50 UTC

Moving to 3.0.3 milestone.  The bug is no longer in review status and we
should get 3.0.2 out soon.

Comment 162 Justin Mason 2004-12-15 16:24:20 UTC

hmm... not sure I like that.   I think we may have a fix here, and I'd like to
get it in if it works...

(I'd be fine with leaving it to .3, though, if it doesn't.)

Comment 163 Dallas Engelken 2004-12-20 07:05:04 UTC

215k calls to copy_config without one loop...  that tells me Storable 2.13 is
most likely the fix, as the alarm surrounding copy_config() has never triggered
since.

fyi, i had that function writing a log file on success/fails... 

[root@email spamd]# grep exit copy_config.log
[root@email spamd]# wc -l < copy_config.log
 215706

can other that experience this please post your Storable version, and then try
updating if you are not on 2.13?

leaving copy_config() alarm'd is probably a good thing just in case people dont
have the latest storable running.  or i guess 2.13 could be required in the
Makefile?

d

Comment 164 Justin Mason 2005-01-12 19:07:12 UTC

ok, marking this closed.  the resolution: ensure the version of Storable is > 2.13.

Comment 165 Loren Wilton 2005-01-13 23:27:18 UTC

Subject: Re:  spamd parent stops accepting requests

> ok, marking this closed.  the resolution: ensure the version of Storable
is > 2.13.

I hope that Storable was part of the correct-version-checking patch... :-)

Comment 166 Justin Mason 2005-01-14 10:27:24 UTC

good point Loren ;)   r125194.

btw, when I said versions > 2.13, I mean versions >= 2.13.  oops.

Comment 167 Przemyslaw Peron 2005-01-15 16:47:36 UTC

Storable $VERSION = '2.13' here and I'm still experiencing this problem. :(

Comment 168 Tom Parkison 2005-02-10 17:57:17 UTC

Ok, I have another problem.

I am running a "snapshot" build of SpamAssassin and it has solved a few issues,
specifically, the mailbox locks and the SMTP 421: Message Abandoned error.  We
have been able to go several days without the issue.

Unfortunately, there is a new problem.  After a few days, SpamAssassin forgets
how to filter mail.  It is as if SpamAssassin no longer knows how to kill spam.
 No tests are done, nothing but the SpamAssassin headers are added saying that
no tests were done to the mail message.

Killed one bug, another one surfaced.

Comment 169 Justin Mason 2005-02-10 18:19:44 UTC

Tom, please open a separate bug for that issue -- it's important that issues are
kept separate.

(also, when you do so, be sure to attach a sample of what a problematic message
looks like after scanning, with all headers. and note how exactly you're running
SA -- through amavisd? MailScanner? from procmail directly? etc.)

Comment 170 Tim Dijkstra 2005-03-17 05:58:05 UTC

I think I'm seeing the same problems as other people in this report. Every once
in a while I find my mail-system in a state where netstat show a lot lines like

tcp      755      0 localhost:spamd         localhost:36353  CLOSE_WAIT
-

Also spamd doesn't except anymore new connections. Note that this kind of
behaviour also happens with sockets, also with only local tests (-L). 

I tried with Storable-2.12/2.13 that doesn't help. The timeout patch does help,
I haven't seen any blocks like this anymore (in the limited testing period that
is).