Bug 6304 - [review] spamd is spawning and killing processes too often
Summary: [review] spamd is spawning and killing processes too often
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.3.0
Hardware: Other All
: P3 normal
Target Milestone: 3.3.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: ready to go (with fixes)
Keywords:
Depends on:
Blocks:
 
Reported: 2010-01-25 07:29 UTC by Kevin A. McGrail
Modified: 2010-04-05 14:36 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
patch for spamd adjustments to info level + small fix to Makefile.PL text/plain None Kevin A. McGrail [HasCLA]
patch for spamd adjustments to info level + small fix to Makefile.PL - Changed one info call to use %s instead of $pid by accident patch None Kevin A. McGrail [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin A. McGrail 2010-01-25 07:29:15 UTC
Since moving the server to 3.3.0, I'm seeing these in the logs that were not 
there before:

Jan 22 13:10:01 talonjr spamd[8959]: spamd: handled cleanup of child pid 
[8972] due to SIGCHLD: INTERRUPTED, signal 2 (0002)

Are these just more informative?  Since a quick look of 3.2.5 shows the same 
info() line, I'm worried that this isn't good.  I had 0 of these before with 
3.2.5 and and get about 5 per minute with the 3.3.0 cut 2 proposed release.

They look like spawns and then kills:

Jan 22 13:25:44 talonjr spamd[8959]: spamd: server successfully spawned 
child process, pid 9141
Jan 22 13:25:44 talonjr spamd[8959]: spamd: handled cleanup of child pid 
[9141] due to SIGCHLD: INTERRUPTED, signal 2 (0002)


After running for a few days in production I have voted to release 3.3.0.  However, I believe the root cause and meaning of these messages needs to be determined.

In 3.3.0, the relevant code changes are a single line with a reference change:

3.3.0:
unless 
($Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) {

3.2.5 (just one $ change
unless 
($$Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) {

That small change doesn't seem to be the cause.

Just in case I added info ("spamd: %s vs %s", 
$$Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}, 
$Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER});

gave me Jan 22 14:30:13 talonjr spamd[13043]: spamd: vs 0  with a 
uninitialied value error.

Looking at this makes me wonder if we have a timing change on min-spares 
causing the issue?

Jan 22 14:30:46 talonjr spamd[13043]: spamd: server successfully spawned 
child process, pid 13087
Jan 22 14:30:46 talonjr spamd[13043]: prefork: child states: BBIIIII
Jan 22 14:30:50 talonjr spamd[13043]: prefork: child states: BIIIIII
Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: BIIIIII
Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: IIIIIII
Jan 22 14:30:54 talonjr spamd[13043]: Use of uninitialized value in sprintf 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/Logger.pm line 
241.
Jan 22 14:30:54 talonjr spamd[13043]: spamd: vs 0
Jan 22 14:30:54 talonjr spamd[13043]: spamd: handled cleanup of child pid 
[13087] due to SIGCHLD: INTERRUPTED, signal 2 (0002)
Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: IIIIII

Here's the spamd command line I'm running

 /usr/local/bin/spamd -d --min-spare=5 --min-children=5 --max-conn-per-child=1000 
 --max-children=40 -q -x -u spamd --allowed-ips=127.0.0.1,removed -i 
removed -r /var/run/spamd.pid
Comment 1 Kevin A. McGrail 2010-01-25 07:29:43 UTC
Mark Martinec:

The message now includes the exit status of a finished child process.
Previously child process crashes or other abnormal terminations
were indistinguishable from regular child process terminations, which
let one to believe that everything is alright even when it wasn't,
and diagnosing the problem was difficult.

r810883 | mmartinec | 2009-09-03 12:45:38 +0200 (Thu, 03 Sep 2009)
Bug 6150, Bug 6127, Bug 5981, Bug 5950, Bug 6191: let spamd log/report
a child process exit status or aborting condition in an informative way
Comment 2 Kevin A. McGrail 2010-01-25 07:30:29 UTC
Tom Schultz:

On our server running 3.2.5, I see pairs of entries such as the following:

Jan 17 05:29:20 mackerel.adi.com spamd[2605]: spamd: server successfully
spawned child process, pid 3989
Jan 17 05:29:22 mackerel.adi.com spamd[2605]: spamd: handled cleanup of
child pid 3989 due to SIGCHLD 

In this case this child did process one message between the spawn and
cleanup. The time between spawn and cleanup varies from one second to
a few hours. With 3.3.0, I see the same pattern but with the above slightly
more verbose message.
Comment 3 Kevin A. McGrail 2010-01-25 07:30:53 UTC
Daryl C.W. O'Shea:

I think in both Tom and Kevin's cases this is caused by their min-spares
and max-spares settings.  Tom didn't show his settings, but Kevin's got
min-spares=5 and max-spares=6.  That means that if two children finish
their work while another is being spawned, that new child is going to be
killed off immediately.

If the same thing were to be happening with settings of, say,
min-spares=5 and max-spares=10, or min-spares=3 and max-spares=8 I would
be concerned.  At present, though, I am not concerned.
Comment 4 Kevin A. McGrail 2010-01-25 07:31:20 UTC
Mark Martinec:

Don't know, I can't reproduce it here. Appears like a fight
between sub need_to_del_server and need_to_add_server.

Try adding a line like:
  info("TEST1/2: INT %s", $pid);
before each of the two calls to kill() in SpamdForkScaling.pm,
and perhaps also instrument with info() both need_to_* subs.
Comment 5 Kevin A. McGrail 2010-01-25 10:58:58 UTC
Created attachment 4657 [details]
patch for spamd adjustments to info level + small fix to Makefile.PL

Mark,

I looked at bug 6127 and see the changes you are referring to:

@@ -2419,8 +2420,9 @@
     }
 
     unless ($Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) {
-      info("spamd: handled cleanup of child pid $pid ".
-                ((defined $sig) ? "due to SIG$sig" : ""));
+      info("spamd: handled cleanup of child pid [%s]%s: %s",
+           $pid, (defined $sig ? " due to SIG$sig" : ""),
+           exit_status_str($child_stat,0));


A review of my older logs does show I was seeing the older format cleanup handling sans the extra exit status.  So I think we have answered the question of where the more dire messages are coming from in 3.3.0

HOWEVER, this issue does appear to point out that spamd is spawning and killing children too often.  So I switched to checking if that was a bug or my installation.

Since my command line did NOT specify a max-spare, I think from my review that setting min-spare to 5 was a mistake.   

I've changed to min-spare to the default of 1 and increased the max-spare making the relevant options now:

--min-spare=1 --min-children=5 --max-spare=10 --max-children=40

And the scary messages are now decreased greatly and the process that are started are running a scan before being killed.  So I think the scary message is indicative of poor choice of spamd parameters.

The attached patch adds adjustments to info level rather than just debug.  I also found a bug in make distclean that left two raw conversions behind so there is a fix in the Makefile.PL for that minor issue.

What I think is ALSO needed is also a patch to the error messages, the spamd POD or a README explaining that if you are seeing processes started and killed too often, you may need to tweak your spamd parameters.

For me, monitoring with the command 'tail /var/log/maillog -f | grep -i prefork' appears to show that I've now done a much better job picking the parameters.

For example, now I see the server getting busy, no idles available, spamd children spawned, the spawned children are used and then killed more properly:

Jan 25 13:56:23 talonjr spamd[28135]: prefork: child states: IBBIIIIIII 
Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBBBBBBII 
Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBIBBBBII 
Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBIBBBIII 
Jan 25 13:56:28 talonjr spamd[28135]: prefork: child states: BBBBIBBIII 
Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBIBBBBB 
Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBBIBBBB 
Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBBIIBBB 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBB 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28353 started. 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBB 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28354 started. 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBI 
Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: IBBBBBBBBBBI 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBB 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28355 started. 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBBB 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28356 started. 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBBBI 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBIBBBBBI 
Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBIBBBI 
Jan 25 13:56:33 talonjr spamd[28135]: prefork: child states: BBBBIBBBBBBBBI 
Jan 25 13:56:34 talonjr spamd[28135]: prefork: child states: BBBBIIBBBBBBBI 
Jan 25 13:56:34 talonjr spamd[28135]: prefork: child states: BBBBIIIBBBBBBI 
Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: BBBBIIIBBBIBBI 
Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: IBBBIIIBBBIBBI 
Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: IBBBIIIBBBIIBI 
Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBBIIIBIBIIBI 
Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBBIIIBIBIIII 
Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBIIIIBIBIIII 
Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBIIIIIIBIIII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: BIBIIIIIIBIIII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28356 killed. 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: BIBIIIIIIBIII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIBIII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28355 killed. 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIBII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIIII 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28354 killed. 
Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIII 
Jan 25 13:56:39 talonjr spamd[28135]: prefork: child states: IIBIIIIIIII 
Jan 25 13:56:45 talonjr spamd[28135]: prefork: child states: BBBIIIIIIII 
Jan 25 13:56:46 talonjr spamd[28135]: prefork: child states: IBBIIIIIIII 
Jan 25 13:56:46 talonjr spamd[28135]: prefork: child states: IBBIIIIIIII
Comment 6 Kevin A. McGrail 2010-01-25 11:35:31 UTC
Another nice addition would be on need_to_del_server being able to give the number of connections (or emails) processed before the child was killed.  If the number is 0 than that might indicate an issue more specifically.

I checked the $self hash and didn't see anything obvious that might have this information:

Jan 25 14:28:06 talonjr spamd[31376]: generic: key child_just_exited - value 0 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_last_ping - value 1264447652 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key lowest_idle_pid - value 31377 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key overloaded - value 0 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key kids - value HASH(0x895d6ac) 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key max_idle - value 10 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key min_idle - value 5
Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_fileno - value ARRAY(0xa336af0) 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key cur_children_ref - value SCALAR(0x888f220) 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key max_children - value 40 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key min_children - value 5 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_fh - value ARRAY(0xa3368e0) 
Jan 25 14:28:06 talonjr spamd[31376]: generic: key backchannel - value Mail::SpamAssassin::SubProcBackChannel=HASH(0x827c7bc) 

Any thoughts?
Comment 7 Mark Martinec 2010-02-19 18:38:17 UTC
--- lib/Mail/SpamAssassin/Util.pm       (revision 911897)
+++ lib/Mail/SpamAssassin/Util.pm       (working copy)
-             $sig == 1 ? 'HANGUP' : $sig == 2 ? 'INTERRUPTED' :
+             $sig == 1 ? 'HANGUP' : $sig == 2 ? 'interrupted' :


trunk:
Bug 6304 (trivial), reduce sysadmin's stress level by lowercasing
the 'INTERRUPTED' in a logged message:
 spamd: handled cleanup of child pid [...] due to SIGCHLD: INTERRUPTED
Sending        lib/Mail/SpamAssassin/Util.pm
Committed revision 911897.

3.3:
Sending        lib/Mail/SpamAssassin/Util.pm
Committed revision 911900.
Comment 8 Mark Martinec 2010-03-01 17:29:42 UTC
> Needs 2 more votes to commit patch from comment 5

+1

Looks ok, with one nitpick:

> info("prefork: adjust: %s idle children more than %s maximum idle children.
>  Decreasing spamd children: $pid killed.", $num_idle, $self->{max_idle}, $pid);

Please never mix the sprintf-style (format, with arguments) with classical
style (no arguments) of calls to info() and dbg().  As soon as extra arguments
are specified, the first one is interpreted as a sprintf format, and interpolated
variables should generally not be allowed there (without a good reason).
Comment 9 Justin Mason 2010-03-02 17:12:35 UTC
+1 as per Mark's comments
Comment 10 Kevin A. McGrail 2010-03-02 18:29:00 UTC
Created attachment 4690 [details]
patch for spamd adjustments to info level + small fix to Makefile.PL - Changed one info call to use %s instead of $pid by accident

(In reply to comment #9)
> +1 as per Mark's comments

it's actually (In reply to comment #8)
> > Needs 2 more votes to commit patch from comment 5
> 
> +1
> 
> Looks ok, with one nitpick:
> 
> > info("prefork: adjust: %s idle children more than %s maximum idle children.
> >  Decreasing spamd children: $pid killed.", $num_idle, $self->{max_idle}, $pid);
> 
> Please never mix the sprintf-style (format, with arguments) with classical
> style (no arguments) of calls to info() and dbg().  As soon as extra arguments
> are specified, the first one is interpreted as a sprintf format, and
> interpolated
> variables should generally not be allowed there (without a good reason).

Ahh, I didn't mean to do that.  That's a bug.  I passed in the $pid to info but should have used %s.  trivial revision in the patch rev2.
Comment 11 Mark Martinec 2010-03-02 18:40:06 UTC
> Created an attachment (id=4690) [details]
> patch for spamd adjustments to info level + small fix to Makefile.PL - Changed
> one info call to use %s instead of $pid by accident

Just apply it, you already have a clearance.
Comment 12 Kevin A. McGrail 2010-03-02 19:06:42 UTC
Sending        Makefile.PL
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Sending        spamd/spamd.raw
Transmitting file data ...
Committed revision 918153.

I'll open a new bug if I can get more number of emails information (comment 6) that would make more useful information for an administrator.  Something like "Spamd killed a child that processed 0 emails. if this continues, you should tweak your parameters".

KAM
Comment 13 Mark Martinec 2010-03-02 19:12:23 UTC
> Committed revision 918153.

Thanks. That was to trunk. Now to the 3.3 branch as well, please.

> I'll open a new bug if I can get more number of emails information (comment 6)
> that would make more useful information for an administrator.  Something like
> "Spamd killed a child that processed 0 emails. if this continues, you should
> tweak your parameters".

Good.
Comment 14 Kevin A. McGrail 2010-03-03 13:34:30 UTC
> Thanks. That was to trunk. Now to the 3.3 branch as well, please.

Sending        Makefile.PL
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Sending        spamd/spamd.raw
Transmitting file data ...
Committed revision 918467.
Comment 15 Micah 2010-04-05 14:33:26 UTC
(In reply to comment #12)
> Sending        Makefile.PL
> Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
> Sending        spamd/spamd.raw
> Transmitting file data ...
> Committed revision 918153.
> 
> I'll open a new bug if I can get more number of emails information (comment 6)
> that would make more useful information for an administrator.  Something like
> "Spamd killed a child that processed 0 emails. if this continues, you should
> tweak your parameters".

I would appreciate this sort of message, however it would be even better if "tweak your parameters" gave you more of a hint of which ones to tweak, such as "consider increasing your max-children/min-spares" etc.
Comment 16 Micah 2010-04-05 14:34:47 UTC
(In reply to comment #12)
> Sending        Makefile.PL
> Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
> Sending        spamd/spamd.raw
> Transmitting file data ...
> Committed revision 918153.
> 
> I'll open a new bug if I can get more number of emails information (comment 6)
> that would make more useful information for an administrator.  Something like
> "Spamd killed a child that processed 0 emails. if this continues, you should
> tweak your parameters".

Have you opened this other bug? I can't seem to find it.
Comment 17 Kevin A. McGrail 2010-04-05 14:36:34 UTC
(In reply to comment #16)
> (In reply to comment #12)
> > Sending        Makefile.PL
> > Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
> > Sending        spamd/spamd.raw
> > Transmitting file data ...
> > Committed revision 918153.
> > 
> > I'll open a new bug if I can get more number of emails information (comment 6)
> > that would make more useful information for an administrator.  Something like
> > "Spamd killed a child that processed 0 emails. if this continues, you should
> > tweak your parameters".
> 
> Have you opened this other bug? I can't seem to find it.

We are on the same wavelength but, no, I haven't opened the new bug as I wasn't sure if others had the issue.

Please open a new bug if you would about this and I'll see what I can do!