SA Bugzilla – Bug 6304
[review] spamd is spawning and killing processes too often
Last modified: 2010-04-05 14:36:34 UTC
Since moving the server to 3.3.0, I'm seeing these in the logs that were not there before: Jan 22 13:10:01 talonjr spamd[8959]: spamd: handled cleanup of child pid [8972] due to SIGCHLD: INTERRUPTED, signal 2 (0002) Are these just more informative? Since a quick look of 3.2.5 shows the same info() line, I'm worried that this isn't good. I had 0 of these before with 3.2.5 and and get about 5 per minute with the 3.3.0 cut 2 proposed release. They look like spawns and then kills: Jan 22 13:25:44 talonjr spamd[8959]: spamd: server successfully spawned child process, pid 9141 Jan 22 13:25:44 talonjr spamd[8959]: spamd: handled cleanup of child pid [9141] due to SIGCHLD: INTERRUPTED, signal 2 (0002) After running for a few days in production I have voted to release 3.3.0. However, I believe the root cause and meaning of these messages needs to be determined. In 3.3.0, the relevant code changes are a single line with a reference change: 3.3.0: unless ($Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) { 3.2.5 (just one $ change unless ($$Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) { That small change doesn't seem to be the cause. Just in case I added info ("spamd: %s vs %s", $$Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}, $Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}); gave me Jan 22 14:30:13 talonjr spamd[13043]: spamd: vs 0 with a uninitialied value error. Looking at this makes me wonder if we have a timing change on min-spares causing the issue? Jan 22 14:30:46 talonjr spamd[13043]: spamd: server successfully spawned child process, pid 13087 Jan 22 14:30:46 talonjr spamd[13043]: prefork: child states: BBIIIII Jan 22 14:30:50 talonjr spamd[13043]: prefork: child states: BIIIIII Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: BIIIIII Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: IIIIIII Jan 22 14:30:54 talonjr spamd[13043]: Use of uninitialized value in sprintf at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/Logger.pm line 241. Jan 22 14:30:54 talonjr spamd[13043]: spamd: vs 0 Jan 22 14:30:54 talonjr spamd[13043]: spamd: handled cleanup of child pid [13087] due to SIGCHLD: INTERRUPTED, signal 2 (0002) Jan 22 14:30:54 talonjr spamd[13043]: prefork: child states: IIIIII Here's the spamd command line I'm running /usr/local/bin/spamd -d --min-spare=5 --min-children=5 --max-conn-per-child=1000 --max-children=40 -q -x -u spamd --allowed-ips=127.0.0.1,removed -i removed -r /var/run/spamd.pid
Mark Martinec: The message now includes the exit status of a finished child process. Previously child process crashes or other abnormal terminations were indistinguishable from regular child process terminations, which let one to believe that everything is alright even when it wasn't, and diagnosing the problem was difficult. r810883 | mmartinec | 2009-09-03 12:45:38 +0200 (Thu, 03 Sep 2009) Bug 6150, Bug 6127, Bug 5981, Bug 5950, Bug 6191: let spamd log/report a child process exit status or aborting condition in an informative way
Tom Schultz: On our server running 3.2.5, I see pairs of entries such as the following: Jan 17 05:29:20 mackerel.adi.com spamd[2605]: spamd: server successfully spawned child process, pid 3989 Jan 17 05:29:22 mackerel.adi.com spamd[2605]: spamd: handled cleanup of child pid 3989 due to SIGCHLD In this case this child did process one message between the spawn and cleanup. The time between spawn and cleanup varies from one second to a few hours. With 3.3.0, I see the same pattern but with the above slightly more verbose message.
Daryl C.W. O'Shea: I think in both Tom and Kevin's cases this is caused by their min-spares and max-spares settings. Tom didn't show his settings, but Kevin's got min-spares=5 and max-spares=6. That means that if two children finish their work while another is being spawned, that new child is going to be killed off immediately. If the same thing were to be happening with settings of, say, min-spares=5 and max-spares=10, or min-spares=3 and max-spares=8 I would be concerned. At present, though, I am not concerned.
Mark Martinec: Don't know, I can't reproduce it here. Appears like a fight between sub need_to_del_server and need_to_add_server. Try adding a line like: info("TEST1/2: INT %s", $pid); before each of the two calls to kill() in SpamdForkScaling.pm, and perhaps also instrument with info() both need_to_* subs.
Created attachment 4657 [details] patch for spamd adjustments to info level + small fix to Makefile.PL Mark, I looked at bug 6127 and see the changes you are referring to: @@ -2419,8 +2420,9 @@ } unless ($Mail::SpamAssassin::Logger::LOG_SA{INHIBIT_LOGGING_IN_SIGCHLD_HANDLER}) { - info("spamd: handled cleanup of child pid $pid ". - ((defined $sig) ? "due to SIG$sig" : "")); + info("spamd: handled cleanup of child pid [%s]%s: %s", + $pid, (defined $sig ? " due to SIG$sig" : ""), + exit_status_str($child_stat,0)); A review of my older logs does show I was seeing the older format cleanup handling sans the extra exit status. So I think we have answered the question of where the more dire messages are coming from in 3.3.0 HOWEVER, this issue does appear to point out that spamd is spawning and killing children too often. So I switched to checking if that was a bug or my installation. Since my command line did NOT specify a max-spare, I think from my review that setting min-spare to 5 was a mistake. I've changed to min-spare to the default of 1 and increased the max-spare making the relevant options now: --min-spare=1 --min-children=5 --max-spare=10 --max-children=40 And the scary messages are now decreased greatly and the process that are started are running a scan before being killed. So I think the scary message is indicative of poor choice of spamd parameters. The attached patch adds adjustments to info level rather than just debug. I also found a bug in make distclean that left two raw conversions behind so there is a fix in the Makefile.PL for that minor issue. What I think is ALSO needed is also a patch to the error messages, the spamd POD or a README explaining that if you are seeing processes started and killed too often, you may need to tweak your spamd parameters. For me, monitoring with the command 'tail /var/log/maillog -f | grep -i prefork' appears to show that I've now done a much better job picking the parameters. For example, now I see the server getting busy, no idles available, spamd children spawned, the spawned children are used and then killed more properly: Jan 25 13:56:23 talonjr spamd[28135]: prefork: child states: IBBIIIIIII Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBBBBBBII Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBIBBBBII Jan 25 13:56:27 talonjr spamd[28135]: prefork: child states: IBBIBBBIII Jan 25 13:56:28 talonjr spamd[28135]: prefork: child states: BBBBIBBIII Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBIBBBBB Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBBIBBBB Jan 25 13:56:29 talonjr spamd[28135]: prefork: child states: BBBBBIIBBB Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBB Jan 25 13:56:31 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28353 started. Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBB Jan 25 13:56:31 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28354 started. Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBI Jan 25 13:56:31 talonjr spamd[28135]: prefork: child states: IBBBBBBBBBBI Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBB Jan 25 13:56:32 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28355 started. Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBBB Jan 25 13:56:32 talonjr spamd[28135]: prefork: adjust: 0 idle children less than 1 minimum idle children. Increasing spamd children: 28356 started. Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBBBBBI Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBIBBBBBI Jan 25 13:56:32 talonjr spamd[28135]: prefork: child states: BBBBBBBBBIBBBI Jan 25 13:56:33 talonjr spamd[28135]: prefork: child states: BBBBIBBBBBBBBI Jan 25 13:56:34 talonjr spamd[28135]: prefork: child states: BBBBIIBBBBBBBI Jan 25 13:56:34 talonjr spamd[28135]: prefork: child states: BBBBIIIBBBBBBI Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: BBBBIIIBBBIBBI Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: IBBBIIIBBBIBBI Jan 25 13:56:35 talonjr spamd[28135]: prefork: child states: IBBBIIIBBBIIBI Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBBIIIBIBIIBI Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBBIIIBIBIIII Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBIIIIBIBIIII Jan 25 13:56:36 talonjr spamd[28135]: prefork: child states: BBBIIIIIIBIIII Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: BIBIIIIIIBIIII Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28356 killed. Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: BIBIIIIIIBIII Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIBIII Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28355 killed. Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIBII Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIIII Jan 25 13:56:37 talonjr spamd[28135]: prefork: adjust: 11 idle children more than 10 maximum idle children. Decreasing spamd children: 28354 killed. Jan 25 13:56:37 talonjr spamd[28135]: prefork: child states: IIBIIIIIIII Jan 25 13:56:39 talonjr spamd[28135]: prefork: child states: IIBIIIIIIII Jan 25 13:56:45 talonjr spamd[28135]: prefork: child states: BBBIIIIIIII Jan 25 13:56:46 talonjr spamd[28135]: prefork: child states: IBBIIIIIIII Jan 25 13:56:46 talonjr spamd[28135]: prefork: child states: IBBIIIIIIII
Another nice addition would be on need_to_del_server being able to give the number of connections (or emails) processed before the child was killed. If the number is 0 than that might indicate an issue more specifically. I checked the $self hash and didn't see anything obvious that might have this information: Jan 25 14:28:06 talonjr spamd[31376]: generic: key child_just_exited - value 0 Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_last_ping - value 1264447652 Jan 25 14:28:06 talonjr spamd[31376]: generic: key lowest_idle_pid - value 31377 Jan 25 14:28:06 talonjr spamd[31376]: generic: key overloaded - value 0 Jan 25 14:28:06 talonjr spamd[31376]: generic: key kids - value HASH(0x895d6ac) Jan 25 14:28:06 talonjr spamd[31376]: generic: key max_idle - value 10 Jan 25 14:28:06 talonjr spamd[31376]: generic: key min_idle - value 5 Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_fileno - value ARRAY(0xa336af0) Jan 25 14:28:06 talonjr spamd[31376]: generic: key cur_children_ref - value SCALAR(0x888f220) Jan 25 14:28:06 talonjr spamd[31376]: generic: key max_children - value 40 Jan 25 14:28:06 talonjr spamd[31376]: generic: key min_children - value 5 Jan 25 14:28:06 talonjr spamd[31376]: generic: key server_fh - value ARRAY(0xa3368e0) Jan 25 14:28:06 talonjr spamd[31376]: generic: key backchannel - value Mail::SpamAssassin::SubProcBackChannel=HASH(0x827c7bc) Any thoughts?
--- lib/Mail/SpamAssassin/Util.pm (revision 911897) +++ lib/Mail/SpamAssassin/Util.pm (working copy) - $sig == 1 ? 'HANGUP' : $sig == 2 ? 'INTERRUPTED' : + $sig == 1 ? 'HANGUP' : $sig == 2 ? 'interrupted' : trunk: Bug 6304 (trivial), reduce sysadmin's stress level by lowercasing the 'INTERRUPTED' in a logged message: spamd: handled cleanup of child pid [...] due to SIGCHLD: INTERRUPTED Sending lib/Mail/SpamAssassin/Util.pm Committed revision 911897. 3.3: Sending lib/Mail/SpamAssassin/Util.pm Committed revision 911900.
> Needs 2 more votes to commit patch from comment 5 +1 Looks ok, with one nitpick: > info("prefork: adjust: %s idle children more than %s maximum idle children. > Decreasing spamd children: $pid killed.", $num_idle, $self->{max_idle}, $pid); Please never mix the sprintf-style (format, with arguments) with classical style (no arguments) of calls to info() and dbg(). As soon as extra arguments are specified, the first one is interpreted as a sprintf format, and interpolated variables should generally not be allowed there (without a good reason).
+1 as per Mark's comments
Created attachment 4690 [details] patch for spamd adjustments to info level + small fix to Makefile.PL - Changed one info call to use %s instead of $pid by accident (In reply to comment #9) > +1 as per Mark's comments it's actually (In reply to comment #8) > > Needs 2 more votes to commit patch from comment 5 > > +1 > > Looks ok, with one nitpick: > > > info("prefork: adjust: %s idle children more than %s maximum idle children. > > Decreasing spamd children: $pid killed.", $num_idle, $self->{max_idle}, $pid); > > Please never mix the sprintf-style (format, with arguments) with classical > style (no arguments) of calls to info() and dbg(). As soon as extra arguments > are specified, the first one is interpreted as a sprintf format, and > interpolated > variables should generally not be allowed there (without a good reason). Ahh, I didn't mean to do that. That's a bug. I passed in the $pid to info but should have used %s. trivial revision in the patch rev2.
> Created an attachment (id=4690) [details] > patch for spamd adjustments to info level + small fix to Makefile.PL - Changed > one info call to use %s instead of $pid by accident Just apply it, you already have a clearance.
Sending Makefile.PL Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm Sending spamd/spamd.raw Transmitting file data ... Committed revision 918153. I'll open a new bug if I can get more number of emails information (comment 6) that would make more useful information for an administrator. Something like "Spamd killed a child that processed 0 emails. if this continues, you should tweak your parameters". KAM
> Committed revision 918153. Thanks. That was to trunk. Now to the 3.3 branch as well, please. > I'll open a new bug if I can get more number of emails information (comment 6) > that would make more useful information for an administrator. Something like > "Spamd killed a child that processed 0 emails. if this continues, you should > tweak your parameters". Good.
> Thanks. That was to trunk. Now to the 3.3 branch as well, please. Sending Makefile.PL Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm Sending spamd/spamd.raw Transmitting file data ... Committed revision 918467.
(In reply to comment #12) > Sending Makefile.PL > Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm > Sending spamd/spamd.raw > Transmitting file data ... > Committed revision 918153. > > I'll open a new bug if I can get more number of emails information (comment 6) > that would make more useful information for an administrator. Something like > "Spamd killed a child that processed 0 emails. if this continues, you should > tweak your parameters". I would appreciate this sort of message, however it would be even better if "tweak your parameters" gave you more of a hint of which ones to tweak, such as "consider increasing your max-children/min-spares" etc.
(In reply to comment #12) > Sending Makefile.PL > Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm > Sending spamd/spamd.raw > Transmitting file data ... > Committed revision 918153. > > I'll open a new bug if I can get more number of emails information (comment 6) > that would make more useful information for an administrator. Something like > "Spamd killed a child that processed 0 emails. if this continues, you should > tweak your parameters". Have you opened this other bug? I can't seem to find it.
(In reply to comment #16) > (In reply to comment #12) > > Sending Makefile.PL > > Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm > > Sending spamd/spamd.raw > > Transmitting file data ... > > Committed revision 918153. > > > > I'll open a new bug if I can get more number of emails information (comment 6) > > that would make more useful information for an administrator. Something like > > "Spamd killed a child that processed 0 emails. if this continues, you should > > tweak your parameters". > > Have you opened this other bug? I can't seem to find it. We are on the same wavelength but, no, I haven't opened the new bug as I wasn't sure if others had the issue. Please open a new bug if you would about this and I'll see what I can do!