SA Bugzilla – Bug 5665
[review] spamd keeps dead kids in state 'K', causing child hash to fill up
Last modified: 2007-12-16 13:20:11 UTC
We're running a cluster of 4 spamd servers on Debian etch, amd64. Since a recent upgrade to 3.2.3, spamd sometimes fails to notice that an exiting child has in fact exited (ps and top confirm the process is gone) and retains a ghost record for it in the K state. Over time, these ghost records fill up spamd's internal child tracking table, and eventually all processing stalls.

With the default values for --min-children, --min-spare, and --max-conn-per-child, the first ghost entry shows up within about 15 minutes. Raising one or more of these seems to make the problem less likely.

Each ghost entry appears along with a set of log entries like these:

prefork: cannot ping 25046, file handle not defined, child likely to still be processing SIGCHLD handler after killing itself
prefork: killing failed child 25046 fd=undefined at /opt/spamassassin-3.2.3/share/perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 171.
prefork: kill of failed child 25046 failed: No such process
prefork: killed child 25046

This appears to be similar to bug 5313, but inverted; the child processes *are* killed successfully according to the OS, but spamd doesn't find out about it. Checking with ps or top shows that the PID in the log has in fact exited.

Enabling --round-robin seems to work around the problem for now, but the overall system load is much higher.

SA is installed from source on all four machines by a script set up to keep the installations as close as possible to identical. The Bayes DB is in MySQL on one machine; that system is slightly slower to lose track of its spamd children than the other 3.
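The symptom described above (an entry stuck in state 'K' while the OS reports "No such process" for that PID) can be illustrated with a rough sketch. This is not the attached patch and not spamd's code: the child table is modelled as a plain dict, and the names pid_exists and purge_ghosts are hypothetical. It only shows the general idea of probing a PID with signal 0 and dropping K-state entries whose process is gone.

```python
import os
from typing import Dict, List

# Only the 'K' label comes from the report; the dict layout is an assumption.
STATE_KILLED = "K"


def pid_exists(pid: int) -> bool:
    """Probe a PID with signal 0, which delivers nothing but checks existence."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH, the same "No such process" condition seen in the log.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True


def purge_ghosts(children: Dict[int, str]) -> List[int]:
    """Remove 'K'-state entries whose process no longer exists.

    Returns the list of PIDs that were dropped from the table.
    """
    ghosts = [pid for pid, state in children.items()
              if state == STATE_KILLED and not pid_exists(pid)]
    for pid in ghosts:
        del children[pid]
    return ghosts
```

Note that a child which has exited but has not yet been reaped (a zombie) still counts as existing for a signal-0 probe, so a check like this only catches entries whose process has already been fully reaped, which matches the "No such process" case in the log.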
Created attachment 4142
Quick hack to clean up ghost K-state children

The attached patch seems to be working to eliminate the ghost K-state children; I've patched the four production machines that were showing the problem and all four are stable. One has been running for ~3 hours, where it would have accumulated ~5-8 (possibly more) ghost children without the patch during that time.

The patch as-is includes some debug "logging", and could probably be vastly improved.
Kris, if that patch works OK, it looks good to me. Could you monitor it for a few more days and let me know if it's still working, by the end of that? If it is, I'll add the patch to SVN and 3.2.x.
(In reply to comment #2)
> Kris, if that patch works OK, it looks good to me. Could you monitor it for a
> few more days and let me know if it's still working, by the end of that? If
> it is, I'll add the patch to SVN and 3.2.x.

ACK OK. FWIW, it's been stable well beyond the point of "spamd ran out of child slots" already, but I'll still watch it for another day or so to make sure it doesn't eat the servers or stomp all over something else.

On one machine I'm seeing the "prefork: debug:" notes every ~3 minutes. O_o

I honestly can't tell whether this is just papering over the "real" problem somewhere else, or doing exactly what I intended and providing a little extra cleanup where it's needed.
(In reply to comment #3)
> FWIW, it's been stable well beyond the point of "spamd ran out of child slots"
> already, but I'll still watch it for another day or so to make sure it doesn't
> eat the servers or stomp all over something else.

Still stable on all four machines that were showing the problem. None have needed spamd restarted since I applied the patch; unpatched spamd would run out of child slots within 6-8 hours at most.

No apparent problems with any other services (not that there's much else beyond SA). No zombie children left hanging around where there shouldn't be.
Created attachment 4143
minor tweak

Kris, could you try this version of the patch? it removes a redundant delete_socket_for_child() call and quiets down the debugging, but otherwise should be exactly the same.
(In reply to comment #5)
> Created an attachment (id=4143)
> minor tweak
>
> Kris, could you try this version of the patch? it removes a redundant
> delete_socket_for_child() call and quiets down the debugging, but otherwise
> should be exactly the same.

Seems to be working; one system is stable for 3 hours so far. Unpatched, ghost children usually show up within 15-20 minutes.
ok, applied to 3.3.0:

: jm 189...; svn commit -m "bug 5665: spamd may fail to notice that a child has completed exiting, and keeps it in the child list in state 'K', eventually filling up the child list with 'ghost' children. fix" lib/Mail/SpamAssassin/SpamdForkScaling.pm
Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm
Transmitting file data .
Committed revision 582610.

committers, votes please...
From my memory of how SpamdForkScaling works it looks safe, so +1.
+1
fix checked in for 3.2.x: r604706