SA Bugzilla – Bug 4603
RFE: Apache::SpamD module, to run spamd from httpd
Last modified: 2006-09-07 16:09:41 UTC
similar to Apache::Qpsmtpd http://www.issociate.de/board/index.php?t=msg&th=79422&rid=0 I was talking to Justin Erenkrantz this weekend, and reportedly this is quite a good win for qpsmtpd; it offloads all the work of talking TCP, processes, signals, etc. to apache. anyway, this bug is a tracker for anyone who might feel the need to take this on...
Created attachment 3147 [details] Apache::SpamD Here is Apache::SpamD which was a brief proof of concept I wrote at last years ApacheCon. It worked enough to return basic things. I never really had the energy to take it much further. One key item is that to run something like this I think you're gonna have to live with some sort of virtual user setup since Apache won't be switching users. No problem for me, one reason why I wanted to be able to write all data to SQL tables.
This is now a project in the Google Summer of Code 2006 -- Radoslaw Zielinski will be working on it, with support from Google! welcome Radoslaw ;)
Created attachment 3587 [details] Early version Attaching something to play with. There is a lot to be done, logging and use preferences to name the most important things. But most of the tricky stuff is figured out. See perldoc apache-spamd.pl for more information. I've benchmarked it with --max-children 5 --min-spare 4 --max-spare 5 -x (spamd started with LC_ALL=C) using Bench-spamd.pl with -c 3; results: * standalone spamd: parsed 2000 messages in 00:05:54 (354.753661 s), 5.6377 msgs/s (338 msgs/min, 20296 msgs/h) parsed 2000 messages in 00:05:56 (356.282878 s), 5.6135 msgs/s (337 msgs/min, 20209 msgs/h) * worker: parsed 2000 messages in 00:03:45 (225.887553 s), 8.8540 msgs/s (531 msgs/min, 31874 msgs/h) parsed 2000 messages in 00:03:52 (232.813661 s), 8.5906 msgs/s (515 msgs/min, 30926 msgs/h) * prefork: parsed 2000 messages in 00:03:12 (192.53676 s), 10.3876 msgs/s (623 msgs/min, 37395 msgs/h) parsed 2000 messages in 00:03:12 (192.775798 s), 10.3747 msgs/s (622 msgs/min, 37349 msgs/h)
Created attachment 3610 [details] beta version Here it is, follows standard instalation procedure (Makefile.PL). Includes some basic tests and documentation. I could use some feedback... :-)
wow, it's almost finished! Nearly feature-complete, at least, from the looks of things. excellent! I see you implemented the feature I requested in bug 4992, to write the apache-spamd.pl config directives to a file; that's cool. thanks! Using IPC::Open3 is a nightmare for portability, btw -- I'm pretty sure it doesn't work on win32 at least -- but maybe there are other issues there anyway? how does it compare to current spamd, in speed terms? Regarding logging. What's the issue? (I couldn't actually spot any logging in that tarball.) Should it be integrated into the main distro, or kept as a separate module with its own Makefile.PL, do you think? (I think I'd prefer to integrate, if possible.) And finally, I think it could do with more documentation and tests ;) a lot of that would probably make more sense after the integration-into-distro question is resolved (e.g. "what README does it go into").
Some comments from the dev list: http://www.nabble.com/forum/ViewPost.jtp?post=5517464&framed=y
Created attachment 3628 [details] RC1 - implemented logging - implemented TELL - separated Mail::SpamAssassin::Spamd - switched to Apache::Test NetSet, timeout (I'm not sure how does Apache behave with SIGALRM with misc MPMs), some more benchmarks and documentation left.
Created attachment 3656 [details] RC2 - --allowed-ips (without using NetSet after all; supports IPv6) - timeout - used `shell` instead of IPC::Run - tweaks & fixes Well... mission complete?
I'm curious to hear Radoslaw's comments as to why he thinks this is faster than running spamd the traditional way. What sorts of bottlenecks did this clear? Did we learn anything here that could more appropriately be applied to spamd instead of relying on apache? In the test results presented in comment #3 we see that the prefork method was more efficient than the worker thread. Did you try running spamd with --round-robin to see if similar results are found there? Looking through the RC2 code I'm wondering if you have a short list of things you think you'd do differently if you felt inclined to rework this. RE: comment #5. For whatever weight my newbie status carries I think this should be a separate package at this point. For one it requires a host of new dependencies to run. Also, there's simply no way it's as stable as it will be in a few months.
(In reply to comment #9) > I'm curious to hear Radoslaw's comments as to why he thinks this is faster than running spamd the > traditional way. What sorts of bottlenecks did this clear? I didn't clear any obvious bottlenecks. Spamd code has evolved for years, patch after patch without a large general cleanup, so it became rather messy. Since I had the luxury of having a reference implementation to work with, I could plan how do I want it organised with a more general view. I have moved code around, been very careful about copying data in memory, avoided few regexps / assignments / control statements... but that gave very little speed up, if any. I have focused on correctness and clarity, speed was supposed to be the side effect. Spamd could be just a heavy CGI script, if it didn't use its own protocol (design error IMO; it should have been HTTP). Now, if you have a CGI script, you don't write your own code to read data from network and manage children, but rather leave it to a HTTP server; don't you? That was where the performance gain comes from, I believe. > Did we learn anything here that could more > appropriately be applied to spamd instead of relying on apache? Yes, to reuse code. Unfortunately, the SA dev team believes in reinventing the wheel in the name of easy installation. ;-) There is strong resistance against creating new dependencies; see bug #4964 for an example. No, the speed optimisations from spamd@apache, which might be applied to standalone spamd are very minor. I suspect that working on the child management code might give good results. IMO, the right way is to switch to something like Net::Server (and optimise it if needed). > In the test results presented in comment #3 we see that the prefork method was more efficient than the > worker thread. Did you try running spamd with --round-robin to see if similar results are found there? No, you're welcome to do that. :-) I doubt it would change much. > Looking through the RC2 code I'm wondering if you have a short list of things you think you'd do > differently if you felt inclined to rework this. spamd@apache: there is plenty of room for improvement, but I believe it's a solid code base. I have reworked it back and forth for last two months, no need to do it again. spamd: if the SA developers are positive about it, I'll patch it to use Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config. I'd kill the custom protocol and switch to HTTP, but that won't be done (discussed in comment #6). SpamAssassin: I'd kill all the performance hacks and introduce strict OO interface. I believe that would get a major performance gain in the long run. Also, config namespaces would be useful; something like: my $sa = Mail::SpamAssassin->new; my $user_config = Mail::SpamAssassin::Config->new(data => $lines); my $result = $sa->parse->check(add_config => [ $user_config ]); Then, the it could be checked in this order: @{ $self->{add_config} } site config defaults > RE: comment #5. For whatever weight my newbie status carries I think this should be a separate > package at this point. For one it requires a host of new dependencies to run. Also, there's simply no > way it's as stable as it will be in a few months. If it's not integrated, it will never be used and stable. It can live in its own directory and be installed only if mod_perl is available. Yes, it's probably buggy; I have not received a single success or failure report, so probably has only been tested by me.
It is worth noting that many times when we've relied on third-party modules, the results have been *more* inefficient, slow, and buggy than writing our own code. ;) It can also create situations where bugs in the third-party modules have appeared that would have never been an issue in our own code -- for example, the current issue with Text::Wrap, which appears to be throwing fatal errors when SA runs it to generate reports. Code reuse via third-party modules is certainly not always a good idea for Spamassassin; there's no point attempting to reuse code if the code quality is not good enough, or the code just won't fit. (btw: Net::Server *was* considered for use when preforking was being worked on, but it was buggy, iirc. I note that it still doesn't pass its tests on win32, according to http://seamons.com/projects/perl/net_server/net_server_preforksimple.html ...)
oh, btw, config namespaces: bug 3852
hey, I forgot to comment on this bit: > spamd: if the SA developers are positive about it, I'll patch it to use > Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config. I'm definitely positive on that general plan, and I think the other committers are too. In general, much of spamd should be abstracted out into modules in the Mail::SpamAssassin:: namespace. In particular, all the stuff that deals with wierd stuff like vpopmail should be implemented as OO modules, or plugins, implementing those mechanisms, instead of being in the monolithic spamd script. (Much of that can't be shared with Apache-spamd anyway, though, since it often requires setuid'ing and I don't think apache-spamd does that?)
FWIW I tried playing with this last week. I installed apache and mod_perl and then tried to run this fancy new version of spamd. Surprise, surprise, I ran into problems with dependencies. Apparently my mod_perl gets pushed into some funky path relative to your machine (I'm on FreeBSD, installing things out of ports). I modified @INC in a few places and kept moving on to new problems. I eventually gave up. To be quite blunt I don't think this feature has any business in an official release of SA. If you can't convince some early adopters to download this package all by itself and test it then it simply shouldn't be shipped with an official release (perhaps performance simply isn't a problem for the vast majority of users?)
Created attachment 3676 [details] patch on top of RC2 I've applied patch 3656, RC2, to svn trunk as the subdir 'spamd-apache2'. It's basically the same as the contents of RC2, with this patch applied. it's now in as r438116
marking fixed, since it's in SVN! further issues should be opened as new bugs. Bob: feel free to open additional bug/s, or bring it up on the users list.
(In reply to comment #14) > Surprise, surprise, I ran into problems with dependencies. Apparently my mod_perl gets pushed into > some funky path relative to your machine (I'm on FreeBSD, installing things out of ports). I modified > @INC in a few places and kept moving on to new problems. I eventually gave up. I need more info to fix the problem (assuming the problem lies within apache-spamd and your mod_perl2 is installed properly). Could you send (preferably to me directly) the output of "perl Makefile.PL; make test"?
(In reply to comment #13) > > spamd: if the SA developers are positive about it, I'll patch it to use > > Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config. > I'm definitely positive on that general plan, and I think the other > committers are too. Then it will be done. > In general, much of spamd should be abstracted out into modules in the > Mail::SpamAssassin:: namespace. In particular, all the stuff that deals with > wierd stuff like vpopmail should be implemented as OO modules, or plugins, > implementing those mechanisms, instead of being in the monolithic spamd script. vpopmail... How about removing every single reference to it, followed by svn ci -m 'rm vpopmail, get a real MTA'? > (Much of that can't be shared with Apache-spamd anyway, though, since it often > requires setuid'ing and I don't think apache-spamd does that?) No it doesn't. Not possible without really ugly hacks.