Bug 4603 - RFE: Apache::SpamD module, to run spamd from httpd
RFE: Apache::SpamD module, to run spamd from httpd
Status: RESOLVED FIXED
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd
3.1.0
Other other
: P5 enhancement
: 3.2.0
Assigned To: SpamAssassin Developer Mailing List
:
Depends on: 4963 4992 4994
Blocks: 4917
  Show dependency tree
 
Reported: 2005-09-26 12:06 UTC by Justin Mason
Modified: 2006-09-07 16:09 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Apache::SpamD text/plain None Michael Parker [HasCLA]
Early version application/octet-stream None Radoslaw Zielinski [HasCLA]
beta version application/octet-stream None Radoslaw Zielinski [HasCLA]
RC1 application/octet-stream None Radoslaw Zielinski [HasCLA]
RC2 application/octet-stream None Radoslaw Zielinski [HasCLA]
patch on top of RC2 patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2005-09-26 12:06:30 UTC
similar to Apache::Qpsmtpd

http://www.issociate.de/board/index.php?t=msg&th=79422&rid=0

I was talking to Justin Erenkrantz this weekend, and reportedly this is quite a
good win for qpsmtpd; it offloads all the work of talking TCP, processes,
signals, etc. to apache.

anyway, this bug is a tracker for anyone who might feel the need to take this on...
Comment 1 Michael Parker 2005-09-26 13:01:46 UTC
Created attachment 3147 [details]
Apache::SpamD

Here is Apache::SpamD which was a brief proof of concept I wrote at last years
ApacheCon.  It worked enough to return basic things.  I never really had the
energy to take it much further.

One key item is that to run something like this I think you're gonna have to
live with some sort of virtual user setup since Apache won't be switching
users.	No problem for me, one reason why I wanted to be able to write all data
to SQL tables.
Comment 2 Justin Mason 2006-05-26 10:25:36 UTC
This is now a project in the Google Summer of Code 2006 -- Radoslaw Zielinski
will be working on it, with support from Google!  welcome Radoslaw ;)
Comment 3 Radoslaw Zielinski 2006-07-16 13:00:04 UTC
Created attachment 3587 [details]
Early version

Attaching something to play with.  There is a lot to be done, logging and
use preferences to name the most important things.  But most of the tricky
stuff is figured out.  See perldoc apache-spamd.pl for more information.

I've benchmarked it with --max-children 5 --min-spare 4 --max-spare 5 -x
(spamd started with LC_ALL=C) using Bench-spamd.pl with -c 3; results:

 * standalone spamd:
parsed 2000 messages in 00:05:54 (354.753661 s),
     5.6377 msgs/s (338 msgs/min, 20296 msgs/h)
parsed 2000 messages in 00:05:56 (356.282878 s),
     5.6135 msgs/s (337 msgs/min, 20209 msgs/h)

 * worker:
parsed 2000 messages in 00:03:45 (225.887553 s),
     8.8540 msgs/s (531 msgs/min, 31874 msgs/h)
parsed 2000 messages in 00:03:52 (232.813661 s),
     8.5906 msgs/s (515 msgs/min, 30926 msgs/h)

 * prefork:
parsed 2000 messages in 00:03:12 (192.53676 s),
    10.3876 msgs/s (623 msgs/min, 37395 msgs/h)
parsed 2000 messages in 00:03:12 (192.775798 s),
    10.3747 msgs/s (622 msgs/min, 37349 msgs/h)
Comment 4 Radoslaw Zielinski 2006-07-25 13:16:28 UTC
Created attachment 3610 [details]
beta version

Here it is, follows standard instalation procedure (Makefile.PL).
Includes some basic tests and documentation.

I could use some feedback... :-)
Comment 5 Justin Mason 2006-07-26 18:17:08 UTC
wow, it's almost finished!  Nearly feature-complete, at least, from the looks of
things.  excellent!

I see you implemented the feature I requested in bug 4992, to write the
apache-spamd.pl config directives to a file; that's cool. thanks!

Using IPC::Open3 is a nightmare for portability, btw -- I'm pretty sure it
doesn't work on win32 at least -- but maybe there are other issues there anyway?

how does it compare to current spamd, in speed terms?

Regarding logging.  What's the issue?  (I couldn't actually spot any logging in
that tarball.)

Should it be integrated into the main distro, or kept as a separate module
with its own Makefile.PL, do you think?  (I think I'd prefer to integrate, if
possible.)

And finally, I think it could do with more documentation and tests ;)  a lot of
that would probably make more sense after the integration-into-distro question
is resolved (e.g. "what README does it go into"). 
Comment 6 Radoslaw Zielinski 2006-08-03 04:51:54 UTC
Some comments from the dev list:
http://www.nabble.com/forum/ViewPost.jtp?post=5517464&framed=y
Comment 7 Radoslaw Zielinski 2006-08-08 17:39:57 UTC
Created attachment 3628 [details]
RC1

- implemented logging
- implemented TELL
- separated Mail::SpamAssassin::Spamd
- switched to Apache::Test

NetSet, timeout (I'm not sure how does Apache behave with SIGALRM with misc
MPMs), some more benchmarks and documentation left.
Comment 8 Radoslaw Zielinski 2006-08-17 23:17:28 UTC
Created attachment 3656 [details]
RC2

- --allowed-ips (without using NetSet after all; supports IPv6)
- timeout
- used `shell` instead of IPC::Run
- tweaks & fixes

Well... mission complete?
Comment 9 Bob Van Zant 2006-08-21 23:27:09 UTC
I'm curious to hear Radoslaw's comments as to why he thinks this is faster than running spamd the 
traditional way. What sorts of bottlenecks did this clear? Did we learn anything here that could more 
appropriately be applied to spamd instead of relying on apache?

In the test results presented in comment #3 we see that the prefork method was more efficient than the 
worker thread. Did you try running spamd with --round-robin to see if similar results are found there?

Looking through the RC2 code I'm wondering if you have a short list of things you think you'd do 
differently if you felt inclined to rework this.

RE: comment #5. For whatever weight my newbie status carries I think this should be a separate 
package at this point. For one it requires a host of new dependencies to run. Also, there's simply no 
way it's as stable as it will be in a few months.
Comment 10 Radoslaw Zielinski 2006-08-22 12:07:12 UTC
(In reply to comment #9)
> I'm curious to hear Radoslaw's comments as to why he thinks this is faster
than running spamd the 
> traditional way. What sorts of bottlenecks did this clear?

I didn't clear any obvious bottlenecks.

Spamd code has evolved for years, patch after patch without a large general
cleanup, so it became rather messy.  Since I had the luxury of having a
reference implementation to work with, I could plan how do I want it organised
with a more general view.

I have moved code around, been very careful about copying data in memory,
avoided few regexps / assignments / control statements... but that gave very
little speed up, if any.  I have focused on correctness and clarity, speed
was supposed to be the side effect.

Spamd could be just a heavy CGI script, if it didn't use its own protocol
(design error IMO; it should have been HTTP).  Now, if you have a CGI script,
you don't write your own code to read data from network and manage children,
but rather leave it to a HTTP server; don't you?  That was where the
performance gain comes from, I believe.

> Did we learn anything here that could more 
> appropriately be applied to spamd instead of relying on apache?

Yes, to reuse code.  Unfortunately, the SA dev team believes in reinventing
the wheel in the name of easy installation. ;-)  There is strong resistance
against creating new dependencies; see bug #4964 for an example.

No, the speed optimisations from spamd@apache, which might be applied to
standalone spamd are very minor.

I suspect that working on the child management code might give good results.
IMO, the right way is to switch to something like Net::Server (and optimise
it if needed).

> In the test results presented in comment #3 we see that the prefork method was
more efficient than the 
> worker thread. Did you try running spamd with --round-robin to see if similar
results are found there?

No, you're welcome to do that. :-)  I doubt it would change much.

> Looking through the RC2 code I'm wondering if you have a short list of things
you think you'd do 
> differently if you felt inclined to rework this.

spamd@apache: there is plenty of room for improvement, but I believe it's
a solid code base.  I have reworked it back and forth for last two months,
no need to do it again.

spamd: if the SA developers are positive about it, I'll patch it to use
Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config.
I'd kill the custom protocol and switch to HTTP, but that won't be done
(discussed in comment #6).         

SpamAssassin: I'd kill all the performance hacks and introduce strict OO
interface.  I believe that would get a major performance gain in the long
run.  Also, config namespaces would be useful; something like:

my $sa = Mail::SpamAssassin->new;
my $user_config = Mail::SpamAssassin::Config->new(data => $lines);
my $result = $sa->parse->check(add_config => [ $user_config ]);

Then, the it could be checked in this order:
  @{ $self->{add_config} }
  site config
  defaults

> RE: comment #5. For whatever weight my newbie status carries I think this
should be a separate 
> package at this point. For one it requires a host of new dependencies to run.
Also, there's simply no 
> way it's as stable as it will be in a few months.

If it's not integrated, it will never be used and stable.  It can live in
its own directory and be installed only if mod_perl is available.

Yes, it's probably buggy; I have not received a single success or failure
report, so probably has only been tested by me.
Comment 11 Justin Mason 2006-08-22 16:56:35 UTC
It is worth noting that many times when we've relied on third-party modules, the
results have been *more* inefficient, slow, and buggy than writing our own code.
;)  It can also create situations where bugs in the third-party modules have
appeared that would have never been an issue in our own code -- for example, the
current issue with Text::Wrap, which appears to be throwing fatal errors when SA
runs it to generate reports.

Code reuse via third-party modules is certainly not always a good idea for
Spamassassin; there's no point attempting to reuse code if the code quality is
not good enough, or the code just won't fit.

(btw: Net::Server *was* considered for use when preforking was being worked on,
but it was buggy, iirc.  I note that it still doesn't pass its tests on win32,
according to
http://seamons.com/projects/perl/net_server/net_server_preforksimple.html ...)
Comment 12 Justin Mason 2006-08-23 12:15:23 UTC
oh, btw, config namespaces: bug 3852
Comment 13 Justin Mason 2006-08-29 16:35:07 UTC
hey, I forgot to comment on this bit:

> spamd: if the SA developers are positive about it, I'll patch it to use
> Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config.

I'm definitely positive on that general plan, and I think the other
committers are too.

In general, much of spamd should be abstracted out into modules in the
Mail::SpamAssassin:: namespace. In particular, all the stuff that deals with
wierd stuff like vpopmail should be implemented as OO modules, or plugins,
implementing those mechanisms, instead of being in the monolithic spamd script.

(Much of that can't be shared with Apache-spamd anyway, though, since it often
requires setuid'ing and I don't think apache-spamd does that?)
Comment 14 Bob Van Zant 2006-08-29 16:48:23 UTC
FWIW I tried playing with this last week. I installed apache and mod_perl and then tried to run this fancy 
new version of spamd.

Surprise, surprise, I ran into problems with dependencies. Apparently my mod_perl gets pushed into 
some funky path relative to your machine (I'm on FreeBSD, installing things out of ports). I modified 
@INC in a few places and kept moving on to new problems. I eventually gave up.

To be quite blunt I don't think this feature has any business in an official release of SA. If you can't 
convince some early adopters to download this package all by itself and test it then it simply shouldn't 
be shipped with an official release (perhaps performance simply isn't a problem for the vast majority of 
users?)
Comment 15 Justin Mason 2006-08-29 17:08:02 UTC
Created attachment 3676 [details]
patch on top of RC2

I've applied patch 3656, RC2, to svn trunk as the subdir 'spamd-apache2'.  It's
basically the same as the contents of RC2, with this patch applied.  it's now
in as 1
Comment 16 Justin Mason 2006-09-05 15:07:47 UTC
marking fixed, since it's in SVN!  further issues should be opened as new bugs.

Bob: feel free to open additional bug/s, or bring it up on the users list.
Comment 17 Radoslaw Zielinski 2006-09-07 23:04:23 UTC
(In reply to comment #14)
> Surprise, surprise, I ran into problems with dependencies. Apparently my
mod_perl gets pushed into 
> some funky path relative to your machine (I'm on FreeBSD, installing things
out of ports). I modified 
> @INC in a few places and kept moving on to new problems. I eventually gave up.

I need more info to fix the problem (assuming the problem lies within
apache-spamd and your mod_perl2 is installed properly).  Could you send
(preferably to me directly) the output of "perl Makefile.PL; make test"?
Comment 18 Radoslaw Zielinski 2006-09-07 23:09:41 UTC
(In reply to comment #13)
> > spamd: if the SA developers are positive about it, I'll patch it to use
> > Mail::SpamAssassin::Spamd and Mail::SpamAssassin::Spamd::Config.
> I'm definitely positive on that general plan, and I think the other
> committers are too.

Then it will be done.

> In general, much of spamd should be abstracted out into modules in the
> Mail::SpamAssassin:: namespace. In particular, all the stuff that deals with
> wierd stuff like vpopmail should be implemented as OO modules, or plugins,
> implementing those mechanisms, instead of being in the monolithic spamd script.

vpopmail...  How about removing every single reference to it, followed by
svn ci -m 'rm vpopmail, get a real MTA'?

> (Much of that can't be shared with Apache-spamd anyway, though, since it often
> requires setuid'ing and I don't think apache-spamd does that?)

No it doesn't.  Not possible without really ugly hacks.