SA Bugzilla – Bug 2037
RFE: SQL efficiency: keep persistent DB connections (in code with a good license)
Last modified: 2019-07-08 10:26:46 UTC
Every spamd instance (fork) requires a new connection to be created to the SQL server, then a query happens, and then the session is disconnected. This is quite unnecessary and inefficient. Spamd should actually connect on startup, disconnect on shutdown, and just do queries through that open connection. Though, I realise this is hard when spamd is forking all over the place (oh for perl to have threads!). One possible way to do this would be instead would be to fork a new process just for SQL communications, and then when you start a new spamd process, get it to communicate with the SQL thread for its communications. That, or fetch user preferences BEFORE the fork is performed. In any case, this can lead to significant load on a system processing a lot of mail, due to the load not only of a bunch of spamd processes running (as I said, oh for threads), but also due to a process or thread being launched by the SQL server to accomodate these new connections. Then when you bring in table locking (for MySQL databases), this can indeed slow things down. Something to consider for future revisions. I like the fact that spamassassin intergrates with SQL databases, I find it much easier to manage users via. a SQL database than using files on disk - especially when my users manage spamassassin configurations via. a web-based mail client (so everything in this case is running as the 'www' user), so it saves using FTP to transport around files, or duplicating configurations, leaving 'flag' files about the place, etc.
Just adding a comment before I forget about it. In the absence of a pre-fork server that can have persistent database connections DBD::Proxy might be an option. I did a very small about of playing with it today and it at least worked, so it might be fairly easy to get it to cache some db connections for us. It has a pre-fork/threaded server mode already built in and knows how to handle cached_connect calls.
spamd is now uses a prefork model (as of r10000). :)
I noticed in the new 3.0.0 version, there are two 'SQL conf' files, namely: http://spamassassin.rediris.es/full/3.0.x/dist/lib/Mail/SpamAssassin/ConfSourceSQL.pm and http://spamassassin.rediris.es/full/3.0.x/dist/lib/Mail/SpamAssassin/Conf/SQL.pm What I find interesting is neither stores the DB handle in something like $self->{dbh}, and then later checks to see if its 'active'. Whereas, the SQL support for the bayesan stuff here: http://spamassassin.rediris.es/full/3.0.x/dist/lib/Mail/SpamAssassin/BayesStore/SQL.pm seems to do precisely that. All it would conceivably take would be to replace: my $dbh = DBI->connect($dsn, $dbuser, $dbpass, {'PrintError' => 0}); with: if (!defined($self->{dbh})) { $self->{dbh} = DBI->connect($dsn, $dbuser, $dbpass, {'PrintError' => 0}); } And then replace any reference to $dbh with $self->{dbh}. And of course removing the disconnect. Even better would be to decouple the DB connection and disconnection routines alltogether (as is done in the bayesan stuff). Even better STILL would be to have all DB functions in the same 'pre-forked process' use the same DB handle (regardless of whether its for bayesan, configuration, or other purposes). ie. put the dbh in the 'main' section (which is where the SQL config stuff is read from anyway). But the bare minimum of stopping the continual re-connects would be a huge benefit anyway.
moving to 3.1
sql -> assigning to michael
move bug to Future milestone (previously set to Future -- I hope)
I've written a plugin that handles persistent database connection and it can be found here: http://wiki.apache.org/spamassassin/DBIPlugin Please note that it requires SpamAssassin 3.1+ to work correctly. Closing, this will hopfully eventually make it into the main SpamAssassin codebase.
reopening -- we could still do with a plugin that doesn't use DBI due to its licensing problems.
This was a suggested idea for the Google Summer of Code 2006; I'm adding it to the bugzilla for future use, and in case anyone feels like implementing it. Subject ID: spamassassin-persistent-db-conns Keywords: perl, databases, sql Description: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2037 : persistent database connections for SpamAssassin's Bayes subsystem. Michael: 'This exists, but is not an ASL friendly license. So a "clean room" implementation might be cool.' Possible Mentors: Michael Parker (parkerm -at- pobox.com)
Back to dev
Created attachment 4105 [details] My Persistent Database connection plugin This is a plugin written during the period of Google Summer Code 2007. I have done the benchmark with MySQL and PostgreSQL.
The only real thing this needs is to turn the $CONNECT variable into a hash so that we can have it handle multiple connection strings, but that is a very minor change.
Created attachment 4111 [details] the plugin code has been updated
Closing old stale bug. Not sure if SQL is already persistent to some degree, but Redis should be used for anything high volume anyway.