SA Bugzilla – Bug 5293
RFE: pluginize the Bayes subsystem
Last modified: 2008-01-18 02:28:49 UTC
as referred to in bug 5291 and bug 4876, a long-term goal has been a plugin API for learning classifiers, of which the built-in Bayes code is just one. Currently that code (in Mail::SpamAssassin::Bayes) is deeply hooked into the rest of the core codebase. It would be much better to make it pluginized, so that: a) other probabilistic classifiers can replace/augment it (bug 4876) b) people who don't want Bayes, like massive ISPs, can turn it off and save several MBs of per-child memory (although current code should be quite good about this, since it's lazy-loaded)
I'm going to do this, since I'm planning to investigate an alternative pluggable classifier for 3.3.0...
Would you mind posting your hooks and plans as you go along (a full blow spec if you feel like it :))? I'm very keen on this as well. Thanks
(In reply to comment #2) > Would you mind posting your hooks and plans as you go along (a full blow spec if > you feel like it :))? I'm very keen on this as well. full-blown spec! not a hope ;) here's the hooks I've got so far, though: sanity_check_bayes_is_untied learn_message forget_message learner_sync learner_expire_old_training learner_is_scan_available learner_dump_database learner_get_implementation They match up with the methods in Bayes.pm, pretty much. So it's a very high-level pluginization; I figure if different learner backends wind up sharing code, it'll be easier to refactor and do that "behind the scenes", than try to design it upfront into the plugin hooks, without having the benefit of writing an alternative backend to learn from. Also, in my experience writing the first few drafts of Bayes.pm, the storage requirements for different backends were radically different, so code sharing may just not be possible :( some notes on those hooks: 'sanity_check_bayes_is_untied': I'm not sure if this should still be exposed as a hook, to be honest. it probably needs a better name anyway. 'learner_get_implementation': this is definitely a hack; it's needed for the test suite to work. it probably doesn't need to be a public hook, since it can be kept undocumented ;) 'learner_expire_old_training', 'learner_sync': backends for which this makes no sense, can of course just "return 1" for these.
ok, check out https://svn.apache.org/repos/asf/spamassassin/branches/bug-5293-pluginized-bayes -- I've just checked in a first blast at this. All the bayes*.t tests pass, so it seems functional...
regarding the memory-footprint reduction when the Bayes plugin is off: here's ps axuww output with bayes on: jm 7651 40732:36992 0 pts/7 0:00 spamd child and with the Bayes plugin commented: jm 7678 39500:35812 0 pts/7 0:00 spamd child not exactly massive. still, there's probably ways to reduce that further...
I've now merged this up to current SVN trunk's HEAD, and added documentation to Plugin.pm for the new APIs. There are a few minor updates: - 'sanity_check_bayes_is_untied': I renamed this to the much clearer 'learner_close', which is, it turns out, what it should be used for. ;) It's just our basic impl that optimises things to such an extent that it's actually used as just a sanity check. - 'learner_get_implementation' is less thorny now, since it is only used in the bayes-specific test cases. - the {bayes} member in the Mail::SpamAssassin::BayesStore classes is now a ref to a Mail::SpamAssassin::Plugin::Bayes object, not a Mail::SpamAssassin::Bayes. - there are duplicates of the Mail::SpamAssassin::BayesStore classes under Mail/SpamAssassin/OSBF/Store, used to provide storage for the experimental OSBF classifier (bug 5686). if we get OSBF working ok, it would be nice to refactor to share more code. no biggie. - there are config settings for the new, experimental OSBF code commented out in Conf.pm. Otherwise it's unchanged. I propose to merge this into trunk and get rid of this branch... speak up if you're not keen ;) I'll attach the changes in patch form for review.
Created attachment 4203 [details] svn diff for review this is a (slightly rearranged) copy of svn diff https://svn.apache.org/repos/asf/spamassassin/trunk https://svn.apache.org/repos/asf/spamassassin/branches/bug-5293-pluginized-bayes
I'd forgotten about this -- I thought it was already in 3.3.0! so I've merged up to latest trunk HEAD, again, as of r613117; and then merged the Bayes-only parts of that, omitting the bug 5686 (OSBF) parts, into trunk. That's now in trunk as : jm 130...; svn commit -m "bug 5923: pluginize the Bayes subsystem, allowing replacement of our probabilistic classifier with alternative learning plugins. Merged from the bug-5293-pluginized-bayes branch" Sending lib/Mail/SpamAssassin/Bayes.pm Sending lib/Mail/SpamAssassin/BayesStore/DBM.pm Sending lib/Mail/SpamAssassin/BayesStore/SQL.pm Sending lib/Mail/SpamAssassin/BayesStore.pm Sending lib/Mail/SpamAssassin/Conf.pm Sending lib/Mail/SpamAssassin/PerMsgLearner.pm Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Sending lib/Mail/SpamAssassin/Plugin/Bayes.pm Sending lib/Mail/SpamAssassin/Plugin/WLBLEval.pm Sending lib/Mail/SpamAssassin/Plugin.pm Sending lib/Mail/SpamAssassin.pm Sending masses/bayes-testing/bayes-10pcv-driver Sending rules/23_bayes.cf Sending t/bayesdbm.t Sending t/bayesdbm_flock.t Sending t/bayessdbm.t Sending t/bayessdbm_seen_delete.t Sending t/bayessql.t Transmitting file data .................. Committed revision 613124.