Bug 5293 - RFE: pluginize the Bayes subsystem
Summary: RFE: pluginize the Bayes subsystem
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 5686
  Show dependency tree
 
Reported: 2007-01-11 06:09 UTC by Justin Mason
Modified: 2008-01-18 02:28 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
svn diff for review patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2007-01-11 06:09:56 UTC
as referred to in bug 5291 and bug 4876, a long-term goal has been a plugin API
for learning classifiers, of which the built-in Bayes code is just one.

Currently that code (in Mail::SpamAssassin::Bayes) is deeply hooked into
the rest of the core codebase.  It would be much better to make it pluginized,
so that:

a) other probabilistic classifiers can replace/augment it (bug 4876)

b) people who don't want Bayes, like massive ISPs, can turn it off and save
several MBs of per-child memory (although current code should be quite good
about this, since it's lazy-loaded)
Comment 1 Justin Mason 2007-10-12 10:16:45 UTC
I'm going to do this, since I'm planning to investigate an alternative pluggable
classifier for 3.3.0...
Comment 2 Michael Parker 2007-10-12 10:21:14 UTC
Would you mind posting your hooks and plans as you go along (a full blow spec if
you feel like it :))?  I'm very keen on this as well.

Thanks
Comment 3 Justin Mason 2007-10-12 10:51:26 UTC
(In reply to comment #2)
> Would you mind posting your hooks and plans as you go along (a full blow spec if
> you feel like it :))?  I'm very keen on this as well.

full-blown spec!  not a hope ;)   here's the hooks I've got so far, though:

sanity_check_bayes_is_untied
learn_message
forget_message
learner_sync
learner_expire_old_training
learner_is_scan_available
learner_dump_database
learner_get_implementation

They match up with the methods in Bayes.pm, pretty much.  So it's a very
high-level pluginization; I figure if different learner backends wind up sharing
code, it'll be easier to refactor and do that "behind the scenes", than try to
design it upfront into the plugin hooks, without having the benefit of writing
an alternative backend to learn from.  

Also, in my experience writing the first few drafts of Bayes.pm, the storage
requirements for different backends were radically different, so code sharing
may just not be possible :(

some notes on those hooks:

'sanity_check_bayes_is_untied': I'm not sure if this should still be exposed as
a hook, to be honest.  it probably needs a better name anyway.

'learner_get_implementation': this is definitely a hack; it's needed for the
test suite to work.  it probably doesn't need to be a public hook, since it can
be kept undocumented ;)

'learner_expire_old_training', 'learner_sync': backends for which this makes no
sense, can of course just "return 1" for these.

Comment 4 Justin Mason 2007-10-12 14:49:38 UTC
ok, check out
https://svn.apache.org/repos/asf/spamassassin/branches/bug-5293-pluginized-bayes
-- I've just checked in a first blast at this.  All the bayes*.t tests pass, so
it seems functional...
Comment 5 Justin Mason 2007-10-12 14:53:11 UTC
regarding the memory-footprint reduction when the Bayes plugin is off: here's ps
axuww output with bayes on:

jm        7651  40732:36992   0 pts/7     0:00           spamd child

and with the Bayes plugin commented:

jm        7678  39500:35812   0 pts/7     0:00           spamd child

not exactly massive.  still, there's probably ways to reduce that further...
Comment 6 Justin Mason 2007-12-10 06:42:31 UTC
I've now merged this up to current SVN trunk's HEAD, and added documentation to
Plugin.pm for the new APIs.  There are a few minor updates:

- 'sanity_check_bayes_is_untied': I renamed this to the much clearer
  'learner_close', which is, it turns out, what it should be used for. ;) It's
  just our basic impl that optimises things to such an extent that it's
  actually used as just a sanity check.

- 'learner_get_implementation' is less thorny now, since it is only used in the
  bayes-specific test cases.

- the {bayes} member in the Mail::SpamAssassin::BayesStore classes is now a ref
  to a Mail::SpamAssassin::Plugin::Bayes object, not a
  Mail::SpamAssassin::Bayes.

- there are duplicates of the Mail::SpamAssassin::BayesStore classes under
  Mail/SpamAssassin/OSBF/Store, used to provide storage for the experimental
  OSBF classifier (bug 5686).  if we get OSBF working ok, it would be nice
  to refactor to share more code.  no biggie.

- there are config settings for the new, experimental OSBF code commented
  out in Conf.pm.

Otherwise it's unchanged.  I propose to merge this into trunk and get rid
of this branch... speak up if you're not keen ;)

I'll attach the changes in patch form for review.

Comment 7 Justin Mason 2007-12-10 06:43:15 UTC
Created attachment 4203 [details]
svn diff for review

this is a (slightly rearranged) copy of

svn diff https://svn.apache.org/repos/asf/spamassassin/trunk
https://svn.apache.org/repos/asf/spamassassin/branches/bug-5293-pluginized-bayes
Comment 8 Justin Mason 2008-01-18 02:28:49 UTC
I'd forgotten about this -- I thought it was already in 3.3.0!  so I've merged
up to latest trunk HEAD, again, as of r613117; and then merged the Bayes-only
parts of that, omitting the bug 5686 (OSBF) parts, into trunk.  That's now in
trunk as

: jm 130...; svn commit -m "bug 5923: pluginize the Bayes subsystem, allowing
replacement of our probabilistic classifier with alternative learning plugins. 
Merged from the bug-5293-pluginized-bayes branch"
Sending        lib/Mail/SpamAssassin/Bayes.pm
Sending        lib/Mail/SpamAssassin/BayesStore/DBM.pm
Sending        lib/Mail/SpamAssassin/BayesStore/SQL.pm
Sending        lib/Mail/SpamAssassin/BayesStore.pm
Sending        lib/Mail/SpamAssassin/Conf.pm
Sending        lib/Mail/SpamAssassin/PerMsgLearner.pm
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending        lib/Mail/SpamAssassin/Plugin/Bayes.pm
Sending        lib/Mail/SpamAssassin/Plugin/WLBLEval.pm
Sending        lib/Mail/SpamAssassin/Plugin.pm
Sending        lib/Mail/SpamAssassin.pm
Sending        masses/bayes-testing/bayes-10pcv-driver
Sending        rules/23_bayes.cf
Sending        t/bayesdbm.t
Sending        t/bayesdbm_flock.t
Sending        t/bayessdbm.t
Sending        t/bayessdbm_seen_delete.t
Sending        t/bayessql.t
Transmitting file data ..................
Committed revision 613124.