Bug 3078 - RFE: "Dobly" noise reduction
Status: NEW
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Future
Assigned To: SpamAssassin Developer Mailing List
Depends on:
Blocks: 4560
Reported: 2004-02-24 11:56 UTC by Justin Mason
Modified: 2007-10-11 15:02 UTC (History)

Attachment Type Modified Status Actions Submitter/CLA Status
BNR patch application/octet-stream None Daniel Dai [NoCLA]

Description Justin Mason 2004-02-24 11:56:57 UTC
http://www.nuclearelephant.com/projects/dspam/dobly.html

This looks thoroughly implementable -- and probably useful.
Comment 1 Henry Stern 2004-02-24 12:39:50 UTC
It does look very interesting.  I can't say that I trust his reported results, 
though.

My PhD supervisor recommended that I do an evaluation of it.

Here is my plan:
Assemble a corpus of messages containing a bootstrap training set and a test 
set with the noisy tokens manually tagged.

Evaluate two things:

1. Compare the raw performance of a naive Bayesian classifier using and not 
using Dobly.

2. Evaluate how accurately Dobly removes the noise.

Another experiment that I would like to do:
Take a spam corpus without noise.  Artificially inject noise into the spam and 
evaluate how effectively Dobly removes the noise.  Also, evaluate the 
classification performance difference.
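
One way the noise-removal accuracy could be scored, as a minimal sketch: treat the manually tagged noisy tokens as ground truth and compute precision and recall of the tokens the filter removed. The function name and the use of token positions are assumptions made for illustration; they are not part of Dobly or of any existing tool.

```python
def noise_removal_scores(tagged_noise, removed):
    """Precision and recall of removed token positions against hand-tagged noise.

    Both arguments are collections of token positions within one message
    (an assumed representation).
    """
    tagged, removed = set(tagged_noise), set(removed)
    hits = len(tagged & removed)
    # With nothing removed (or nothing tagged) the ratio is undefined;
    # this sketch reports 1.0 in that case.
    precision = hits / len(removed) if removed else 1.0
    recall = hits / len(tagged) if tagged else 1.0
    return precision, recall
```

High precision with low recall would mean the filter rarely removes legitimate words but misses much of the noise; the reverse would mean it removes too aggressively.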

I will need some volunteers to help tag and parse the messages.

Any takers?
Comment 2 Jonathan Zdziarski 2005-01-16 19:59:19 UTC
I caught this URL from a referer in a web report. Version 2.0 of this algorithm
is now available at http://bnr.nuclearelephant.com if this is still of interest.
2.0 is a purely statistical implementation of this, complete with a GPL shared
library implementation.
Comment 3 Carl P. Corliss 2005-01-23 11:29:05 UTC
Just curious if this has any chance of being added - it would make life a bit
easier with amavisd-new if it was handled at a much lower level... :-)
Comment 4 Daniel Quinlan 2005-03-30 01:08:35 UTC
move bug to Future milestone (previously set to Future -- I hope)
Comment 5 Justin Mason 2006-05-26 10:17:27 UTC
This was a suggested idea for the Google Summer of Code 2006;
I'm adding it to the bugzilla for future use, and in case anyone feels
like implementing it.

Subject ID: spamassassin-dobly
Keywords: dobly, bayes, classifiers, perl
Description: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3078 :
investigate "Dobly" noise reduction a la http://bnr.nuclearelephant.com/ , in a
form that can be incorporated into SpamAssassin.  Benchmark results using
10-fold cross-validation.
Possible Mentors: Justin Mason (jm at jmason.org)
Comment 6 Daniel Dai 2007-06-14 23:23:56 UTC
Hi, all,
I will implement this Bayesian Noise Reduction (BNR) module as my Google Summer
of Code project. If you do not yet know what BNR is, please refer to Jonathan’s
original paper (http://bnr.nuclearelephant.com/). Here I present my design
and ask for your opinions. There are a couple of design choices to make. My
choices are based on the fact that BNR is only needed by the Bayes learner, and
my primary goals are to keep configuration simple and the performance penalty low.

Here is my general design for your review:

1. Where to hook the BNR filter?
BNR exists to improve Bayes performance, so I will confine BNR to the Bayes
module. Other modules will not see the result of BNR filtering.
Before a message is passed to the Bayes scanner, it is parsed into several parts.
The visible body (all text of text/plain and the visible part of text/html, with
HTML tags stripped), the invisible body (the hidden or “hard to see” part of
text/html, with HTML tags stripped), the URL list and all headers are presented
to the Bayes scanner. The visible body is a list of the words visible to the
user, in their original order. The BNR consistency check, which checks the
consistency of each word with its neighboring words, will help find
out-of-context words. Whether BNR helps with the invisible body is doubtful,
since the invisible body usually consists of short sentences or phrases
scattered around the email. The other two parts, URLs and headers, are not
suitable for BNR filtering. I will add the BNR filtering code after the visible
body (and the invisible body?) is tokenized. The Bayes learner will then see
only purified tokens. Note also that in Jonathan’s original work, HTML tags are
not removed before BNR. SA is different: Bayes in SA does not deal with HTML
tags.
Context pattern learning will be added to the Bayes learning process; there
will not be a separate context pattern learning process. Every time Bayes
updates token counts (sa-learn or auto-learn), it will also update context
patterns. Likewise, forgetting context patterns is tied to the token forget
process.
Context patterns will not expire, since the database will hold at most
20*20*20=8000 rows. Every time we purge, back up, restore or dump the Bayes
database, we do the same to the context patterns.

2. How to perform the noise reduction?
Thanks to Jonathan’s open source release of libbnr, we know the core BNR
algorithm in C is about 80 lines. It should be easy to adapt to SA. The tricky
part is how to pass context patterns and token statistics to BNR.

3. What are the new configuration items?
I will try to keep the number of new params small. Here are the new params:
• bnr_window_radius (default: 0.25): the window around 0.5 that BNR considers
uninteresting. The default of 0.25 means BNR ignores any context with a
p-value between 0.25 and 0.75.
• bnr_token_radius (default: 0.3): the maximum distance between a token's
p-value and the context's p-value that BNR will tolerate; a token farther away
is filtered out. E.g., if we have a window with p-values [0.10 0.60 0.70] and
the p-value of this context is 0.5, then the first word is filtered out.
• use_bnr: whether or not to use BNR.
• bayes_min_ham_num / bayes_min_spam_num: existing Bayes configuration items.
For simplicity, BNR will use them as the starting point for learning, because
BNR cannot start until the Bayes token values have become relatively stable.
• bnr_min_ham_num / bnr_min_spam_num: when BNR gets involved in scanning.
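
The two radius parameters can be read as two simple distance checks. The sketch below is only an illustration of that reading: the function names are hypothetical, and how the patch actually combines the two checks is not stated here, so they are kept separate.

```python
def is_interesting_context(context_p, window_radius=0.25):
    """True when the context's p-value lies outside 0.5 +/- window_radius."""
    return abs(context_p - 0.5) > window_radius

def tokens_outside_radius(window, context_p, token_radius=0.3):
    """Indices of tokens whose p-value is more than token_radius from context_p."""
    return [i for i, p in enumerate(window) if abs(p - context_p) > token_radius]
```

For the window [0.10 0.60 0.70] with a context p-value of 0.5, only the first token is more than 0.3 away, so only index 0 is reported. Note that a context at exactly 0.5 falls inside the default uninteresting range of bnr_window_radius, so that example illustrates the token radius on its own.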

4.	Which modules will be affected?
•SQL: new context pattern table
   CREATE TABLE context_pattern (
     id int(11) NOT NULL default '0',
     pattern char(11) NOT NULL default '',
     spam_count int(11) NOT NULL default '0',
     ham_count int(11) NOT NULL default '0',
     PRIMARY KEY  (id, pattern)
   );
•BayesStore, including
    o	BayesStore.pm (interface)
    o	BayesStore::DBM.pm (DBM implementation)
    o	BayesStore::SQL.pm (general SQL implementation)
    o	BayesStore::MySQL.pm (MySQL)
    o	BayesStore::PgSQL.pm (PostgreSQL)
    Modified functions:
    o	clear_database (also clears context patterns)
    o	backup_database (also backs up context patterns)
    o	restore_database (also restores context patterns)
    o	dump_db_toks (also dumps context patterns)
    New functions:
    o	multi_context_count_change (save multiple context patterns)
    o	_put_context_pattern (save context pattern)
    o	context_get (retrieve context pattern)
    o	context_get_all (retrieve multiple context patterns)

    •Bayes.pm, sub tokenize
    Add the BNR code right after the body part of a message has been tokenized,
covering scan, learn and forget. It seems more proper to me to add a new plugin,
such as "after_bayes_tokenize", and add all related code to this new plugin.
The core BNR algorithm will be integrated into this part.
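
To make the storage side concrete, here is a minimal sketch of the context pattern table and two of the proposed functions, written against an in-memory SQLite database rather than the real BayesStore backends. The function names mirror the proposal, but the signatures, the reading of id as a per-user identifier, and the update-then-insert logic are all assumptions for illustration, not the code in the attached patch.

```python
import sqlite3

def open_store():
    """Create an in-memory database with the proposed context_pattern table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE context_pattern ("
        " id INTEGER NOT NULL DEFAULT 0,"
        " pattern CHAR(11) NOT NULL DEFAULT '',"
        " spam_count INTEGER NOT NULL DEFAULT 0,"
        " ham_count INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (id, pattern))"
    )
    return db

def multi_context_count_change(db, user_id, patterns, spam_delta, ham_delta):
    """Add the deltas to each pattern's counts, creating rows as needed."""
    for pattern in patterns:
        cur = db.execute(
            "UPDATE context_pattern SET spam_count = spam_count + ?,"
            " ham_count = ham_count + ? WHERE id = ? AND pattern = ?",
            (spam_delta, ham_delta, user_id, pattern),
        )
        if cur.rowcount == 0:  # no existing row for this pattern yet
            db.execute(
                "INSERT INTO context_pattern (id, pattern, spam_count, ham_count)"
                " VALUES (?, ?, ?, ?)",
                (user_id, pattern, spam_delta, ham_delta),
            )
    db.commit()

def context_get(db, user_id, pattern):
    """Return (spam_count, ham_count) for a pattern, or (0, 0) if unseen."""
    row = db.execute(
        "SELECT spam_count, ham_count FROM context_pattern"
        " WHERE id = ? AND pattern = ?",
        (user_id, pattern),
    ).fetchone()
    return row if row else (0, 0)
```

Learning a message would call multi_context_count_change with a positive delta, and forgetting it would call the same function with a negative delta, which keeps context counts in step with token counts.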

Any comments and opinions are appreciated.

Thank you
Jianyong Dai
Comment 7 Daniel Dai 2007-08-20 21:03:22 UTC
Created attachment 4106
BNR patch

This package includes patches for the BNR module, targeting SpamAssassin 3.2.0.
Comment 8 Jonathan Zdziarski 2007-08-20 21:09:32 UTC
Is libbnr being used? If so, I was under the impression that the GPL was incompatible with the Apache 
license.
Comment 9 Daniel Dai 2007-08-20 21:15:19 UTC
I didn't use libbnr; I implemented this module in Perl. Thank you.
Comment 10 Daniel Dai 2007-08-20 21:36:39 UTC
I have finished the Bayes Noise Reduction module and attached the patch files
to this bug. You can find documentation and performance data at
http://docs.google.com/Doc?id=dfsk849w_13d4zm72

Any comments are welcome.

Thank you
Jianyong Dai
Comment 11 Justin Mason 2007-10-11 15:02:33 UTC
Thanks for the code!

The test results seem quite bad, unfortunately :(  It might be worth testing it
with more recent spam, however; the stuff in the public corpus is quite old, and
dates from before widespread use of "Bayes poisoning" text in spam.