SA Bugzilla – Bug 3078
RFE: "Dobly" noise reduction
Last modified: 2022-03-06 16:25:03 UTC
http://www.nuclearelephant.com/projects/dspam/dobly.html This looks thoroughly implementable -- and probably useful.
It does look very interesting. I can't say that I trust his reported results, though. My PhD supervisor recommended that I do an evaluation of it. Here is my plan:

Assemble a corpus of messages containing a bootstrap training set and a test set with the noisy tokens manually tagged. Evaluate two things:
1. Compare the raw performance of a naive Bayesian classifier with and without Dobly.
2. Evaluate how accurately Dobly removes the noise.

Another experiment that I would like to do: take a spam corpus without noise, artificially inject noise into the spam, and evaluate how effectively Dobly removes it. Also evaluate the difference in classification performance.

I will need some volunteers to help tag and parse the messages. Any takers?
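For the second measurement in the plan above, one possible way to score noise removal against the manually tagged tokens is precision and recall. This is a minimal sketch, assuming tokens are identified by their position in the message; the function name `noise_removal_scores` and the sample positions are hypothetical and are not taken from Dobly or SpamAssassin.

```python
from typing import Set, Tuple


def noise_removal_scores(tagged_noise: Set[int], removed: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) of a noise filter on one message.

    tagged_noise: token positions a human marked as noise (assumed ground truth).
    removed: token positions the filter under test actually dropped.
    """
    correctly_removed = len(tagged_noise & removed)
    # Precision: what fraction of the dropped tokens really were noise.
    precision = correctly_removed / len(removed) if removed else 0.0
    # Recall: what fraction of the tagged noise was dropped.
    recall = correctly_removed / len(tagged_noise) if tagged_noise else 0.0
    return precision, recall


# Hypothetical message: 4 tokens tagged as noise, the filter dropped 3 tokens.
precision, recall = noise_removal_scores({1, 4, 7, 9}, {4, 7, 8})
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same scoring would apply to the injected-noise experiment, where the injected positions serve as the ground truth instead of manual tags.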
I caught this URL from a referer in a web report. Version 2.0 of this algorithm is now available at http://bnr.nuclearelephant.com if this is still of interest. 2.0 is a purely statistical implementation of this, complete with a GPL shared library implementation.
Just curious whether this has any chance of being added - it would make life a bit easier with amavisd-new if it was handled at a much lower level... :-)
move bug to Future milestone (previously set to Future -- I hope)
This was a suggested idea for the Google Summer of Code 2006; I'm adding it to the bugzilla for future use, and in case anyone feels like implementing it.

Subject ID: spamassassin-dobly
Keywords: dobly, bayes, classifiers, perl
Description: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3078 : investigate "Dobly" noise reduction a la http://bnr.nuclearelephant.com/ , in a form that can be incorporated into SpamAssassin. Benchmark results using 10-fold cross-validation.
Possible Mentors: Justin Mason (jm at jmason.org)
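As a rough illustration of the 10-fold cross-validation mentioned in the description, the sketch below only produces the train/test index splits. It assumes the corpus has already been shuffled and does not stratify by ham/spam; `k_fold_splits` is a hypothetical helper, not part of SpamAssassin.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once per fold."""
    if not 2 <= k <= n_items:
        raise ValueError("need 2 <= k <= n_items")
    for fold in range(k):
        # Item i belongs to test fold i % k; everything else is training data.
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Example: 25 messages, 10 folds; every message is tested exactly once.
for train, test in k_fold_splits(25):
    print(len(train), len(test))
```

For each fold, the classifier would be trained from scratch on the training indices and scored on the test indices, with the per-fold results averaged at the end.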
Hi all,

I will implement this Bayesian Noise Reduction (BNR) module as my Google Summer of Code project. If you do not yet know what BNR is, please refer to Jonathan’s original paper (http://bnr.nuclearelephant.com/). Here I present my design and wait for your opinions. There are of course a couple of design choices. Mine are based on the fact that BNR is only needed by the Bayes learner, and my primary goals are to keep configuration simple and the performance penalty low. Here is my general design for your review:

1. Where to hook the BNR filter?

BNR exists to improve Bayes performance, so I will restrict BNR to the Bayes module. Other modules will not see the result of BNR filtering.

Before a message is passed to the Bayes scanner, it is parsed into several parts. The visible body (all text of text/plain and the visible part of text/html, with HTML tags stripped), the invisible body (the hidden or “hard to see” part of text/html, with HTML tags stripped), the URL list and all headers are presented to the Bayes scanner.

The visible body is a list of words that are visible to the user, in their original order. The BNR consistency check, which checks the consistency of each word with its neighboring words, will help to find out-of-context words. Whether BNR helps on the invisible body is doubtful, since the invisible body usually consists of short sentences or phrases scattered around the email. The other two parts, URLs and headers, are not applicable to BNR filtering.

I will add the BNR filtering code after the visible body (and the invisible body?) is tokenized. The Bayes learner will then only see purified tokens. Note also that in Jonathan’s original work, HTML tags are not removed before BNR. SA is different: Bayes in SA does not deal with HTML tags.

Context pattern learning will be added to the Bayes learning process; there will not be a separate context pattern learning process. Every time Bayes updates token counts (sa-learn or auto-learn), it will also update context patterns. Likewise, forgetting context patterns is tied to forgetting tokens. There will be no expiry of context patterns, since at most 20*20*20=8000 rows will be in the database. Every time we purge, back up, restore or dump the Bayes database, we do the same to the context patterns as well.

2. How to perform the noise reduction?

Thanks to Jonathan’s open source release of libbnr, the core BNR algorithm in C is about 80 lines. It should be easy to adapt to SA. The tricky part is how to pass context patterns and token statistics to BNR.

3. What are the new configuration items?

I will try to keep the number of new params small. Here are the new params:
• bnr_window_radius (default: 0.25): the window around 0.5 that BNR considers uninteresting. The default of 0.25 means BNR will not care about contexts that have a p-value between 0.25 and 0.75.
• bnr_token_radius (default: 0.3): the maximum distance BNR will tolerate between a token’s p-value and the p-value of its context; a token farther away is filtered out. E.g., if we have a window with p-values [0.10 0.60 0.70] and the p-value of this context is 0.5, then the first word is filtered out.
• use_bnr: whether or not to use BNR.
• bayes_min_ham_num / bayes_min_spam_num: existing Bayes configuration items. For simplicity, BNR will use them as the starting point for learning, because BNR cannot start until the Bayes token values have become relatively stable.
• bnr_min_ham_num / bnr_min_spam_num: when BNR gets involved in scanning.

4. Which modules will be affected?

• SQL: new context pattern table

    CREATE TABLE context_pattern (
      id int(11) NOT NULL default '0',
      pattern char(11) NOT NULL default '',
      spam_count int(11) NOT NULL default '0',
      ham_count int(11) NOT NULL default '0',
      PRIMARY KEY (id, pattern)
    )

• BayesStore, including:
  o BayesStore.pm (interface)
  o BayesStore::DBM.pm (DBM implementation)
  o BayesStore::SQL.pm (general SQL implementation)
  o BayesStore::MySQL.pm (MySQL)
  o BayesStore::PgSQL.pm (PostgreSQL)
  Modified functions:
  o clear_database (clear context patterns also)
  o backup_database (back up context patterns also)
  o restore_database (restore context patterns also)
  o dump_db_toks (dump context patterns also)
  New functions:
  o multi_context_count_change (save multiple context patterns)
  o _put_context_pattern (save a context pattern)
  o context_get (retrieve a context pattern)
  o context_get_all (retrieve multiple context patterns)
• Bayes.pm, sub tokenize
  Add the necessary code right after the body part of a message has been tokenized, covering scan, learn and forget. It seems more proper to me to add a new plugin, such as "after_bayes_tokenize", and add all related code to this new plugin. The core BNR algorithm will be integrated into this part.

Any comments and opinions are appreciated.

Thank you
Jianyong Dai
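The two radius parameters from item 3 of the design above can be illustrated with a small sketch. It is not the patch's code or libbnr's: the function names are hypothetical, the defaults are copied from the proposal, and the two checks are kept separate because the comment does not say how they are chained.

```python
from typing import List, Sequence


def context_is_interesting(context_p: float, window_radius: float = 0.25) -> bool:
    """True when the context's p-value lies outside the band around 0.5."""
    return abs(context_p - 0.5) > window_radius


def tokens_outside_radius(token_ps: Sequence[float], context_p: float,
                          token_radius: float = 0.3) -> List[int]:
    """Return indices of tokens whose p-value is farther than token_radius
    from the context's p-value."""
    return [i for i, p in enumerate(token_ps) if abs(p - context_p) > token_radius]


# The example from the comment: window [0.10 0.60 0.70], context p-value 0.5.
print(tokens_outside_radius([0.10, 0.60, 0.70], 0.5))  # only index 0 exceeds the radius
print(context_is_interesting(0.5), context_is_interesting(0.9))
```

Read literally, the default bnr_window_radius would treat a context at 0.5 as uninteresting, so the 0.5 example is probably best taken as illustrating the token-radius test on its own.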
Created attachment 4106
BNR patch

This package includes patches for the BNR module, targeting SpamAssassin 3.2.0
Is libbnr being used? If so, I was under the impression that the GPL was incompatible with the Apache license.
I didn't use libbnr; I implemented this module in Perl. Thank you
I have finished the Bayes Noise Reduction module and attached the patch files to bugzilla. You can find documentation and performance data at http://docs.google.com/Doc?id=dfsk849w_13d4zm72

Any comments are welcome.

Thank you
Jianyong Dai
Thanks for the code! The test results seem quite bad, unfortunately :( It might be worth testing it with more recent spam, however; the stuff in the public corpus is quite old, and dates from before widespread use of "Bayes poisoning" text in spam.
Closing ancient stale bug.