Bug 5861 - Bayes problem (too common tokens etc)
Summary: Bayes problem (too common tokens etc)
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 3.2.4
Hardware: Other All
Importance: P5 normal
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
Reported: 2008-03-21 07:52 UTC by Henrik Krohns
Modified: 2019-07-08 07:10 UTC



Attachments:
Attachment | Type | Status | Submitter/CLA Status
Debug output of message | text/plain | None | Henrik Krohns [HasCLA]
Mark only presence of DKIM/DomainKey-Signature on Bayes | text/plain | None | Henrik Krohns [HasCLA]

Description Henrik Krohns 2008-03-21 07:52:31 UTC
Created attachment 4277
Debug output of message

Hi, please see the attached debug output of bayes and message.

I have a really hard time learning these kinds of messages into Bayes. The problem is that when you get mostly ham from gmail.com or some other common source, you can't really learn them well.

I have already done:

bayes_ignore_header Received  (too many gmail tokens)
bayes_ignore_header DKIM-Signature  (too many gmail tokens)
bayes_ignore_header DomainKey-Signature  (too many gmail tokens)

And still I get a mere BAYES_50, there are too many gmail tokens left!

How about some option like "bayes_ignore_token /gmail/"? Is there anything coming up in 3.3.0 that might help the cause?

Another funny thing, I'm not sure why my amavis mail_id (L0YXx-simPYV) is learned as a token? What good would it do, since it's always random?

Is there any learning on the attachment filenames? I don't see any tokens.
Comment 1 Henrik Krohns 2008-03-21 09:46:01 UTC
This is the crude patch that I'm testing for now..

--- Bayes.pm.orig       Fri Mar 21 18:44:41 2008
+++ Bayes.pm    Fri Mar 21 18:46:39 2008
@@ -329,6 +329,7 @@
   my %tokens;
   foreach my $token (@tokens) {
     next unless length($token); # skip 0 length tokens
+    next if $token =~ /(?:gmail|yahoo|hotmail)/; # skip too hammy tokens
     $tokens{substr(sha1($token), -5)} = $token;
   }
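
For readers who do not work in Perl, the patched loop can be sketched in Python as follows. This is only an illustration of the idea, not SpamAssassin code: the `IGNORE_PATTERNS` list and the `filter_tokens` name are hypothetical, standing in for the "bayes_ignore_token" option proposed in the description, which does not exist. The key mirrors `substr(sha1($token), -5)`, i.e. the last 5 bytes of the binary SHA-1 digest.

```python
import hashlib
import re

# Hypothetical ignore list, standing in for the proposed (non-existent)
# "bayes_ignore_token" option.
IGNORE_PATTERNS = [re.compile(r"gmail|yahoo|hotmail")]


def filter_tokens(tokens, ignore_patterns=IGNORE_PATTERNS):
    """Return {5-byte SHA-1 suffix: token}, skipping empty and ignored tokens."""
    kept = {}
    for token in tokens:
        if not token:
            continue  # skip 0 length tokens
        if any(p.search(token) for p in ignore_patterns):
            continue  # skip tokens considered too hammy
        # Last 5 bytes of the raw digest, like substr(sha1($token), -5) in Perl.
        key = hashlib.sha1(token.encode("utf-8")).digest()[-5:]
        kept[key] = token
    return kept
```

A substring match like this is deliberately crude: it also drops any body word that merely contains "gmail", which is one reason such a change would need benchmarking.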

Comment 2 Henrik Krohns 2008-03-21 09:56:54 UTC
Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the default ignore list?

[29942] dbg: bayes: token 'HDKIM-Signature:beta' => 0.00806992358626491
[29942] dbg: bayes: token 'HDomainKey-Signature:beta' => 0.00893960579973894
[29942] dbg: bayes: token 'HDKIM-Signature:mime-version' => 0.0102741290765491
[29942] dbg: bayes: token 'HDKIM-Signature:received' => 0.0102965519261728
[29942] dbg: bayes: token 'HDKIM-Signature:sk:domaink' => 0.0103000102852899
[29942] dbg: bayes: token 'HDKIM-Signature:relaxed' => 0.0103817043841538
[29942] dbg: bayes: token 'HDKIM-Signature:rsa-sha256' => 0.0111528115306108
[29942] dbg: bayes: token 'HDomainKey-Signature:content-type' => 0.0117297689443074
[29942] dbg: bayes: token 'HDomainKey-Signature:mime-version' => 0.0117316778713491
[29942] dbg: bayes: token 'HDomainKey-Signature:subject' => 0.0120925906621403
[29942] dbg: bayes: token 'HDomainKey-Signature:message-id' => 0.0121248330712668
[29942] dbg: bayes: token 'HDomainKey-Signature:sk:uuyE1wR' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:TMZO4KJ' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:oiyxr0w' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:i0UgPfZ' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:peWuJ2k' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:Hyhy38j' => 0.986543689320388
[29942] dbg: bayes: token 'HDKIM-Signature:sk:z8grhBD' => 0.986543689320388
[29942] dbg: bayes: token 'HDomainKey-Signature:sk:WCcy0Q8' => 0.986543689320388

Or then make it more intelligent and skip the static ones on the top, since they are generally the same anywhere.

Comment 3 Theo Van Dinter 2008-03-21 11:50:19 UTC
(In reply to comment #2)
> Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the
> default ignore list?

+1

There's lots of useful header information, but cryptographic signatures aren't included in that imo.
Comment 4 Justin Mason 2008-03-22 05:54:11 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Maybe it should be wise to add DKIM-Signature and DomainKey-Signature to the
> > default ignore list?
> 
> +1
> 
> There's lots of useful header information, but cryptographic signatures aren't
> included in that imo.

However the presence of the header might be, for some people, or some tokens in those headers.  For example, you are *DEFINITELY* losing good data from the Received: headers, I can guarantee that.

It's important that we don't make changes to the ignore list without benchmarking its effects using 10-fold cross validation testing.  One thing I've found, time and time again, is that Bayes probability combining is a lot smarter than you're giving it credit for -- relatively-weak "ham" or "spam" probabilities will cancel each other out, allowing stronger tokens to have an effect quite nicely.  Things are not always as simple as they may appear in isolation.

(anyway, having said that, if someone wants to do a 10-fold cross-validation run testing ignoring the DK/DKIM sig headers, go ahead.)
Comment 5 Henrik Krohns 2008-03-22 06:29:50 UTC
So is there something that can help with these short messages, that don't create many tokens? When there aren't enough body tokens, by default all those hammy header tokens are sure to prevent correct scoring. It forces me to ignore such headers.

Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored it since I can't figure out any purpose to bloat my db.

Comment 6 Henrik Krohns 2008-04-09 23:17:02 UTC
Created attachment 4293
Mark only presence of DKIM/DomainKey-Signature on Bayes
Comment 7 Justin Mason 2008-04-10 01:48:34 UTC
(In reply to comment #5)
> So is there something that can help with these short messages, that don't
> create many tokens? When there aren't enough body tokens, by default all those
> hammy header tokens are sure to prevent correct scoring. It forces me to ignore
> such headers.

Training on error should help -- train mostly on FPs and FNs from now on.

> Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored
> it since I can't figure out any purpose to bloat my db.

Consider a site with 2 MXes -- a primary and secondary MX.  both are listed as
IPs in internal_networks.  For some reason, spammers tend to like sending spam
via the secondary.  The presence of that MX's IP in the
'X-Spam-Relays-Internal' hdr therefore becomes a spam sign, for that site.

If, on the other hand, a token appears equally in both ham and spam:

  - its P value will tend towards the middle ground: 0.5
  - this means that it will fall short of $MIN_PROB_STRENGTH:

    # Should we ignore tokens with probs very close to the middle ground (.5)?
    # tokens need to be outside the [ .5-MPS, .5+MPS ] range to be used.
    our $MIN_PROB_STRENGTH = 0.346;

  - tokens inside that range are unused

  - unused tokens don't have their access times updated, and therefore
    are expired from the Bayes db.
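
The strength test described above can be illustrated with a small Python sketch. The 0.346 constant and the "outside the [ .5-MPS, .5+MPS ] range" rule come from the quoted source comment; the function name `is_significant` is made up for this example, and the expiry side effect is not modelled.

```python
MIN_PROB_STRENGTH = 0.346  # value quoted from the Perl source above


def is_significant(prob, mps=MIN_PROB_STRENGTH):
    """True if prob lies outside [0.5 - mps, 0.5 + mps] and would be used."""
    return abs(prob - 0.5) > mps


# Token probabilities of the kind seen in Bayes debug output:
for p in (0.5, 0.2, 0.151844260253207, 0.986543689320388):
    print(p, "used" if is_significant(p) else "ignored")
```

With this threshold only probabilities below about 0.154 or above about 0.846 count, so a token seen equally in ham and spam never influences the score.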

thanks for the patch -- I'll apply it.  we should probably be running 
a 10-fold cross validation, but I'm a bit busy and I think it's a good
idea as a hunch. ;)

: jm 573...; svn commit -m "bug 5861: add DKIM-Signature and DomainKey-Signature to the set of headers whose contents are ignored for Bayes; their presence is marked, however.  thanks to Henrik Krohns" lib/Mail/SpamAssassin/Plugin/Bayes.pm
Sending        lib/Mail/SpamAssassin/Plugin/Bayes.pm
Transmitting file data .
Committed revision 646688.
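
As a rough illustration of what "contents are ignored, presence is marked" means for tokenization, here is a Python sketch. It is not the committed Perl change: the `H<Header>:<word>` token shape follows the debug output in comment 2, while the `present` marker, the whitespace splitting, and the names `PRESENCE_ONLY` and `header_tokens` are assumptions made for the example.

```python
# Headers whose contents are dropped; only their presence yields a token.
PRESENCE_ONLY = {"dkim-signature", "domainkey-signature"}


def header_tokens(name, value):
    """Tokenize one header into 'H<name>:<word>' tokens."""
    if name.lower() in PRESENCE_ONLY:
        # Hypothetical marker token: one token regardless of the signature body.
        return [f"H{name}:present"]
    return [f"H{name}:{word}" for word in value.split()]
```

Under this scheme a DKIM-Signature header contributes one token instead of a dozen near-identical hammy ones plus several random signature fragments.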

Comment 8 Henrik Krohns 2008-04-10 03:50:18 UTC
I'm not comfortable in closing this bug yet.

(In reply to comment #7)
> (In reply to comment #5)
> > So is there something that can help with these short messages, that don't
> > create many tokens? When there aren't enough body tokens, by default all those
> > hammy header tokens are sure to prevent correct scoring. It forces me to ignore
> > such headers.
> 
> Training on error should help -- train mostly on FPs and FNs from now on.

How can this help? If it wasn't obvious, of course I trained it. It didn't help.

A mail from gmail had so many hammy tokens, it is impossible to train without other more specific tokens.

Isn't there more stuff you can create tokens from, like filenames? What if you get a mass of spam from gmail, containing only a .doc attachment and no body? It will still score BAYES_50 or something, all the hammy gmail tokens will prevent better scores!! I demonstrated this already in my first post. At least my DKIM patch should help remove some of the excess tokens. I'll try to test what effect it has.

I know you guys are busy, but I think this isn't something to just shrug off. Or is it just something that is rare and "gotta live with it"? Is there any interest from your side in enhancing the Bayes engine or does it have to come from contributions? You are the ones that know the system best.


> > Also what's the deal with saving those X-Spam-Relays-Internal tokens? I ignored
> > it since I can't figure out any purpose to bloat my db.
> 
> Consider a site with 2 MXes -- a primary and secondary MX.  both are listed as
> IPs in internal_networks.  For some reason, spammers tend to like sending spam
> via the secondary.  The presence of that MX's IP in the
> 'X-Spam-Relays-Internal' hdr therefore becomes a spam sign, for that site.
>

There is still at least one question unanswered. Why is the _unique_ mail id recorded as a token? I understand IP, but not that.

If you don't have time, then please answer when you have it. It seems you just try to blaze through as fast as you can.

I will try to analyze and help with this, but I could really use some insightful input.

Comment 9 Justin Mason 2008-04-10 04:07:16 UTC
I agree, attachment filenames would be a great source of tokens.  *adding* new tokens isn't likely to be a problem.
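
A minimal Python sketch of what filename-derived tokens could look like, using the standard `email` package. Nothing here reflects an existing SpamAssassin feature: the `F*name:` and `F*ext:` prefixes and the `filename_tokens` function are hypothetical, chosen only to show the idea.

```python
from email import message_from_string


def filename_tokens(raw_message):
    """Return hypothetical Bayes tokens for each attachment filename."""
    msg = message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        filename = part.get_filename()  # None for parts without a filename
        if not filename:
            continue
        name = filename.lower()
        tokens.append("F*name:" + name)
        if "." in name:
            # The extension alone may generalise better than the full name.
            tokens.append("F*ext:" + name.rsplit(".", 1)[1])
    return tokens
```

For a body-less message carrying only a .doc attachment, this would give the classifier at least two extra tokens to learn from.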

If you would like to see this stuff changed, here's what to do -- run a ten-fold cross-validation that demonstrates an improvement in accuracy:

http://svn.apache.org/repos/asf/spamassassin/trunk/masses/bayes-testing/

that's how we measure the effects of Bayes tweaks.  stuff that performs well in that testing is MUCH more likely to get in.
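
For anyone unfamiliar with the technique, the splitting step of a ten-fold cross-validation can be sketched in Python as below. This is a generic illustration, not the scripts in masses/bayes-testing; the function name and the fixed seed are assumptions. Each message lands in the test set exactly once, and the classifier is trained on the other nine folds each time.

```python
import random


def ten_fold_splits(messages, k=10, seed=0):
    """Yield (train, test) pairs; every message is tested exactly once."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs comparable
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test
```

A tweak is then judged by comparing false positive and false negative counts summed over all ten test folds, with and without the change.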
Comment 10 Henrik Krohns 2008-04-10 04:16:58 UTC
Ok I'll have a look at that.. might take a while as I need to create some good corpus first.

Comment 11 Justin Mason 2008-04-10 04:25:13 UTC
(In reply to comment #10)
> Ok I'll have a look at that.. might take a while as I need to create some good
> corpus first.

if you like, I can share one for you to use...
Comment 12 Henrik Krohns 2009-03-31 13:37:00 UTC
I stumbled on a curious FN. Seems wonky that a header or two can dominate the whole scoring. Just food for thought, hopefully some year I have time to do deeper tests..

Headers:

X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.7400:2.4.4,1.2.40,4.0.166 definitions=2009-03-28_05:2009-03-27,2009-03-28,2009-03-27 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=26 spamscore=26 ipscore=0 phishscore=99 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=
 mlx engine=5.0.0-0811170000 definitions=main-0903280069

Parse:

dbg: bayes: tok_get_all: token count: 147
dbg: bayes: token HX-Proofpoint-Virus-Version:1.12.7400 => 0.00190587422573309
dbg: bayes: token HX-Proofpoint-Virus-Version:fsecure => 0.00434363906304506
dbg: bayes: token HX-Proofpoint-Virus-Version:vendor => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:definitions => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:engine => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:signatures => 0.00458230337259314
dbg: bayes: token HX-Proofpoint-Virus-Version:sk:2.4.4,1 => 0.00458803535339196
dbg: bayes: token HX-Proofpoint-Virus-Version:sk:2009-03 => 0.00560260989702483
dbg: bayes: token Hx-mimeole:Exchange => 0.00569605506742619
dbg: bayes: token Hx-mimeole:Produced => 0.00589653854308459
dbg: bayes: token Hx-mimeole:Microsoft => 0.00589653854308459
dbg: bayes: token Hx-mimeole:V6.5 => 0.00700724277522037
dbg: bayes: token HContent-class:content-classes => 0.00840337534766064
dbg: bayes: token HContent-class:urn => 0.00840353709745381
dbg: bayes: token HContent-class:message => 0.00856128997205224
dbg: bayes: token HX-Proofpoint-Spam-Details:sk:5.0.0-0 => 0.0107477511672061
dbg: bayes: token D*live.com => 0.98880239141895
dbg: bayes: token HX-Proofpoint-Spam-Details:mlx => 0.0132332664149436
dbg: bayes: token HX-Proofpoint-Spam-Details:adultscore => 0.0132332664149436
dbg: bayes: token HX-Proofpoint-Spam-Details:spam => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:ipscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:phishscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:spamscore => 0.0134194375404317
dbg: bayes: token HX-Proofpoint-Spam-Details:bulkscore => 0.0137087277138293
dbg: bayes: token sk:helpdes => 0.015616392737519
dbg: bayes: token HX-Proofpoint-Spam-Details:rule => 0.0158325080420237
dbg: bayes: token HX-Proofpoint-Spam-Details:notspam => 0.0158599391972407
dbg: bayes: token HX-Proofpoint-Spam-Details:score => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:definitions => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:adjust => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:policy => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:reason => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:engine => 0.0171270007084152
dbg: bayes: token HX-Proofpoint-Spam-Details:default => 0.0171566355845046
dbg: bayes: token HX-Proofpoint-Spam-Details:classifier => 0.0173669864520215
dbg: bayes: token mailbox => 0.0262678473925646
dbg: bayes: token Username => 0.0444462667192503
dbg: bayes: token Unit => 0.107186995611258
dbg: bayes: token increase => 0.88997938686673
dbg: bayes: token username => 0.110283938874452
dbg: bayes: token increased => 0.869317730990728
dbg: bayes: token size => 0.84902152925483
dbg: bayes: token message => 0.151844260253207
dbg: bayes: score = 1.93606242149258e-12
Comment 13 Matt Kettler 2009-03-31 18:37:51 UTC
Are the X-Proofpoint-* headers in all of your mail (i.e., added by an upstream server)?





Comment 14 Henrik Krohns 2009-03-31 20:59:26 UTC
I don't see how it's relevant, but no. It's from some US uni.

The point is that there probably should be some limit on how many tokens to get from a header. If I learn that as spam, all ham mail containing those headers will be strongly biased to spam (an uneducated, but logical guess).
Comment 15 Matt Kettler 2009-03-31 21:27:53 UTC
The question was relevant because a header that is in all of your mail, and has a lot of unchanging text in it, should have a local ignore.

I do see your point on capping the number of tokens per header.


Comment 16 Henrik Krohns 2009-03-31 21:40:11 UTC
Yes you have a general point, but not much relevance to the problem.

I wonder what would be the best way to fix it. Select a few of the highest and lowest scoring tokens from a single header? I guess some validation runs would be needed..
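
One way the "few highest and lowest per header" idea could look, as an untested-in-practice Python sketch. The function name, the default of two tokens per end, and the heuristic that a header token starts with "H" and contains ":" (as in the debug output above) are all assumptions; body tokens that happen to match that shape would be misclassified.

```python
from collections import defaultdict


def cap_header_tokens(token_probs, n=2):
    """Keep at most the n lowest and n highest probabilities per header."""
    by_header = defaultdict(list)
    kept = {}
    for token, prob in token_probs.items():
        if token.startswith("H") and ":" in token:
            header = token[1:].split(":", 1)[0]
            by_header[header].append((token, prob))
        else:
            kept[token] = prob  # non-header tokens pass through untouched
    for items in by_header.values():
        items.sort(key=lambda pair: pair[1])
        selected = items if len(items) <= 2 * n else items[:n] + items[-n:]
        kept.update(selected)
    return kept
```

Whether a cap like this improves accuracy is exactly the kind of question a validation run would have to answer.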
Comment 17 Justin Mason 2009-04-01 01:44:55 UTC
(In reply to comment #14)
> I don't see how it's relevant, but no. It's from some US uni.
> 
> The point is that there probably should be some limit on how many tokens to get
> from a header. If I learn that as spam, all ham mail containing those headers
> will be strongly biased to spam (an uneducated, but logical guess).

I think you're overestimating its effects on the chi-square probability combining algorithm; actually, there's a good chance those values won't skew it much, assuming there are stronger tokens found elsewhere.
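
For reference, chi-square combining in the Robinson/Fisher style (as popularised by SpamBayes) can be sketched in Python as follows. This is a textbook version under that assumption, not a copy of SpamAssassin's implementation, and the function names are invented for the example. Very long token lists can underflow `exp` here, which a production version would need to handle.

```python
import math


def chi2q(x2, v):
    """Survival function of chi-square with an even number v of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0
```

Equal and opposite evidence, such as one token at 0.9 and one at 0.1, combines to 0.5, which is the cancelling behaviour referred to earlier in the thread; a run of uniformly extreme tokens pushes the result close to 0 or 1.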

The only way to get a useful idea of what's really happening is to run a 10-fold cross validation run.  http://wiki.apache.org/spamassassin/TenFoldCrossValidation
Comment 18 Justin Mason 2010-01-27 02:31:51 UTC
moving some 3.3.0-targeted bugs into the vague Future.  feel free to retarget to 3.3.1 if you think you'll be able to work on them
Comment 19 Justin Mason 2010-01-27 03:16:31 UTC
reassigning, too
Comment 20 Henrik Krohns 2019-07-08 07:10:28 UTC
Closing my own old bug. I don't think there's anything to do here.