5830 – MSGID_OUTLOOK_INVALID and BROKEN

Summary: MSGID_OUTLOOK_INVALID and BROKEN

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Rules
Version:	3.2.4
Hardware:	Other other

Importance:	P5 normal
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2008-02-16 17:59 UTC by Karsten Bräckelmann
Modified:	2008-02-21 06:53 UTC
CC List:	0 users

Description Karsten Bräckelmann 2008-02-16 17:59:52 UTC

I noticed a particular Message-Id pattern that seems to be *unique* to spam.
Well, at least by grepping through a lot of ham (own corpus and some mailing
list archives), this pattern never seems to be used legitimately.

If any of you guys find even a *single* hit in ham for the following
Message-Id pattern, regardless of the X-Mailer, please let me know. Just egrep
for '<.{8}\$.{8}\$.{8}@' in your ham's Message-Ids.
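A minimal sketch of the same check in Python, for anyone who would rather run it against parsed messages than grep raw files. The function name and the sample message are made up for illustration; only the regex comes from the egrep line above.

```python
import re
from email import message_from_string

# Loose pattern from the egrep above: three 8-character tokens
# separated by '$', followed by '@'. Unanchored, like egrep.
LOOSE_MSGID = re.compile(r"<.{8}\$.{8}\$.{8}@")


def has_loose_pattern(raw_message: str) -> bool:
    """Return True if the Message-Id header matches the loose pattern."""
    msgid = message_from_string(raw_message).get("Message-Id", "")
    return LOOSE_MSGID.search(msgid) is not None


# Hypothetical sample message with a shortened (8$8$8) Message-Id.
sample = (
    "Message-Id: <01c86f5a$3b2e1f40$6c0a8c0a@example.com>\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(has_loose_pattern(sample))
```

A Message-Id whose first token has 12 characters instead of 8 does not match, which is the point of the test.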


Oh, right, the Summary. :)  Well, the pattern seems to be a broken Outlook
forgery, where the first 4 hex chars are missing. The time token seems to be
quite right most of the time, though. Hence the Summary. This is about a BROKEN
Outlook style Message-Id.

Now, while MSGID_OUTLOOK_INVALID thoroughly checks the time token for validity,
this rule is about a BROKEN Outlook Message-Id header, actually invalid, too.

header __MSGID_OUTLOOK_888   Message-Id =~ /^<[0-9a-f]{8}(\$[0-9a-f]{8}){2}\@/
header __KB_OUTLOOK_MUA      X-Mailer =~ /^Microsoft (Office )?Outlook\b/

meta   MSGID_OUTLOOK_BROKEN  __MSGID_OUTLOOK_888 &&  __KB_OUTLOOK_MUA
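
For illustration only, here is how the two sub-rules and the meta rule combine, transliterated to Python. This is a sketch, not how SpamAssassin evaluates rules (it uses Perl regexes on the parsed headers); the function name and the example header values are assumptions.

```python
import re

# The two sub-rule regexes, copied from the header rules above.
MSGID_OUTLOOK_888 = re.compile(r"^<[0-9a-f]{8}(\$[0-9a-f]{8}){2}@")
OUTLOOK_MUA = re.compile(r"^Microsoft (Office )?Outlook\b")


def msgid_outlook_broken(message_id: str, x_mailer: str) -> bool:
    """Emulate the meta rule: both sub-rules must hit."""
    msgid_hit = MSGID_OUTLOOK_888.search(message_id) is not None
    mua_hit = OUTLOOK_MUA.search(x_mailer) is not None
    return msgid_hit and mua_hit


if __name__ == "__main__":
    # Hypothetical header values: shortened Message-Id plus an Outlook X-Mailer.
    print(msgid_outlook_broken(
        "<01c86f5a$3b2e1f40$6c0a8c0a@example.com>",
        "Microsoft Office Outlook, Build 11.0.5510",
    ))
```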


The special __KB_OUTLOOK_MUA would not be necessary if bug 5774 were fixed.

I just went through some months' spam corpus, and it seems about 99.99% of this
particular broken Message-Id does hit the X-Mailer rule, too. Hence the meta
rule -- it probably wouldn't be necessary, though.

Some quickly gathered results:  NO hits in ham found for __MSGID_OUTLOOK_888,
whereas both this rule and the safety-net meta rule trigger on 25% or
more of spam in my corpora of the last months.

Comment 1 Karsten Bräckelmann 2008-02-16 18:17:10 UTC

Whoops, forgot to mention:  As usual, I *only* checked low scoring spam. In this
case score < 15, with a hit rate of 25% or more.  I didn't even bother to check
high scorers.

Comment 2 Karsten Bräckelmann 2008-02-16 19:50:39 UTC

(In reply to comment #0)
> I just went through some months' spam corpus, and it seems about 99.99% of this
> particular broken Message-Id does hit the X-Mailer rule, too. Hence the meta
> rule -- it probably wouldn't be necessary, though.

Hrm.  Checked on the remaining ~0.01%, which actually are exactly 3 messages
from Oct last year. I don't get it, the X-Mailer headers are perfect MS Outlook
forged examples. Even the meta rule should have hit here, too.

grep shows 'em just fine. SA seems to not see these X-Mailer headers at all.
Even $status->get('X-Mailer') fails to return it.  Weird.

Comment 3 Theo Van Dinter 2008-02-18 14:16:04 UTC

fyi:

 10.471  11.2975   0.0000    1.000   1.00    1.00  KB_RATWARE_MSGID

header   KB_RATWARE_MSGID       Message-Id =~ /^<.{8}\$.{8}\$.{8}\@/

Comment 4 Karsten Bräckelmann 2008-02-18 15:53:55 UTC

(In reply to comment #3)
>  10.471  11.2975   0.0000    1.000   1.00    1.00  KB_RATWARE_MSGID
                     ^^^^^^
And that's with the absolute simplest RE, which actually was intended as a grep
test for FPs only. ;)

Thanks, Theo.  And I really like that rule name.

Regarding your "FYI" style responses -- is that good information? ;)  Any
requirements for getting in new rules, other than a considerably better S/O than
0.8, some more mass-testing results and a vote or two?
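
A rough sketch of how the S/O figure relates to the row quoted above. It assumes the columns are OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE and that S/O is computed as spam% / (spam% + ham%); neither is stated in this bug, so treat both as assumptions. The function name is made up.

```python
from typing import Optional


def s_o_ratio(spam_pct: float, ham_pct: float) -> Optional[float]:
    """Share of a rule's hits that land on spam (assumed S/O definition).

    Returns None when the rule hit nothing at all.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


if __name__ == "__main__":
    # SPAM% and HAM% taken from the KB_RATWARE_MSGID row in comment 3.
    print(s_o_ratio(11.2975, 0.0))
```

With zero ham hits the ratio is 1.0 regardless of the spam hit rate, which is why the HAM% column is the one worth pointing at.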

Comment 5 Justin Mason 2008-02-19 02:30:02 UTC

(In reply to comment #4)
> Regarding your "FYI" style responses -- is that good information? ;)  Any
> requirements for getting in new rules, other than a considerably better S/O than
> 0.8, some more mass-testing results and a vote or two?

+1 to that.
theo, could you check that in?  I can't see it in rulesrc and it looks good ;)

Comment 6 Karsten Bräckelmann 2008-02-19 08:30:02 UTC

Oh, goodie. :)

Anyway, which one? The fuzzy one, the one with correct hex strings, or the one
with the additional constraint of forging Outlook? The latter, again, would
require bug 5774 to be fixed.

And please don't get me wrong, I'm just curious how to proceed, since I'm rather
new to SA bugzilla. Though I'm really familiar with various bugzillas, every
project got its own special way of handling things.

Comment 7 Justin Mason 2008-02-20 01:37:47 UTC

(In reply to comment #6)
> Oh, goodie. :)
> 
> Anyway, which one? The fuzzy one, the one with correct hex strings, or the one
> with the additional constraint of forging Outlook? The latter, again, would
> require bug 5774 to be fixed.

well, I'd like to get them checked in for further testing in our Rule-QA system:
http://ruleqa.spamassassin.org/

I've gone ahead and done this since Theo seems busy:

: jm 138...; svn commit -m "add test rules from bug 5830:
KB_MSGID_OUTLOOK_BROKEN" rulesrc/sandbox/jm/22_bug_5830.cf
Adding         rulesrc/sandbox/jm/22_bug_5830.cf
Transmitting file data .
Committed revision 629394.

here's what I checked in:

header __KB_MSGID_OUTLOOK_888   Message-Id =~ /^<[0-9a-f]{8}(?:\$[0-9a-f]{8}){2}\@/
header __KB_OUTLOOK_MUA      X-Mailer =~ /^Microsoft (?:Office )?Outlook\b/

meta   KB_MSGID_OUTLOOK_BROKEN  __KB_MSGID_OUTLOOK_888 &&  __KB_OUTLOOK_MUA


note the minor changes; renames to include the KB prefix and use of (?:...)
instead of (...) for efficiency.
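
To show what the (?:...) change does and does not affect, here is a small Python comparison. Python's `re` handles grouping the same way as Perl for this pattern; the sample Message-Id is made up.

```python
import re

# Original form with a capturing group, and the checked-in
# form with a non-capturing group.
capturing = re.compile(r"^<[0-9a-f]{8}(\$[0-9a-f]{8}){2}@")
non_capturing = re.compile(r"^<[0-9a-f]{8}(?:\$[0-9a-f]{8}){2}@")

# Hypothetical Message-Id in the shortened 8$8$8 form.
msgid = "<01c86f5a$3b2e1f40$6c0a8c0a@example.com>"

m_cap = capturing.match(msgid)
m_non = non_capturing.match(msgid)

# Both forms match exactly the same text.
same_match = m_cap.group(0) == m_non.group(0)

if __name__ == "__main__":
    print(same_match)
    # Number of capture groups the engine has to track for each form.
    print(capturing.groups, non_capturing.groups)
    # The capturing form stores the last repetition of the group.
    print(m_cap.groups())
```

The match result is identical; the non-capturing form just skips saving a substring nobody reads.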

> And please don't get me wrong, I'm just curious how to proceed, since I'm rather
> new to SA bugzilla. Though I'm really familiar with various bugzillas, every
> project got its own special way of handling things.

the way we do it is:

- take rules that seem likely to be useful (at a glance) and add them to a file
in rulesrc/sandbox

- wait a day or two and see how they perform in results on
http://ruleqa.spamassassin.org/

- if they're good, do some further discussion about: wiping out remaining false
positives (or dangers of same); ways to improve the hitrates slightly; ways to
reduce redundant overlap with existing rules; ways to trim down the number of
versions of the proposed new rules.

- once we're happy they get checked in for inclusion in 3.3.0, and possibly
backporting to 3.2.x sa-updates as well.

Comment 8 Karsten Bräckelmann 2008-02-20 12:12:18 UTC

Thanks for the rather exhaustive explanation, Justin. :)

(In reply to comment #7)
> well, I'd like to get them checked in for further testing in our Rule-QA 
> system: http://ruleqa.spamassassin.org/

> Committed revision 629394.
> here's what I checked in:

I'm comfortable with SVN, and finally got an up-to-date trunk checkout again.

> note the minor changes; renames to include the KB prefix and use of (?:...)
> instead of (...) for efficiency.

Cool.  However, seriously, instead of the custom __KB_OUTLOOK_MUA rule, bug 5774
should be fixed. It's simply adding the optional Office part. It's probably just
fine for testing, though.

Also, I kind of fell in love with the name Theo used for the rule.
KB_RATWARE_MSGID just sounds awesome. :)


> - wait a day or two and see how they perform in results on
> http://ruleqa.spamassassin.org/
> 
> - if they're good, do some further discussion about: wiping out remaining false
> positives (or dangers of same); ways to improve the hitrates slightly; ways to
> reduce redundant overlap with existing rules; ways to trim down the number of
> versions of the proposed new rules.

Time to wait a day...

Comment 9 Justin Mason 2008-02-21 06:53:17 UTC

(In reply to comment #8)
> > note the minor changes; renames to include the KB prefix and use of (?:...)
> > instead of (...) for efficiency.
> 
> Cool.  However, seriously, instead of the custom __KB_OUTLOOK_MUA rule, bug 5774
> should be fixed. It's simply adding the optional Office part. It's probably just
> fine for testing, though.

now done.

> Also, I kind of fell in love with the name Theo used for the rule.
> KB_RATWARE_MSGID just sounds awesome. :)

sure ;)

> > - wait a day or two and see how they perform in results on
> > http://ruleqa.spamassassin.org/
> > 
> > - if they're good, do some further discussion about: wiping out remaining false
> > positives (or dangers of same); ways to improve the hitrates slightly; ways to
> > reduce redundant overlap with existing rules; ways to trim down the number of
> > versions of the proposed new rules.
> 
> Time to wait a day...

A day or so ;)  here are the results:
http://ruleqa.spamassassin.org/today/T_KB_MSGID_OUTLOOK_BROKEN/detail

looks great.  no false positives in any corpus, and no significant
overlaps with other rules.  Score map has most of its hits between 3 and
6 points.  Most of the hits seem fresh.  great stuff!

btw, you were asking about the T_ prefix?  it's used to force rule scores
to 0.01 for test rules.  Once a rule is measured as having "good enough"
results, it's allowed to not be a T_ rule.

anyway, this is now in the trunk ruleset as of r629813.