SA Bugzilla – Bug 5830
MSGID_OUTLOOK_INVALID and BROKEN
Last modified: 2008-02-21 06:53:17 UTC
I noticed a particular Message-Id pattern that seems to be *unique* to spam. At least, after grepping through a lot of ham (my own corpus and some mailing list archives), this pattern never seems to be used legitimately. If any of you finds even a *single* hit in ham for the following Message-Id pattern, regardless of the X-Mailer, please let me know. Just egrep for '<.{8}\$.{8}\$.{8}@' in your ham's Message-Ids.

Oh, right, the Summary. :) The pattern seems to be a broken Outlook forgery, where the first 4 hex chars are missing. The time token seems to be quite right most of the time, though. Hence the Summary: this is about a BROKEN Outlook-style Message-Id. While MSGID_OUTLOOK_INVALID thoroughly checks the time token for validity, this rule is about a BROKEN Outlook Message-Id header, which is actually invalid, too.

  header __MSGID_OUTLOOK_888 Message-Id =~ /^<[0-9a-f]{8}(\$[0-9a-f]{8}){2}\@/
  header __KB_OUTLOOK_MUA X-Mailer =~ /^Microsoft (Office )?Outlook\b/
  meta MSGID_OUTLOOK_BROKEN __MSGID_OUTLOOK_888 && __KB_OUTLOOK_MUA

The special __KB_OUTLOOK_MUA would not be necessary if bug 5774 were fixed. I just went through some months of my spam corpus, and it seems about 99.99% of the messages with this particular broken Message-Id hit the X-Mailer rule, too. Hence the meta rule -- it probably wouldn't be necessary, though.

Some quickly gathered results: NO hits in ham found for __MSGID_OUTLOOK_888, whereas both this rule and the safety-net meta rule trigger on 25% or more of the spam in my corpora of the last months.
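For readers who want to try the two patterns outside SpamAssassin, here is a minimal Python sketch that compares the fuzzy grep pattern with the stricter hex-only pattern from this comment. The sample Message-Id values are hypothetical and invented for illustration. The "full" sample simply assumes, per the description above, that a well-formed Outlook-style id carries 4 more hex chars in the first token.

```python
import re

# Fuzzy pattern from the report, meant only as a grep test for false positives.
FUZZY = re.compile(r'^<.{8}\$.{8}\$.{8}@')
# Stricter variant: three 8-char lowercase hex tokens separated by '$'.
STRICT = re.compile(r'^<[0-9a-f]{8}(?:\$[0-9a-f]{8}){2}@')

# Hypothetical Message-Id values, invented for illustration only.
samples = {
    "broken": "<01c87470$1a2b3c4d$0a00a8c0@example.com>",
    "full": "<000001c87470$1a2b3c4d$0a00a8c0@example.com>",
    "non_hex": "<zzzzzzzz$yyyyyyyy$xxxxxxxx@example.com>",
}

def classify(msgid):
    """Report which of the two patterns match a Message-Id value."""
    return {
        "fuzzy": bool(FUZZY.search(msgid)),
        "strict": bool(STRICT.search(msgid)),
    }

results = {name: classify(value) for name, value in samples.items()}
for name, hits in results.items():
    print(name, hits)
```

The fuzzy pattern accepts any 8 characters per token, so it is the wider net for hunting false positives, while the strict one only accepts lowercase hex.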
Whoops, forgot to mention: As usual, I *only* checked low scoring spam. In this case score < 15, with a hit rate of 25% or more. I didn't even bother to check high scorers.
(In reply to comment #0)
> I just went through some months spam corpus, and it seems about 99.99% of this
> particular broken Message-Id does hit the X-Mailer rule, too. Hence the meta
> rule -- it probably wouldn't be necessary, though.

Hrm. I checked the remaining ~0.01%, which are actually exactly 3 messages from Oct last year. I don't get it: the X-Mailer headers are perfect examples of forged MS Outlook headers. Even the meta rule should have hit here, too. grep shows 'em just fine.

SA seems to not see these X-Mailer headers at all. Even $status->get('X-Mailer') fails to return it. Weird.
fyi:

10.471 11.2975 0.0000 1.000 1.00 1.00 KB_RATWARE_MSGID

header KB_RATWARE_MSGID Message-Id =~ /^<.{8}\$.{8}\$.{8}\@/
(In reply to comment #3)
> 10.471 11.2975 0.0000 1.000 1.00 1.00 KB_RATWARE_MSGID
^^^^^^

And that's with the absolute simplest RE, which actually was intended as a grep test for FPs only. ;)

Thanks, Theo. And I really like that rule name.

Regarding your "FYI" style responses -- is that good information? ;) Any requirements for getting in new rules, other than a considerably better S/O than 0.8, some more mass-testing results and a vote or two?
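As a rough illustration of the S/O figure discussed here, the sketch below assumes that the first three columns of the result line are overall%, spam% and ham%, and that S/O is computed as spam% / (spam% + ham%). Both the column reading and the formula are assumptions on my part, not something stated in this thread. The function name is hypothetical.

```python
def so_ratio(spam_pct, ham_pct):
    """Spam/overall ratio, ASSUMED here to be spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # the rule never hit, so the ratio is undefined
    return spam_pct / total

# Figures quoted above for KB_RATWARE_MSGID (column meaning is an assumption).
kb_ratware = so_ratio(11.2975, 0.0)

# Hypothetical rule that hits 8% of spam and 2% of ham.
borderline = so_ratio(8.0, 2.0)

print(kb_ratware, borderline)
```

Under that assumption, a rule with no ham hits at all lands at the top of the scale, which is consistent with the 1.000 shown in the quoted line.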
(In reply to comment #4)
> Regarding your "FYI" style responses -- is that good information? ;) Any
> requirements for getting in new rules, other than a considerably better S/O than
> 0.8, some more mass-testing results and a vote or two?

+1 to that. theo, could you check that in? I can't see it in rulesrc and it looks good ;)
Oh, goodie. :)

Anyway, which one? The fuzzy one, the one with correct hex strings, or the one with the additional constraint of forging Outlook? The latter, again, would require bug 5774 to be fixed.

And please don't get me wrong, I'm just curious how to proceed, since I'm rather new to SA bugzilla. Though I'm really familiar with various bugzillas, every project has its own special way of handling things.
(In reply to comment #6)
> Oh, goodie. :)
>
> Anyway, which one? The fuzzy one, the one with correct hex strings, or the one
> with the additional constraint of forging Outlook? The latter, again, would
> require bug 5774 to be fixed.

well, I'd like to get them checked in for further testing in our Rule-QA system: http://ruleqa.spamassassin.org/

I've gone ahead and done this since Theo seems busy:

: jm 138...; svn commit -m "add test rules from bug 5830: KB_MSGID_OUTLOOK_BROKEN" rulesrc/sandbox/jm/22_bug_5830.cf
Adding rulesrc/sandbox/jm/22_bug_5830.cf
Transmitting file data .
Committed revision 629394.

here's what I checked in:

  header __KB_MSGID_OUTLOOK_888 Message-Id =~ /^<[0-9a-f]{8}(?:\$[0-9a-f]{8}){2}\@/
  header __KB_OUTLOOK_MUA X-Mailer =~ /^Microsoft (?:Office )?Outlook\b/
  meta KB_MSGID_OUTLOOK_BROKEN __KB_MSGID_OUTLOOK_888 && __KB_OUTLOOK_MUA

note the minor changes; renames to include the KB prefix and use of (?:...) instead of (...) for efficiency.

> And please don't get me wrong, I'm just curious how to proceed, since I'm rather
> new to SA bugzilla. Though I'm really familiar with various bugzillas, every
> project has its own special way of handling things.

the way we do it is:

- take rules that seem likely to be useful (at a glance) and add them to a file in rulesrc/sandbox

- wait a day or two and see how they perform in results on http://ruleqa.spamassassin.org/

- if they're good, do some further discussion about: wiping out remaining false positives (or dangers of same); ways to improve the hitrates slightly; ways to reduce redundant overlap with existing rules; ways to trim down the number of versions of the proposed new rules.

- once we're happy, they get checked in for inclusion in 3.3.0, and possibly backported to 3.2.x sa-updates as well.
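The logic of the checked-in meta rule (both sub-rules must hit) can be approximated outside SpamAssassin with a short Python sketch. It is only an illustration and does not reproduce how SA itself parses headers. The function name and the sample messages are hypothetical.

```python
import re
from email import message_from_string

# Same expressions as the checked-in rules, with non-capturing groups.
MSGID_888 = re.compile(r'^<[0-9a-f]{8}(?:\$[0-9a-f]{8}){2}@')
OUTLOOK_MUA = re.compile(r'^Microsoft (?:Office )?Outlook\b')

def kb_msgid_outlook_broken(raw_message):
    """Return True when both sub-rules would hit, mirroring the meta rule."""
    msg = message_from_string(raw_message)
    msgid = (msg.get("Message-Id") or "").strip()
    mailer = (msg.get("X-Mailer") or "").strip()
    return bool(MSGID_888.search(msgid)) and bool(OUTLOOK_MUA.search(mailer))

# Hypothetical messages, invented for illustration only.
forged = (
    "Message-Id: <01c87470$1a2b3c4d$0a00a8c0@example.com>\n"
    "X-Mailer: Microsoft Office Outlook, Build 11.0.5510\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
other_mailer = forged.replace(
    "Microsoft Office Outlook, Build 11.0.5510", "ExampleMailer 1.0"
)

forged_hit = kb_msgid_outlook_broken(forged)
other_hit = kb_msgid_outlook_broken(other_mailer)
print(forged_hit, other_hit)
```

Requiring the X-Mailer match as well acts as the safety net described earlier in the thread: a message with the 8$8$8 Message-Id but a different mailer string does not trigger the meta rule.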
Thanks for the rather exhaustive explanation, Justin. :)

(In reply to comment #7)
> well, I'd like to get them checked in for further testing in our Rule-QA
> system: http://ruleqa.spamassassin.org/

> Committed revision 629394.

> here's what I checked in:

I'm comfortable with SVN, and finally got an up-to-date trunk checkout again.

> note the minor changes; renames to include the KB prefix and use of (?:...)
> instead of (...) for efficiency.

Cool. However, seriously, instead of the custom __KB_OUTLOOK_MUA rule, bug 5774 should be fixed. It's simply adding the optional Office part. It's probably just fine for testing, though.

Also, I kind of fell in love with the name Theo used for the rule. KB_RATWARE_MSGID just sounds awesome. :)

> - wait a day or two and see how they perform in results on
> http://ruleqa.spamassassin.org/
>
> - if they're good, do some further discussion about: wiping out remaining false
> positives (or dangers of same); ways to improve the hitrates slightly; ways to
> reduce redundant overlap with existing rules; ways to trim down the number of
> versions of the proposed new rules.

Time to wait a day...
(In reply to comment #8)
> > note the minor changes; renames to include the KB prefix and use of (?:...)
> > instead of (...) for efficiency.
>
> Cool. However, seriously, instead of the custom __KB_OUTLOOK_MUA rule, bug 5774
> should be fixed. It's simply adding the optional Office part. It's probably just
> fine for testing, though.

now done.

> Also, I kind of fell in love with the name Theo used for the rule.
> KB_RATWARE_MSGID just sounds awesome. :)

sure ;)

> > - wait a day or two and see how they perform in results on
> > http://ruleqa.spamassassin.org/
> >
> > - if they're good, do some further discussion about: wiping out remaining false
> > positives (or dangers of same); ways to improve the hitrates slightly; ways to
> > reduce redundant overlap with existing rules; ways to trim down the number of
> > versions of the proposed new rules.
>
> Time to wait a day...

A day or so ;) here are the results:

http://ruleqa.spamassassin.org/today/T_KB_MSGID_OUTLOOK_BROKEN/detail

looks great. no false positives in any corpus, and no significant overlaps with other rules. Score map has most of its hits between 3 and 6 points. Most of the hits seem fresh. great stuff!

btw, you were asking about the T_ prefix? it's used to force rule scores to 0.01 for test rules. Once a rule is measured as having "good enough" results, it's allowed to not be a T_ rule.

anyway, this is now in the trunk ruleset as of r629813.