SA Bugzilla – Bug 5110
EXTRA_MPART_TYPE fires on valid multipart/related
Last modified: 2007-02-25 08:30:24 UTC
EXTRA_MPART_TYPE matches on Content-Type headers which also have a type= parameter. However, RFC 2387 specifies that "multipart/related" MUST have a type parameter giving the MIME type of the root body part. Around 10% of my hits on this rule are for Content-Type=multipart/related with a type parameter, such as:

    Content-Type: multipart/related; type="text/html"; boundary="..."

or

    Content-Type: multipart/related; type="multipart/alternative";

In particular, M$ Exchange sometimes seems to make use of this format for including background images. Much though I detest them, I'd prefer they didn't contribute to a spam score!
I don't understand why there's a check for zero or one matches on 'multipart' in the RE for this rule either. It seems to me that (?:\s*multipart\/)? will match any input (and testing the RE confirms that). Is this perhaps a typo for excluding multiparts, or something else?

May I suggest this rule for trial, which excludes only multipart/related:

    m/^\s*\b(?!multipart\/related\b).* type=/i

or perhaps this looser one, which excludes all multiparts:

    m/^\s*\b(?!multipart\/).* type=/i
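To make the difference between the current pattern and the two proposed ones concrete, here is a minimal sketch that runs all three against a few header values. It uses Python's `re` module as a stand-in for Perl (the constructs used here behave the same way in both), and the sample values and the names `SAMPLES` and `hits` are made up for illustration, not taken from any corpus.

```python
import re

# Patterns as quoted in this report; Python's re is used in place of Perl.
ORIGINAL = re.compile(r"(?:\s*multipart\/)?.* type=", re.I)
NO_RELATED = re.compile(r"^\s*\b(?!multipart\/related\b).* type=", re.I)
NO_MULTIPART = re.compile(r"^\s*\b(?!multipart\/).* type=", re.I)

# Hypothetical Content-Type values (header name already stripped).
SAMPLES = {
    "related": 'multipart/related; type="text/html"; boundary="x"',
    "mixed": 'multipart/mixed; type="text/html"; boundary="x"',
    "plain": 'text/plain; type="text/html"',
    "no_type": 'multipart/related; boundary="x"',
}


def hits(pattern):
    """Return the names of the sample values that the pattern fires on."""
    return {name for name, value in SAMPLES.items() if pattern.search(value)}


for label, pattern in [("original", ORIGINAL),
                       ("no related", NO_RELATED),
                       ("no multipart", NO_MULTIPART)]:
    print(label, sorted(hits(pattern)))
```

With these samples, the first proposal stops firing on multipart/related but still fires on other multipart types carrying a type= parameter, while the looser one only fires on non-multipart types.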
 9.916  11.8320  0.5927  0.952  0.69  0.85  EXTRA_MPART_TYPE

So FWIW, both ham and spam hit on this, and it's basically always multipart/related, so trying to avoid hitting on that type essentially means killing the rule. Perhaps there's a way to clean up the ham hits by looking at both the type and the type="??" value. In general, I think this is just one of those things that can definitely happen in ham but seems to be a lot more prevalent in spam. The scores are relatively low though IMO, and they'd be lower if representative mails were included in the next score generation run:

    score EXTRA_MPART_TYPE 0.847 0.815 0.733 1.091

BTW: I'm also not sure why the multipart bit is in the RE:

    Content-Type =~ /(?:\s*multipart\/)?.* type=/i

Seems strange.
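For readers unfamiliar with the hit-frequency line above, here is a small worked check. It assumes the columns are in the usual order (overall %, spam %, ham %, S/O, rank, score, rule name) and that S/O is spam % divided by the sum of spam % and ham %. The column meaning is an assumption on my part, though the figures in this report are consistent with it. The function name `spam_over_overall` is made up for illustration.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O under the assumption S/O = spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# Figures from the EXTRA_MPART_TYPE line above: 11.8320% spam, 0.5927% ham.
so = spam_over_overall(11.8320, 0.5927)
print(round(so, 3))  # 0.952, matching the fourth column
```

An S/O near 1.0 means the rule's hits are almost all spam. Around 0.95 still leaves a noticeable share of ham hits, which is what this report is about.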
hey -- looks to me like the /(?:\s*multipart\/)?/ thing may have started off being just "multipart/", then the (?:...)? was added later. it's certainly superfluous.
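A quick way to confirm that the optional group is superfluous is to check that the rule gives the same verdict with and without it. The sketch below uses Python's `re` in place of Perl. The sample values and the helper name `same_verdict` are hypothetical.

```python
import re

WITH_GROUP = re.compile(r"(?:\s*multipart\/)?.* type=", re.I)
WITHOUT_GROUP = re.compile(r".* type=", re.I)

# Hypothetical Content-Type values, with and without a type= parameter.
SAMPLES = [
    'multipart/related; type="text/html"; boundary="x"',
    'text/plain; type="text/html"',
    'multipart/mixed; boundary="x"',
    "",
]


def same_verdict(value):
    """True if both patterns agree on whether the value matches."""
    return bool(WITH_GROUP.search(value)) == bool(WITHOUT_GROUP.search(value))


print(all(same_verdict(v) for v in SAMPLES))
```

Because the group may match the empty string and the pattern is unanchored, the only thing that decides a hit is whether " type=" appears somewhere in the value.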
Hrm. Interesting results:

 22.807  25.2886  0.1464  0.994  1.00  1.00  EXTRA_MPART_TYPE2
 26.188  29.0237  0.2928  0.990  0.67  1.00  EXTRA_MPART_TYPE3
 19.882  22.0423  0.1464  0.993  0.00  0.85  EXTRA_MPART_TYPE

In order:
- content-type includes /\btype=/i
- content-type starts with multipart/related
- original

So the diff between the original and #2 is probably that the original looks for " type=" and #2 will accept ";type=", so that's an easy win. I just threw in #3 because I was curious. Essentially what this means is that for my corpus, at least, multipart/related is very likely to be spam (and all of the #2 hits also hit #3). Interestingly, the difference between #2 and #3 is people ignoring RFC 2387. :(

So part of me wants to just use #3. While the 2x ham rate looks daunting, it really means that instead of 2 ham hits, it was 4 ham hits. Thoughts?
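The three variants can be compared side by side with a short sketch. Python's `re` is used in place of Perl. The exact regex for #3 is not given in this comment, so `^multipart\/related` is an assumption about how "starts with multipart/related" might be written. The sample header values and the helper name `verdicts` are also made up.

```python
import re

ORIGINAL = re.compile(r"(?:\s*multipart\/)?.* type=", re.I)  # needs a space before "type="
TYPE2 = re.compile(r"\btype=", re.I)                          # #2 as described above
TYPE3 = re.compile(r"^multipart\/related", re.I)              # #3: assumed form, not from the source


def verdicts(value):
    """Return (original, #2, #3) hit flags for one Content-Type value."""
    return tuple(bool(p.search(value)) for p in (ORIGINAL, TYPE2, TYPE3))


# Space after the semicolon: all three fire.
print(verdicts('multipart/related; type="text/html"; boundary="x"'))
# No space before type=: the original misses it, #2 and #3 still fire.
print(verdicts('multipart/related;type="text/html"; boundary="x"'))
# No type parameter at all (ignoring RFC 2387): only #3 fires.
print(verdicts('multipart/related; boundary="x"'))
```

The last case is the gap between #2 and #3 mentioned above: multipart/related messages that omit the type parameter.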
can you throw them into svn so that we can see how they do on all corpora? this url will show it: http://ruleqa.spamassassin.org/?daterev=last-night&rule=%2FEXTRA_MPART_TYPE
here's what they get in http://ruleqa.spamassassin.org/?daterev=20061228-r490679-n&rule=%2FEXTRA_MPART_TYPE&srcpath=&s_detail=on&g=Change :

 0.00000  17.9181  0.1088  0.994  0.85  0.85  EXTRA_MPART_TYPE
 0.00000   2.1400  0.0000  1.000  0.66  0.85  EXTRA_MPART_TYPE bb-doc
 0.00000  10.1073  1.5162  0.870  0.79  0.85  EXTRA_MPART_TYPE bb-fredt
 0.00000  26.8785  0.0120  1.000  0.99  0.85  EXTRA_MPART_TYPE bb-jm
 0.00000  12.0507  0.0000  1.000  0.87  0.85  EXTRA_MPART_TYPE bb-zmi
 0.00000  11.0359  0.2279  0.980  0.85  0.85  EXTRA_MPART_TYPE cthielen
 0.00000   9.1298  0.0352  0.996  0.91  0.85  EXTRA_MPART_TYPE daf
 0.00000  20.4588  0.0133  0.999  0.98  0.85  EXTRA_MPART_TYPE jm
 0.00000  19.7498  0.1335  0.993  0.83  0.85  EXTRA_MPART_TYPE theo

 0.00000  20.7805  0.2403  0.989  0.77  0.00  T_EXTRA_MPART_TYPE2
 0.00000   2.2867  0.0000  1.000  0.67  0.00  T_EXTRA_MPART_TYPE2 bb-doc
 0.00000  10.5674  2.2054  0.827  0.73  0.00  T_EXTRA_MPART_TYPE2 bb-fredt
 0.00000  34.4643  0.1801  0.995  0.90  0.00  T_EXTRA_MPART_TYPE2 bb-jm
 0.00000  13.9682  0.0000  1.000  0.88  0.00  T_EXTRA_MPART_TYPE2 bb-zmi
 0.00000  14.6210  0.7443  0.952  0.77  0.00  T_EXTRA_MPART_TYPE2 cthielen
 0.00000  12.2418  0.2114  0.983  0.86  0.00  T_EXTRA_MPART_TYPE2 daf
 0.00000  27.3740  0.3070  0.989  0.86  0.00  T_EXTRA_MPART_TYPE2 jm
 0.00000  22.3068  0.1590  0.993  0.82  0.00  T_EXTRA_MPART_TYPE2 theo

 0.00000  26.3436  0.3369  0.987  0.73  0.00  T_EXTRA_MPART_TYPE3
 0.00000   6.4333  0.0000  1.000  0.84  0.00  T_EXTRA_MPART_TYPE3 bb-doc
 0.00000  20.3347  2.3432  0.897  0.80  0.00  T_EXTRA_MPART_TYPE3 bb-fredt
 0.00000  39.6935  0.0760  0.998  0.96  0.00  T_EXTRA_MPART_TYPE3 bb-jm
 0.00000  19.3598  0.0000  1.000  0.90  0.00  T_EXTRA_MPART_TYPE3 bb-zmi
 0.00000  20.6144  1.3975  0.937  0.72  0.00  T_EXTRA_MPART_TYPE3 cthielen
 0.00000  14.8961  0.0470  0.997  0.94  0.00  T_EXTRA_MPART_TYPE3 daf
 0.00000  28.7010  0.1601  0.994  0.93  0.00  T_EXTRA_MPART_TYPE3 jm
 0.00000  28.3078  0.3630  0.987  0.74  0.00  T_EXTRA_MPART_TYPE3 theo

to be honest, I'm quite worried about the high ham rates in some corpora; I've been finding (and seeing reports on the users list) that a few new rules are firing on legit Outlook Express mails (EXTRA_MPART_TYPE allegedly being one). I think I'd stick with EXTRA_MPART_TYPE, and lower its score.
I think we can just comment/remove T_EXTRA_MPART_TYPE2, T_EXTRA_MPART_TYPE3 during the perceptron run, and see what score it gives to EXTRA_MPART_TYPE; if it's judged too high, we may need to manually score it (low) and re-run the perceptron to take that into account. marking this as a dependency of 5270 accordingly.
ok, I've removed the test rules from my sandbox.

    Sending        felicity/70_other.cf
    Transmitting file data .
    Committed revision 494754.
It's been given a much higher score by the GA:

    score EXTRA_MPART_TYPE 2.501 2.636 1.359 1.404

as per comment 7, let's score that down to a manually-scored 1.0 and see how that affects accuracy. see bug 5270 for more details of that...
ok -- set to 1.0, see bug 5270 for more details.