5110 – EXTRA_MPART_TYPE fires on valid multipart/related

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Rules
Version:	3.1.5
Hardware:	Other other

Importance:	P5 normal
Target Milestone:	3.2.0
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:	post-perceptron
Keywords:

Depends on:	5270
Blocks:

Reported:	2006-09-27 13:46 UTC by Nick Leverton
Modified:	2007-02-25 08:30 UTC
CC List:	0 users

Description Nick Leverton 2006-09-27 13:46:29 UTC

EXTRA_MPART_TYPE matches on Content-Type headers which also have a type=  
parameter.  However RFC 2387 specifies that "multipart/related" MUST have a  
type parameter giving the MIME type of the root body part.  
  
Around 10% of my hits on this rule are for "Content-Type: multipart/related"
headers with a type parameter, such as:
Content-Type: multipart/related; type="text/html"; boundary="..."  
or  
Content-Type: multipart/related; type="multipart/alternative";  
  
In particular, M$ Exchange sometimes seems to use this format for
including background images.  Much though I detest them, I'd prefer they
didn't contribute to a spam score!

Comment 1 Nick Leverton 2006-09-27 14:09:07 UTC

I also don't understand why the RE for this rule checks for zero or one
matches of 'multipart/'.  It seems to me that

(?:\s*multipart\/)?

will match any input (and testing the RE confirms that).  Is this perhaps a
typo for something meant to exclude multiparts, or something else?
 
May I suggest this rule for trial which excludes only multipart/related: 
 
m/^\s*\b(?!multipart\/related\b).* type=/i 
 
or perhaps this looser one which excludes all multiparts: 
 
m/^\s*\b(?!multipart\/).* type=/i
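
A quick way to check both points is the sketch below. It is a transcription of the patterns above into Python's re module rather than Perl, and the header values are made-up samples modelled on the ones in the description, so treat it as illustrative only.

```python
import re

# Fragment questioned above: an optional group can always match the empty string.
OPTIONAL_MULTIPART = re.compile(r"(?:\s*multipart/)?", re.I)

# The two trial rules proposed above ("/" needs no escaping in Python).
SKIP_RELATED = re.compile(r"^\s*\b(?!multipart/related\b).* type=", re.I)
SKIP_ALL_MULTIPART = re.compile(r"^\s*\b(?!multipart/).* type=", re.I)

# Hypothetical Content-Type values, not taken from real mail.
RELATED = 'multipart/related; type="text/html"; boundary="x"'
MIXED = 'multipart/mixed; type="text/plain"; boundary="x"'
TEXT = 'text/plain; type="odd"'


def hits(pattern, value):
    """Return True if the pattern matches anywhere in the header value."""
    return pattern.search(value) is not None


for value in (RELATED, MIXED, TEXT):
    print(
        hits(OPTIONAL_MULTIPART, value),   # always True, even on ""
        hits(SKIP_RELATED, value),         # False only for multipart/related
        hits(SKIP_ALL_MULTIPART, value),   # False for every multipart/*
        value,
    )
```

With these samples the optional fragment matches everything, the first trial rule skips only the multipart/related value, and the looser one skips both multipart values.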

Comment 2 Theo Van Dinter 2006-09-27 15:09:20 UTC

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  9.916  11.8320   0.5927    0.952   0.69    0.85  EXTRA_MPART_TYPE

So FWIW, both ham and spam hit on this, and it's basically always
multipart/related, so trying to avoid hitting on that type essentially means
killing the rule.  Perhaps there's a way to clean up the ham hits by looking at
both the type and the type="??" value.  In general, I think this is just one of
those things that can definitely happen in ham but seems to be a lot more
prevalent in spam.  The scores are relatively low though IMO, and they'd be
lower if representative mails were included in the next score generation run:

score EXTRA_MPART_TYPE 0.847 0.815 0.733 1.091
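
For anyone reading the frequency line above: the columns appear to be overall hit %, spam hit %, ham hit %, S/O, rank and score. The S/O figure is consistent with spam% / (spam% + ham%). That formula is inferred from the numbers here rather than stated in this thread, so the Python below is only a sanity check under that assumption.

```python
def spam_over_overall(spam_pct, ham_pct):
    """Assumed S/O definition: share of the rule's hit rate that falls on spam."""
    return spam_pct / (spam_pct + ham_pct)


# SPAM% and HAM% taken from the EXTRA_MPART_TYPE line above.
so = spam_over_overall(11.8320, 0.5927)
print(f"{so:.3f}")  # prints 0.952, matching the S/O column
```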


BTW: I'm also not sure why the multipart bit is in the RE:

Content-Type =~ /(?:\s*multipart\/)?.* type=/i

Seems strange.

Comment 3 Justin Mason 2006-09-27 17:13:57 UTC

hey --

looks to me like the /(?:\s*multipart\/)?/ thing may have started off being just
"multipart/", then the (?:...)? was added later.  it's certainly superfluous.

Comment 4 Theo Van Dinter 2006-12-05 07:59:12 UTC

Hrm.  Interesting results:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 22.807  25.2886   0.1464    0.994   1.00    1.00  EXTRA_MPART_TYPE2
 26.188  29.0237   0.2928    0.990   0.67    1.00  EXTRA_MPART_TYPE3
 19.882  22.0423   0.1464    0.993   0.00    0.85  EXTRA_MPART_TYPE

In order:
- content-type includes /\btype=/i
- content-type starts with multipart/related
- original

So the diff between the original and #2 is probably that the original looks for
" type=" and #2 will accept ";type=", so that's an easy win.

I just threw in #3 because I was curious.  Essentially what this means is that
for my corpus, at least, multipart/related is very likely to be spam (and all of
the #2 hits also hit #3).  Interestingly, the difference between #2 and #3 is
people ignoring RFC 2387. :(
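
To make the three variants concrete, here is a rough Python sketch. The original pattern is the one quoted in comment 2 and #2 is /\btype=/i as listed above. The exact #3 test rule isn't shown in this thread, so its pattern here is an assumption, and the header values are invented samples.

```python
import re

RULES = {
    # Original rule as quoted in comment 2: needs a space before "type=".
    "original": re.compile(r"(?:\s*multipart/)?.* type=", re.I),
    # Variant #2: Content-Type includes /\btype=/i.
    "#2": re.compile(r"\btype=", re.I),
    # Variant #3: "starts with multipart/related" (assumed pattern).
    "#3": re.compile(r"^\s*multipart/related\b", re.I),
}

# Hypothetical Content-Type values.
SPACED = 'multipart/related; type="text/html"; boundary="x"'
UNSPACED = 'multipart/related;type="text/html";boundary="x"'
NO_TYPE = 'multipart/related; boundary="x"'  # omits the RFC 2387 type parameter


def fired(value):
    """Return the names of the rules that hit on this header value."""
    return [name for name, rx in RULES.items() if rx.search(value)]


for value in (SPACED, UNSPACED, NO_TYPE):
    print(fired(value), value)
```

On these samples the ";type=" form is caught by #2 and #3 but missed by the original, and the header with no type parameter is caught only by #3, which is the pattern of differences described above.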

So part of me wants to just use #3.  While the 2x ham rate looks daunting, it
really means that instead of 2 ham hits, it was 4 ham hits.

Thoughts?

Comment 5 Justin Mason 2006-12-05 08:09:26 UTC

can you throw them into svn so that we can see how they do on all corpora?
this url will show it:

http://ruleqa.spamassassin.org/?daterev=last-night&rule=%2FEXTRA_MPART_TYPE

Comment 6 Justin Mason 2006-12-28 09:53:33 UTC

here's what they get in
http://ruleqa.spamassassin.org/?daterev=20061228-r490679-n&rule=%2FEXTRA_MPART_TYPE&srcpath=&s_detail=on&g=Change
:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO
0.00000  17.9181   0.1088   0.994    0.85    0.85  EXTRA_MPART_TYPE
0.00000   2.1400   0.0000   1.000    0.66    0.85  EXTRA_MPART_TYPE bb-doc  
0.00000  10.1073   1.5162   0.870    0.79    0.85  EXTRA_MPART_TYPE bb-fredt  
0.00000  26.8785   0.0120   1.000    0.99    0.85  EXTRA_MPART_TYPE bb-jm  
0.00000  12.0507   0.0000   1.000    0.87    0.85  EXTRA_MPART_TYPE bb-zmi  
0.00000  11.0359   0.2279   0.980    0.85    0.85  EXTRA_MPART_TYPE cthielen  
0.00000   9.1298   0.0352   0.996    0.91    0.85  EXTRA_MPART_TYPE daf  
0.00000  20.4588   0.0133   0.999    0.98    0.85  EXTRA_MPART_TYPE jm  
0.00000  19.7498   0.1335   0.993    0.83    0.85  EXTRA_MPART_TYPE theo  

0.00000  20.7805   0.2403   0.989    0.77    0.00  T_EXTRA_MPART_TYPE2   
0.00000   2.2867   0.0000   1.000    0.67    0.00  T_EXTRA_MPART_TYPE2 bb-doc  
0.00000  10.5674   2.2054   0.827    0.73    0.00  T_EXTRA_MPART_TYPE2 bb-fredt  
0.00000  34.4643   0.1801   0.995    0.90    0.00  T_EXTRA_MPART_TYPE2 bb-jm  
0.00000  13.9682   0.0000   1.000    0.88    0.00  T_EXTRA_MPART_TYPE2 bb-zmi  
0.00000  14.6210   0.7443   0.952    0.77    0.00  T_EXTRA_MPART_TYPE2 cthielen  
0.00000  12.2418   0.2114   0.983    0.86    0.00  T_EXTRA_MPART_TYPE2 daf  
0.00000  27.3740   0.3070   0.989    0.86    0.00  T_EXTRA_MPART_TYPE2 jm  
0.00000  22.3068   0.1590   0.993    0.82    0.00  T_EXTRA_MPART_TYPE2 theo  

0.00000  26.3436   0.3369   0.987    0.73    0.00  T_EXTRA_MPART_TYPE3   
0.00000   6.4333   0.0000   1.000    0.84    0.00  T_EXTRA_MPART_TYPE3 bb-doc  
0.00000  20.3347   2.3432   0.897    0.80    0.00  T_EXTRA_MPART_TYPE3 bb-fredt  
0.00000  39.6935   0.0760   0.998    0.96    0.00  T_EXTRA_MPART_TYPE3 bb-jm  
0.00000  19.3598   0.0000   1.000    0.90    0.00  T_EXTRA_MPART_TYPE3 bb-zmi  
0.00000  20.6144   1.3975   0.937    0.72    0.00  T_EXTRA_MPART_TYPE3 cthielen  
0.00000  14.8961   0.0470   0.997    0.94    0.00  T_EXTRA_MPART_TYPE3 daf  
0.00000  28.7010   0.1601   0.994    0.93    0.00  T_EXTRA_MPART_TYPE3 jm  
0.00000  28.3078   0.3630   0.987    0.74    0.00  T_EXTRA_MPART_TYPE3 theo  

to be honest, I'm quite worried about the high ham rates in some corpora;
I've been finding (and seeing reports on the users list) that a few new rules
are firing on legit Outlook Express mails (EXTRA_MPART_TYPE allegedly being one).

I think I'd stick with EXTRA_MPART_TYPE, and lower its score.

Comment 7 Justin Mason 2007-01-05 05:30:11 UTC

I think we can just comment out or remove T_EXTRA_MPART_TYPE2 and
T_EXTRA_MPART_TYPE3 during the perceptron run, and see what score it gives to
EXTRA_MPART_TYPE; if it's judged too high, we may need to manually score it
(low) and re-run the perceptron to take that into account.

marking this as a dependency of 5270 accordingly.

Comment 8 Theo Van Dinter 2007-01-09 23:32:04 UTC

ok, I've removed the test rules from my sandbox.

Sending        felicity/70_other.cf
Transmitting file data .
Committed revision 494754.

Comment 9 Justin Mason 2007-02-24 13:08:06 UTC

It's been given a much higher score by the GA:
score EXTRA_MPART_TYPE 2.501 2.636 1.359 1.404
as per comment 7, let's score that down to a manually-scored 1.0
and see how that affects accuracy.  see bug 5270 for more details of that...

Comment 10 Justin Mason 2007-02-25 08:30:24 UTC

ok -- set to 1.0, see bug 5270 for more details.