Bug 3131 - RFE: add an ":all" modifier to body/rawbody to match entire message body text as 1 string
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: 2.63
Hardware: All Linux
Importance: P4 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
Reported: 2004-03-05 23:39 UTC by Loren Wilton
Modified: 2006-06-27 10:45 UTC

Description Loren Wilton 2004-03-05 23:39:23 UTC
I have a class of spams that appear sometimes in html and sometimes in plain 
text.  A characteristic of all of these spams is a signature in the form:

Regards,
Alpha Foo (or other 2-word random name)

In html:

<P>Regards,
<P>Alpha Foo
<P>

Or:

<P>Regards,<BR>
Alpha Foo<BR>

Seemingly a simple body test should catch this:

body BOGUS_SIG  /\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b/

This in fact works on the 20% or so of these spams that are plain text.  It 
does NOT work on any of the HTML spams, because "body" breaks into multiple 
hunks at <P> marks or <BR> marks!  Thus, "body" in html is approximately as 
useless as rawbody when trying to find a specific sequence of words that will 
match and might span a line break.
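
A minimal sketch of the effect, written in Python rather than SpamAssassin's own Perl. The paragraph lists are hypothetical stand-ins for how rendered body text might be chunked, on the assumption that each <P> or <BR> starts a new chunk; they are not SpamAssassin's actual data structures.

```python
import re

# The signature pattern from the BOGUS_SIG rule above.
SIG = re.compile(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b")

# Hypothetical rendered chunks; assumes <P>/<BR> each start a new chunk.
plain_paragraphs = ["Buy now! Regards, Alpha Foo"]
html_paragraphs = ["Buy now!", "Regards,", "Alpha Foo"]


def any_paragraph_matches(paragraphs):
    """Mimic a per-paragraph body test: each chunk is matched on its own."""
    return any(SIG.search(p) for p in paragraphs)


def joined_matches(paragraphs):
    """Match once against all chunks joined with single spaces."""
    return bool(SIG.search(" ".join(paragraphs)))


print(any_paragraph_matches(plain_paragraphs))  # plain text: one chunk
print(any_paragraph_matches(html_paragraphs))   # HTML: split at the markup
print(joined_matches(html_paragraphs))          # joined: signature is whole again
```

Under these assumptions the per-paragraph test finds the signature only in the plain-text case, while the joined test also finds it in the HTML case.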

Now, it will be argued (erroneously) that this can be handled by a meta, and 
thus splitting the body isn't a problem:

body __REGARDS /\bRegards,/
body __SIG /\b[A-Z][a-z]+ [A-Z][a-z]+\b/
meta BOGUS_SIG (__REGARDS && __SIG)

I leave it to the reader to figure out why that one won't work.
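
One likely reason, sketched in Python with a made-up legitimate message: the two sub-rules are evaluated independently, so the meta fires whenever "Regards," and any two capitalized words appear anywhere in the message, not only when they are adjacent. The `meta_hits` helper and the sample messages are hypothetical illustrations, not SpamAssassin code.

```python
import re

# The two sub-rules from the meta above.
REGARDS = re.compile(r"\bRegards,")
SIG = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def meta_hits(paragraphs):
    """Mimic (__REGARDS && __SIG): each sub-rule may hit in any paragraph."""
    has_regards = any(REGARDS.search(p) for p in paragraphs)
    has_sig = any(SIG.search(p) for p in paragraphs)
    return has_regards and has_sig


# Hypothetical ham: no two-word name follows "Regards," at all.
ham = ["Thanks for the update from New York.", "Regards,", "bob"]
print(meta_hits(ham))  # "New York" alone satisfies the __SIG half
```

Here the meta hits even though the message has no "Regards, Firstname Lastname" signature, because "New York" in an unrelated paragraph satisfies the second sub-rule.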

Thus, the ONLY current way to detect this particular spam signature is to use 
FULL.  Which means that the regex now has to parse html and line breaks.  And 
will fail if the body is encoded in quoted-printable or base64.  Or if the text 
appears in a header.  Or probably any number of other possible obfuscations or 
erroneous hits.

It is argued that 'body' needs to be separate pieces to reduce regex overhead.  
I argue that forcing simple body tests onto full (which is by definition a 
larger hunk of text than the combined body of the message) INCREASES overhead, 
both due to searching a larger text string, and because the tests themselves 
become very convoluted to attempt to un-encode all of the various ways that a 
body can be encoded.  Thus the body decoding has to be done multiple times.

Another argument would of course be that we don't need to detect obvious spam 
signatures, since being able to look for them would perhaps increase SA 
overhead.  The rejoinder to that is, what the heck is the purpose of SA if not 
to detect spam by its characteristics?  Do we expect the spammers to purposely 
code their spams to make them easy for SA to detect?
Comment 1 Colin Ogilvie 2004-03-06 07:29:45 UTC
I would certainly not like to see this rule added as I've just tried it locally,
and it catches a common way I, and others, sign their emails at times. 

"Regards, Colin Ogilvie"

Also, it doesn't catch the example you gave here of

Regards,
Alpha Foo

which would catch EVERY email I sent from my home computer and a lot of the ones
I get from companies etc.
Comment 2 Loren Wilton 2004-03-06 07:53:49 UTC
This rule was not intended to be used standalone, but as part of a meta to 
catch the particular spams.  These spams have about 4 identifying 
characteristics that between them are quite unique.  Losing any of the parts 
of the meta increases the likelihood of a FP considerably.

As of last night, the combined meta has been catching my spams with no FPs on a 
100K corpus.  The particular case that I had to code with a full body search 
does indeed produce many FPs by itself.

The new strain of spams we are starting to see have very small bodies, often in 
plain text, and do NOT contain all of the usual obfuscation tricks that are so 
easy to detect and filter out.  Filtering them can be quite tricky, and the 
possibility of FPs higher than on more traditional spam, so the rules have to 
be watched carefully and tweaked as necessary rather than thrown in a box and 
forgotten.

Note I am NOT proposing this rule as part of the SA main tests; it would not be 
appropriate.  It IS however appropriate for this current particular spam when 
used with other tests.
Comment 3 Loren Wilton 2004-03-06 07:55:49 UTC
The current rule is

full __REGARDLESS_SIG m|Regards,[\s\r\n<pP>]{0,10}[A-Z][a-z]+\s[A-Z][a-z]+|
Comment 4 Theo Van Dinter 2004-03-06 16:26:29 UTC
Well, here's my 2 cents:

1) it used to be that the tests would run over the whole thing at once.  that 
blew up horribly due to RE issues/backtracking/etc.  that's when we went to 
doing things by paragraph.

2) besides your example, the current system works very well.

3) you're upset that <P> makes lines treated as different paragraphs, but that's 
what <P> does -- makes different paragraphs.

4) "Do we expect the spammers to purposely code their spams to make them easy 
for SA to detect?"   The answer of course is no, but what you're looking for 
isn't really a spamsign, so ...

5) I agree that full is not appropriate for "simple" tests, that's not what it's 
meant to be used for.  I'd actually like to get rid of "full" since nothing we 
have uses it, and if you're resorting to full there's likely better ways to 
search for what you want.

6) I wouldn't mind adding in some form of ":all" functionality for body/rawbody 
which would basically do a join("",@array_of_paragraphs) && s/\s+/ /g before 
doing the test.  That way people who want to try it can, and if they blow their 
own system up with bad RE -- well, that's not our problem. ;)
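
A rough Python rendering of the join-and-collapse idea in item 6, as a sketch only. The `body_all` name is hypothetical, and the example assumes each paragraph string ends with a newline, since joining with an empty separator would otherwise glue adjacent paragraphs together.

```python
import re


def body_all(paragraphs):
    """Rough equivalent of join("", @array_of_paragraphs) then s/\\s+/ /g."""
    joined = "".join(paragraphs)
    # Collapse every run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", joined)


# Hypothetical paragraphs; each is assumed to keep its trailing newline.
paragraphs = ["Regards,\n", "Alpha Foo\n"]
text = body_all(paragraphs)
sig = re.compile(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b")
print(repr(text))
print(bool(sig.search(text)))
```

With the paragraphs joined and whitespace collapsed, the original single-line BOGUS_SIG pattern can match across what used to be a paragraph break.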
Comment 5 Loren Wilton 2004-03-06 23:41:30 UTC
I would be somewhat in favor of #6.  It would at least provide the required 
functionality.  However, it has the distinct drawback that it must do the join 
on every invocation of a test that uses that bodypart, at least as you have 
diagrammed it.  Perhaps some more clever code could do the join on the 
first use of the :all functionality within a given mimepart, and be smart 
enough to reuse the existing concatenation if a second test referenced :all (as 
might quite likely be the case).
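
The join-once idea could look something like the following Python sketch. The `MimePart` class, its `all_text` method, and the `join_count` counter are all hypothetical names introduced for illustration; nothing here reflects SpamAssassin's internals.

```python
import re


class MimePart:
    """Hypothetical container for one decoded MIME part's paragraphs."""

    def __init__(self, paragraphs):
        self.paragraphs = paragraphs
        self._joined = None   # cache, filled on first :all use
        self.join_count = 0   # only here to show how often the join runs

    def all_text(self):
        """Join the paragraphs on first use; later callers reuse the result."""
        if self._joined is None:
            self.join_count += 1
            self._joined = re.sub(r"\s+", " ", "".join(self.paragraphs))
        return self._joined


part = MimePart(["Regards,\n", "Alpha Foo\n"])
rules = [
    re.compile(r"Regards, [A-Z][a-z]+ [A-Z][a-z]+"),
    re.compile(r"Alpha"),
]
hits = [bool(rule.search(part.all_text())) for rule in rules]
print(hits, part.join_count)
```

Both rules read the same cached string, so the concatenation cost is paid once per mimepart rather than once per test.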

Part of the problem here is semantic, although the more important part is the 
actual lack of a suitable item to test.  The semantic part is that 'body', at 
least to me, implies something much greater than a single line or paragraph *of 
the body*.  So the "proper" solution would be to make 'body' mean "all of the 
decoded body (of the current mimepart)" and 'paragraph' mean what 'body' now 
actually represents.  Of course, while this would be 'proper' it isn't going to 
fly because of all the tests, both released and developed by others, that would 
have to change.  So the alternative is a new keyword, a new modifier, or 
possibly a tflags (which I know little about) option.

On the subject of deleting 'full', I believe that there are currently at least 
two missing objects to test against.  Development of these objects could well 
eliminate the need for both full and rawbody.

The first missing test object is the subject of this report: a full 
concatenation of the 'body' parts relevant to a single mime part of the 
message.  The primary use for this part would be textual checks that need to 
span multiple lines, and want the html encoding eliminated before the tests.

The second missing part is a de-binhexed, de-quoted-printabled, de-base64ed 
representation of a single mime part.  This would be perhaps essentially the 
concatenation of the rawbody lines for this mimepart.  The primary use of this 
part would be to do html checks that may have to span multiple lines.

Note that both of these parts perhaps should retain the newlines, in case those 
provide appropriate clues to the tests to run on them.  A /s is simple enough 
to use if the newlines are to be ignored.

Number of "body:all" parts in a message:

There should be one 'body' for each processed mimepart.  A text-only message 
will have one body.  An html-only message will have one body.  A text and html 
message will have two bodies.  A text, html, and binary attachment will still 
have two bodies, because the binary attachment is discarded.

There should NOT be a single 'body' for all mimeparts of the message 
concatenated.  Or if there is, it should be something separate from what is 
being discussed here.
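
The per-mimepart counting described above can be sketched with Python's standard email package standing in for SpamAssassin's own parser. The `text_bodies` helper is a hypothetical name, and HTML rendering is deliberately left out; the sketch only shows which parts would get a body.

```python
from email.message import EmailMessage


def text_bodies(msg):
    """Return one decoded string per text/* part, skipping all other parts."""
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            bodies.append(part.get_content())
    return bodies


# Build a text + html + binary attachment message.
msg = EmailMessage()
msg.set_content("Regards,\nAlpha Foo\n")
msg.add_alternative("<p>Regards,<br>Alpha Foo</p>", subtype="html")
msg.add_attachment(b"\x00\x01", maintype="application", subtype="octet-stream")

print(len(text_bodies(msg)))  # one body per text part; the attachment is skipped
```

Each text part yields its own body and the binary attachment yields none, which matches the counts given above for a text, html, and binary message.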

Comment 6 Daniel Quinlan 2004-08-27 17:19:52 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 7 Justin Mason 2005-03-11 15:09:59 UTC
lowering priority -- this would be an RFE
Comment 8 Daniel Quinlan 2005-04-03 01:18:12 UTC
moving to Future milestone
Comment 9 Fred T 2006-05-15 17:23:25 UTC
With the changes in rawbody for 3.2, won't this be resolved?
Comment 10 Theo Van Dinter 2006-06-27 17:45:54 UTC
I'm considering this resolved.