SA Bugzilla – Bug 3131
RFE: add an ":all" modifier to body/rawbody to match entire message body text as 1 string
Last modified: 2006-06-27 10:45:54 UTC
I have a class of spams that appear sometimes in HTML and sometimes in plain text. A characteristic of all of these spams is a signature of the form:

  Regards, Alpha Foo

(or another random two-word name). In HTML:

  <P>Regards, <P>Alpha Foo <P>

or:

  <P>Regards,<BR> Alpha Foo<BR>

Seemingly a simple body test should catch this:

  body BOGUS_SIG /\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b/

This in fact works on the 20% or so of these spams that are plain text. It does NOT work on any of the HTML spams, because "body" breaks the text into multiple hunks at <P> and <BR> marks! Thus, "body" on HTML is approximately as useless as rawbody when trying to find a specific sequence of words that might span a line break.

Now, it will be argued (erroneously) that this can be handled by a meta, and thus splitting the body isn't a problem:

  body __REGARDS /\bRegards,/
  body __SIG /\b[A-Z][a-z]+ [A-Z][a-z]+\b/
  meta BOGUS_SIG (__REGARDS && __SIG)

I leave it to the reader to figure out why that one won't work. (Hint: the two sub-rules can match in completely unrelated places, so the meta fires on any message containing both a "Regards," and any capitalized two-word pair.)

Thus, the ONLY current way to detect this particular spam signature is to use full. Which means that the regex now has to parse HTML and line breaks. And it will fail if the body is encoded in quoted-printable or base64. Or if the text appears in a header. Or on any number of other possible obfuscations or erroneous hits.

It is argued that 'body' needs to be separate pieces to reduce regex overhead. I argue that forcing simple body tests onto full (which is by definition a larger hunk of text than the combined body of the message) INCREASES overhead, both because a larger text string is searched, and because the tests themselves become very convoluted in attempting to un-encode all of the various ways that a body can be encoded. Thus the body decoding has to be done multiple times.

Another argument would of course be that we don't need to detect obvious spam signatures, since being able to look for them would perhaps increase SA overhead.
The rejoinder to that is, what the heck is the purpose of SA if not to detect spam by its characteristics? Do we expect the spammers to purposely code their spams to make them easy for SA to detect?
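The failure mode described in the report above can be sketched outside SpamAssassin. This is a minimal illustration in Python (not SA's Perl, and not SA code): `paragraphs` stands in for the rendered hunks that "body" rules are run against, split at the <P> marks.

```python
import re

# Rendered text chunks as SpamAssassin's "body" sees them: the HTML
# <P> markup has split the signature across separate paragraphs.
paragraphs = ["Regards,", "Alpha Foo"]

sig = re.compile(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b")

# Per-paragraph matching (current "body" behavior): no single chunk
# contains the whole signature, so the rule never fires.
per_paragraph = any(sig.search(p) for p in paragraphs)

# Matching against the rejoined body text: the signature is found.
joined = re.sub(r"\s+", " ", " ".join(paragraphs))
whole_body = bool(sig.search(joined))
```

The same regex that fails on every individual hunk matches trivially once the hunks are concatenated, which is the gap this RFE asks to close.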
I would certainly not like to see this rule added: I've just tried it locally, and it catches a common way that I, and others, sign our emails at times ("Regards, Colin Ogilvie"). Also, it doesn't just catch the example you gave here of "Regards, Alpha Foo" -- it would catch EVERY email I send from my home computer, and a lot of the ones I get from companies etc.
This rule was not intended to be used standalone, but as part of a meta to catch these particular spams. These spams have about 4 identifying characteristics that between them are quite unique. Losing any of the parts of the meta increases the likelihood of a FP considerably. As of last night, the combined meta has been catching my spams with no FPs on a 100K corpus. The particular case that I had to code with a full body search does indeed produce many FPs by itself.

The new strain of spams we are starting to see have very small bodies, often in plain text, and do NOT contain all of the usual obfuscation tricks that are so easy to detect and filter out. Filtering them can be quite tricky, and the possibility of FPs is higher than on more traditional spam, so the rules have to be watched carefully and tweaked as necessary rather than thrown in a box and forgotten. Note I am NOT proposing this rule as part of the SA main tests; it would not be appropriate there. It IS however appropriate for this current particular spam when used with the other tests.
The current rule is:

  full __REGARDLESS_SIG m|Regards,[\s\r\n<pP>]{0,10}[A-Z][a-z]+\s[A-Z][a-z]+|
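For anyone following along, the rule's regex can be exercised directly; this is the same pattern transliterated into Python purely to check its behavior (the character class allows whitespace plus the literal characters '<', 'p', 'P', and '>', so up to ten such characters may sit between the comma and the two-word name):

```python
import re

sig = re.compile(r"Regards,[\s\r\n<pP>]{0,10}[A-Z][a-z]+\s[A-Z][a-z]+")

samples = [
    "Regards, <P>Alpha Foo",   # HTML variant with a <P> break
    "Regards,\nAlpha Foo",     # plain-text variant across a newline
    "Best wishes, Alpha Foo",  # different salutation: no match
]
results = [bool(sig.search(s)) for s in samples]
```

The first two samples match and the third does not, which is why the rule spans the HTML and plain-text variants but, as noted, FPs heavily on legitimate sign-offs when used alone.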
Well, here's my 2 cents:

1) It used to be that the tests would run over the whole thing at once. That blew up horribly due to RE issues/backtracking/etc.; that's when we went to doing things by paragraph.

2) Besides your example, the current system works very well.

3) You're upset that <P> makes lines get treated as different paragraphs, but that's what <P> does -- makes different paragraphs.

4) "Do we expect the spammers to purposely code their spams to make them easy for SA to detect?" The answer of course is no, but what you're looking for isn't really a spamsign, so ...

5) I agree that full is not appropriate for "simple" tests; that's not what it's meant to be used for. I'd actually like to get rid of "full", since nothing we have uses it, and if you're resorting to full there's likely a better way to search for what you want.

6) I wouldn't mind adding some form of ":all" functionality for body/rawbody which would basically do a join("", @array_of_paragraphs) and s/\s+/ /g before doing the test. That way people who want to try it can, and if they blow their own system up with a bad RE -- well, that's not our problem. ;)
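The ":all" semantics proposed in #6 amount to one small transformation. A sketch in Python (the name `body_all` is hypothetical; SA's rendered paragraphs are assumed to keep their trailing newlines, which is what makes the bare `join("", ...)` work):

```python
import re

def body_all(paragraphs):
    # Proposed ":all" view of the body: concatenate the rendered
    # paragraphs and collapse every run of whitespace to a single
    # space -- the join("", @array_of_paragraphs) plus s/\s+/ /g
    # suggested in comment #6.
    return re.sub(r"\s+", " ", "".join(paragraphs))

# The split signature from the original report now matches:
text = body_all(["Regards,\n", "Alpha Foo\n"])
matched = bool(re.search(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b", text))
```

Note that the whitespace collapse also erases the paragraph boundaries, which is exactly the trade-off being debated: rules gain cross-line matching but lose any clue that a break was there.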
I would be somewhat in favor of #6. It would at least provide the required functionality. However, it has the distinct drawback that, at least as you have diagrammed it, it must redo the join on every invocation of a test that uses that body part. Perhaps some cleverer code could do the join on the first use of the :all functionality within a given mime part, and be smart enough to reuse the existing concatenation if a second test referenced :all (as might quite likely be the case).

Part of the problem here is semantic, although the more important part is the actual lack of a suitable item to test. The semantic part is that 'body', at least to me, implies something much greater than a single line or paragraph *of the body*. So the "proper" solution would be to make 'body' mean "all of the decoded body (of the current mime part)" and 'paragraph' mean what 'body' now actually represents. Of course, while this would be 'proper', it isn't going to fly because of all the tests, both released and developed by others, that would have to change. So the alternative is a new keyword, a new modifier, or possibly a tflags option (which I know little about).

On the subject of deleting 'full', I believe that there are currently at least two missing objects to test against; development of these objects could well eliminate the need for both full and rawbody. The first missing test object is the subject of this report: a full concatenation of the 'body' parts relevant to a single mime part of the message. The primary use for this object would be textual checks that need to span multiple lines and want the HTML encoding eliminated before the tests. The second missing object is a de-binhexed, de-quoted-printable, de-base64ed representation of a single mime part -- essentially the concatenation of the rawbody lines for that mime part. The primary use of this object would be HTML checks that may have to span multiple lines.
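The caching suggested above is a one-field memoization per mime part. A toy sketch in Python (the `MimePart` class and `body_all` method are hypothetical names, not SpamAssassin internals):

```python
import re

class MimePart:
    """Toy mime part that caches its ':all' rendering, so a second
    rule referencing :all reuses the existing concatenation instead
    of redoing the join."""

    def __init__(self, paragraphs):
        self.paragraphs = paragraphs
        self._all = None  # computed lazily, at most once

    def body_all(self):
        if self._all is None:
            self._all = re.sub(r"\s+", " ", "".join(self.paragraphs))
        return self._all
```

With this shape, the join cost is paid only by messages that are actually hit by an :all rule, and only once per mime part no matter how many :all rules fire.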
Note that both of these parts should perhaps retain the newlines, in case those provide appropriate clues to the tests run on them; a /s is simple enough to use if the newlines are to be ignored.

Number of "body:all" parts in a message: there should be one "body" for each processed mime part. A text-only message will have one body. An HTML-only message will have one body. A text-and-HTML message will have two bodies. A text, HTML, and binary attachment message will still have two bodies, because the binary attachment is discarded. There should NOT be a single 'body' for all mime parts of the message concatenated. Or if there is, it should be something separate from what is being discussed here.
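The part-counting rule above can be stated as a one-liner; this Python sketch (the function name and the use of MIME type strings are illustrative assumptions, not SA's actual part-walking logic) just encodes "one body per renderable text part, binary parts discarded":

```python
def count_bodies(mime_part_types):
    # One ":all" body per renderable text part; anything that is not
    # text/plain or text/html is discarded and contributes no body.
    return sum(1 for t in mime_part_types if t in ("text/plain", "text/html"))

text_only = count_bodies(["text/plain"])
text_and_html = count_bodies(["text/plain", "text/html"])
with_attachment = count_bodies(["text/plain", "text/html", "application/zip"])
```

The three cases come out 1, 2, and 2, matching the enumeration in the comment above.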
more accuracy and performance bugs going to 3.1.0 milestone
lowering priority -- this would be an RFE
moving to Future milestone
With the changes in rawbody for 3.2, won't this be resolved?
I'm considering this resolved.