SA Bugzilla – Bug 5041
do not use body/rawbody rules on CType 'message/partial'
Last modified: 2011-05-31 23:23:49 UTC
Pasting from a users thread, from Mark Martinec:
--------------------------------------------------------------------
I recently noticed a couple of cases where SA (3.1.4 or earlier) would take over a minute (instead of a few seconds) to check a 500 kB message. Investigation revealed that these cases have one thing in common: they were all message/partial chunks of a longish transfer of some document or other data. Moreover, most of these cases were hitting random sets of SARE or baseline rules, yielding false positives. In case someone would suggest that Content-Type: message/partial should be banned outright - well, that is a policy decision, and if it is allowed, it should not bring SA to its knees on a 0.5 MB message.

Here is one example where a command-line 'spamassassin -t -D' would run for 68 seconds. Timestamping each debug line produces the following top-10 lines, sorted by elapsed time; the first column is the time in seconds for this line to appear after the previous one:

 1.935 dbg: rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
 2.204 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
 3.695 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0il"
 3.976 dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "i"
 4.021 dbg: rules: running raw-body-text per-line regexp tests; score ...
 6.397 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " Sjx"
 8.225 dbg: bayes: tok_get_all: token count: 37175
 8.254 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
 9.682 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
11.999 dbg: rules: running body-text per-line regexp tests; score so far=2.501

and another example:

 2.396 dbg: rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "b0y"
 2.424 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
 2.627 dbg: bayes: tok_get_all: token count: 36631
 3.421 dbg: rules: running body-text per-line regexp tests; score so far=0.203
 3.826 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0Il"
 4.181 dbg: rules: running raw-body-text per-line regexp tests; score ...
 4.265 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " S8X"
 8.113 dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "XoNOgX"
 9.308 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
 9.945 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"

I know some of these are SARE rulesets, but some are baseline rules or Bayes token parsing. Here is a relevant section/sample of one of these messages:

MIME-Version: 1.0
Content-Type: message/partial; total=22; id="01C6BB9C.7D698F00@zogica"; number=21
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2869
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869

f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg
VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR
XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7

So the problem is that these base64-encoded lines in a message/partial chunk are treated as obfuscated text, which is very slow, and produces almost random hits on various rules. It also places some burden on the SQL server (bayes: tok_get_all: token count: 37175).
A somewhat similar case, which also hits various obfuscation rules because its uuencoding is mistaken for plain text, is mail with attachments produced by Microsoft Office Outlook when the user has the following setting chosen:

Tools -> Options -> Mail Format -> Internet format: plain text options: (YES) Encode attachments in UUENCODE format when sending a plain text message

It would be nice if such encodings were recognized, and rules that expect plain text were at least prevented from running and/or producing false hits.

Mark
--------------------------------------------------------------------
When I run a scan on my laptop here, using svn trunk and the default ruleset, it takes 25 seconds; still pretty slow.

Issue #1: I guess this comes down to how a message/partial is treated in common MUAs; as far as I can see, it's not displayed as text, therefore we shouldn't scan it as text.

Issue #2: A side issue is that the ReplaceTags rules perform pretty badly on 500 KiB files with 78-char, no-space lines.

Issue #3: an escape for UUencoded messages. We used to have this, but removed it, since it slowed down the common case to deal with the extremely rare case -- I seem to recall we checked our corpora, and none of us had a single UUE'd message in over 5 years or so. Has anyone used UUE in years? If not, I'm -1, even if Outlook stupidly still supports it. (If we were to design SpamAssassin based on MS product decisions, we'd be in as much of a mess as they are.)

Mark -- may I upload that sample to this bug? Without it, everyone else will be unable to reproduce the issue, test fixes, etc.
Created attachment 3632 [details]
sample message/partial message

here's the sample (thanks Mark!)
Outlook does indeed create UUE mails under a number of circumstances in its various versions. Most of the cases where it will do this are beyond the user's control, and probably beyond the Exchange administrator's control (if there is one).

I think there was a bug thread I started a year or two ago about trying to either decode UUE and check for an image, or simply ignore UUE for the case of text rules, because the obfuscation rules fall all over themselves as soon as they hit UUE. I know the resolution was either INVALID or WONTFIX, but I still think it was the wrong resolution.

Hmm, how about a 'tflags obfu' type of flag, to mark rules that shouldn't be applied to encoded data?
If the problem is that various rules that expect to look at words are slowed down by blocks with no spaces, could a solution be to have one piece of code that tokenizes the body into a data structure that can be used by all word-oriented body rules? That one piece of code could skip large blocks without spaces. How does that fit in with the processing that is already done to prepare the input for body rules?
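One way to read that idea, sketched below as illustrative Python (SpamAssassin's actual preprocessing is Perl and does not work this way; the function name and the word-length cutoff are assumptions): tokenize the body once, drop the space-free blobs at tokenization time, and let every word-oriented rule share the resulting list.

```python
# Illustrative sketch only -- not SpamAssassin code. Tokenize a body into
# words once, skipping overly long "words" (base64/uue runs have no spaces,
# so they show up as single huge tokens).
MAX_WORD = 30   # assumed cutoff: longer "words" are unlikely in human text

def tokenize_body(lines):
    words = []
    for line in lines:
        for w in line.split():
            if len(w) <= MAX_WORD:   # drop encoded-looking runs
                words.append(w)
    return words
```

Word-oriented rules would then scan `words` instead of re-splitting every line themselves.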
Created attachment 3633 [details]
Sample mail with uuencoded attachment from MS Office Outlook

The attached 410 kB mail was generated by MS Office Outlook with the setting "Encode attachments in UUENCODE format when sending a plain text message", and takes 100 seconds to check with SA 3.1.4 + sa-update + basic SARE rules + Bayes@SQL, collecting plenty of false-positive hits on various rules: WEIRD_QUOTING, FB_DOLLAR_ASS2, OBSCURED_EMAIL, FUZZY_MILF, FB_NOT_SEX, SARE_OBFU_NUMBERS, SARE_PROLOSTOCK_SYM3, SARE_OBFU_PART_ALI, ML_MARKETING, SARE_ADLTOBFU, SARE_RMML_Stock1, SARE_HTML_URI_LHOST31, SARE_URI_EQUALS, BAYES_99, HG_HORMONE, DRUGS_MUSCLE, SARE_RAND_2, SARE_RAND_6, UPPERCASE_50_75
(In reply to comment #4)
> Sample mail with uuencoded attachment from MS Office Outlook

P.S. As a curiosity, here is again a list of the slowest rules (sorted) when applied to the attached message; column 1 is elapsed time in seconds:

 1.365 rules: ran body rule __DRUGS_MUSCLE1 ======> got hit: "&S[0M!@"
 1.608 rules: ran body rule SARE_ADLTOBFU ======> got hit: "B0Y"
 1.609 rules: ran body rule SARE_OBFU_NUMBERS ======> got hit: "$8OB"
 2.079 rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
 2.129 rules: ran body rule FUZZY_MILF ======> got hit: "M?Y%!6F"
 2.322 rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
 2.747 rules: ran rawbody rule __OBFUSCATING_COMMENT_A ======> got hit: ...
 2.888 rules: ran body rule SARE_PROLOSTOCK_SYM3 ======> got hit: "NNOS"
 4.129 rules: ran body rule OBSCURED_EMAIL ======> got hit: "I^=0.NY.E9...
 5.828 rules: ran body rule SARE_OBFU_PART_ALI ======> got hit: " MGFA1IT-"
 6.804 rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
 7.863 rules: ran body rule FB_NOT_SEX ======> got hit: " S=X"
12.893 rules: running body-text per-line regexp tests; score so far=-1.797
28.586 bayes: tok_get_all: token count: 106072

Note that Bayes processed 106072 tokens.
Ignoring the obfuscation bits for a minute, a first reaction is that currently, when grabbing text to process, we look for /^(text|message)\// parts, since we may have attached message/rfc822 (etc.) parts that we should look at for rules. However, message/partial should probably be skipped, unless the part says there's only 1 part. I don't really think scanning a partial message is a good idea.

My second reaction is to this: "I recently noticed a couple of cases where SA (3.1.4 or earlier) would take over a minute (instead of a few seconds) to check a 500 kB message." It's well documented that people should limit the size of messages passed to SA, and the current suggested max is ~250KB. Whether SA should handle larger messages, and how to get there, is a different discussion.
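The part-selection logic described above could be sketched as follows. This is illustrative Python, not SA's Perl code; the function name and parameter shapes are hypothetical, but the content-type pattern and the single-part exception follow the comment.

```python
# Illustrative sketch: decide whether a MIME part should be rendered as
# text for body rules. message/partial chunks are encoded payload
# fragments, not renderable text, so they are skipped -- unless the
# "split" actually consists of a single part.
import re

def should_scan_as_text(content_type, params):
    """Return True if this part's text should be fed to body rules."""
    ctype = content_type.lower()
    if ctype == "message/partial":
        # only scan if the part claims to be the one and only chunk
        return params.get("total") == "1" and params.get("number") == "1"
    # text/* and other message/* parts (e.g. message/rfc822) are fair game
    return re.match(r"^(?:text|message)/", ctype) is not None
```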
(In reply to comment #6) > ...and the current suggested max is ~250KB. I know, our limit was 512 kB (increased as some bigger spams started appearing in recent months), and the sample message was just below this limit.
'the current suggested max is ~250KB' -- that's a distraction. If you cut the message down to fit into 250k, it still takes a dangerously long time -- 19.4 secs vs. 25 secs on my unloaded laptop, for example. Let's not get sidetracked ;) (btw, those times don't include Bayes, fwiw)
In the early days, when Postfix header_checks and body_checks were still effective, folks had similar problems, and the solution was to prepend a condition to such regexp sets, so that lines made up entirely of base64 (or uue) characters of certain common encoding widths would let the more refined (and expensive) tests be skipped. Checking whether a line looks like a base64 or uue line should be relatively quick. I don't know how that applies to SA.
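The cheap pre-check described above might look something like this. A sketch in Python (the Postfix-era checks were regexps in a different context, and the width thresholds here are assumptions based on typical base64/uuencode line shapes, not taken from any SA or Postfix config):

```python
# Illustrative pre-check: does a line look like base64 or uuencode output?
# If yes, the expensive obfuscation regexps could be skipped for it.
import re

# base64 body lines: only the base64 alphabet at common widths (60-76 chars)
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")
# uuencode data lines: length byte 'M' (45 bytes) then 60 chars in \x21-\x60
UU_LINE = re.compile(r"^M[\x21-\x60]{60}$")

def looks_encoded(line):
    return bool(BASE64_LINE.match(line) or UU_LINE.match(line))
```

Both checks are anchored full-line matches, so they cost one linear scan per line, far cheaper than a backtracking obfuscation pattern.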
> Checking if a line looks like base64 or uue line
> should be relatively quick

And even quicker and easier if that were part of the preprocessing of the raw message body that is passed to the message body rules. We don't even have to insert spaces or delete base64 lines if we add a flag that says "long lines with no spaces", and allow a rule to specify (with a tflag, maybe?) that it should skip those. Or is that getting too specialized to justify a new tflag?
FWIW, the "body" rendering was explicitly designed to be a form of the body that doesn't contain such non-text noise parts -- in order to avoid hacks like exclusion patterns at the start of REs.
> non-text noise parts

Exactly, and it produces an array of "short" lines that are limited to 2048 characters, to prevent overloading rules without them having to deal with line length individually. Perhaps we should say that message text is logically a set of words, and instead of an array of short lines produce an array of short words.

Ok, I've thought about this a bit more, and I'm leaving the previous paragraph I typed to provide context for how I'm thinking about this: the first attached example has a block of 76-character lines with no spaces. If we don't want to break up long URLs, there may not be a "short" word length that would do us any good. Even worse, looking at the second example, with the uuencoded block, the lines are only 64 or so characters long and there are some embedded spaces, yet it still takes too long to process.

What we haven't done is profile the slow rules to see just what the bottleneck(s) is/are. If there is a bottleneck common to all of those rules, once we know what it is we can either do something in message body processing, or come up with some standard thing to do in such rules to avoid it. If we can't do that we may be stuck with figuring out a heuristic for detecting BASE64 and uuencoded blocks and not pass them through into the message body array -- But then we have to be very careful that spammers can't trick the parser to get it to allow text through that will render ok on the mail client.
> If we can't do that we may be stuck with figuring out a heuristic for detecting > BASE64 and uuencoded blocks and not pass them through into the message body > array -- But then we have to be very careful that spammers can't trick the > parser to get it to allow text through that will render ok on the mail client. I don't think this is possible, see my comments in the last ticket that talked about uuencoded bits: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3278 If we start ignoring sections of text or trying to decode it, or ... then we give the spammers a hole to drive a truck through, be it a pickup or a semi.
I agree, Theo. When I say "but then we have to be very careful" I am pessimistic about coming up with a heuristic that is careful enough. I think the best next step is to profile the rules that are slow with these messages to see if there is a single bottleneck that indicates a simple fix.
(a) We used to have code to detect UUE bits in the body-rendering code. We took it out, assuming it was obsolete! ;)

(b) It is indeed viable to ignore sections of the text. For example, if the body text contains 50000 lines of 76-char "words" in a text/plain format, I would say we could safely assume that the last 49700 lines don't need to be scanned using the "body" rules, if we so desired. This is because once rendered in a user's MUA, they won't be visible unless the user spends some time scrolling down -- in other words, they'd be a useless place to hide a spam payload. HTML is different, of course, since there are several ways to move content from near EOF to the top of the MUA display.
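Point (b) could be sketched like this, as illustrative Python (SA's body rendering is Perl; the constants and function name here are assumptions chosen to match the numbers in the comment, not anything SA implements):

```python
# Illustrative sketch of "stop scanning after the first N noise lines":
# once a text/plain body shows a long run of lines made of huge "words",
# assume the rest is encoded payload the user would never scroll to.
MAX_NOISE_LINES = 300   # assumed: how many suspicious lines to keep
LONG_WORD_LEN = 40      # assumed: a "word" this long is unlikely in prose

def truncate_noise_body(lines):
    kept, noise_run = [], 0
    for line in lines:
        words = line.split()
        if words and max(len(w) for w in words) >= LONG_WORD_LEN:
            noise_run += 1
            if noise_run > MAX_NOISE_LINES:
                break            # drop the rest of the encoded blob
        else:
            noise_run = 0        # normal text resets the counter
        kept.append(line)
    return kept
```

On the 50000-line example above, this would keep the readable prefix plus the first 300 noise lines, and body rules would never see the other ~49700.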
I just noticed: The debug logs in the previous comments don't say which rules are taking time. Only rules that hit are printed. Isn't there a script for profiling all rules? > HTML is different Luckily, none of these would be confused with HTML. I like the idea of ignoring anything in a BASE64 block after the first 300 or so lines. So where is that UUE detecting code and how proof is it against spoofing?
(In reply to comment #16)
> I just noticed: The debug logs in the previous comments don't say which rules
> are taking time. Only rules that hit are printed. Isn't there a script for
> profiling all rules?

You enable profiling in perl (perl -d:DProf ...) and then run dprofpp (see the man page), and you can see the list. When I ran it, it looked like the main rules are, unsurprisingly, the FUZZY_* list.

> So where is that UUE detecting code and how proof is it against spoofing?

OMG, he wants to go back to the 2.x code, nooooooo! In a quick look around, the code in question was in PerMsgStatus, which is horribly trivial to bypass (this snippet appears in the loop generating the body text):

  foreach my $line (@{$textary}) {
    if ($uu_region == 0 && $line =~ /^begin [0-7]{3} .*/) {
      $uu_region = 1;
      next;
    }
    if ($uu_region) {
      if ($line =~ /^[\x21-\x60]{1,61}$/) {
        # here is where we could uudecode text if we had a use for it
        # $decoded = unpack("%u", $line);
        next;
      }
      elsif ($line =~ /^end$/) {
        $uu_region = 0;
        next;
      }
      # any malformed lines get passed through
    }
    $_ .= $line;
  }
Profiling on the second example showed that most of the time is spent in check_unique_words, and then in various DRUG_PAIN and FUZZY_ rules. The problem is that none of them takes all that much time individually, but they each take enough that it adds up to a lot. The thing they have in common is that they match regexp patterns with multiple constructs like [_\W]{0,3}. It isn't the \W in particular, but rather anything that gets a lot of matches per line and is variable length. Experimenting with matches using .{0,3} or \.{0,3} instead, I didn't see enough of a timing difference to be worth bothering with, so it isn't just the character class that is being used. Unless someone knows of a way to do that kind of matching significantly faster, I don't see what we can do to make the rules run faster on full message bodies, which leaves Justin's suggestion from comment #15.
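A toy illustration of the misfire mode involved: a pattern with variable-length [_\W]{0,3} gaps (the pattern below is made up, but is similar in spirit to the FUZZY_*/SARE obfuscation rules) happily matches inside base64-looking noise, producing hits like the " S=X" one in the log above. Sketched in Python rather than SA's Perl, but the regex semantics are the same.

```python
# Illustrative only -- not an actual SA rule. An obfuscation-style pattern:
# 's', up to 3 punctuation/underscore chars, an 'e'-like char, up to 3 more,
# then 'x'. On random base64 text, sequences like "S=X" satisfy it.
import re

OBFU_SEX = re.compile(r"s[_\W]{0,3}[e3=][_\W]{0,3}x", re.IGNORECASE)

def false_hit(line):
    """Return the matched 'obfuscated' fragment, or None."""
    m = OBFU_SEX.search(line)
    return m.group(0) if m else None
```

Each such gap multiplies the positions the engine must try per character, which is why many cheap-looking rules add up to minutes on a 500 kB blob.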
Created attachment 3655 [details]
Another example, this time just plain text

Here is another related example that floated by today. This time it is plain text (no encodings), a 736 kB cron report from a mirror host. It takes 200 seconds, mostly on SARE rules and Bayes SQL (token count: 15445). Again, a top-10 list of the debug lines that took longest to appear when running spamassassin -t -D; the first column is time in seconds:

 4.144 rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
 7.151 rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "p0rn"
 7.661 rules: ran body rule SARE_ADLTOBFU ======> got hit: "p0rn"
 8.219 rules: running raw-body-text per-line regexp tests; score so far=6.024
 9.710 rules: ran body rule __SARE_SPEC_PROLEO3 ======> got hit: "3.00"
10.189 rules: ran body rule __SARE_SPEC_PROLEO1 ======> got hit: "1.20"
12.015 rules: running body-text per-line regexp tests; score so far=0
14.534 rules: ran body rule SARE_OBFUPORNO ======> got hit: "p0rn-"
33.351 rules: ran body rule __SARE_SPEC_LRD_COST2 ======> got hit: "1.21"
75.123 bayes: tok_get_all: token count: 15445

I don't have any good suggestions :-/
Mark, some comments in response to your comment 19:

The example is bigger than our recommended maximum size. Also, the example is something that should be whitelisted. In fact, it's probably a good idea to have local mail from cron jobs not run through SpamAssassin at all. Of course, that doesn't change the fact that this is an example of a message that takes 200 seconds to process, which means it could provide insight into what needs to be done to prevent, say, 100-second messages that can't be filtered out so easily.

It does demonstrate that the problem is not restricted to BASE64 or UUE encoding, so we cannot simply try to detect those to fix it. That's good in a way, because as has been pointed out we can't easily filter for UUE encoding, and this way we won't get frustrated trying to do that.

Running spamassassin with -D and looking at the timestamps does not show which rules are taking the time; the only rules that show up there are the ones that hit. You should use profiling, as Theo described in comment 17.

Some of this could be helped by the XS work Justin is doing. That doesn't help Bayes tokenizing, though. Does it make sense to use XS for that? Can it be sped up in pure perl?

Another approach is Justin's comment 15, on looking at only the first n lines of non-HTML text, which is what I'm leaning towards now.
> The example is bigger than our recommended maximum size. Oops, sorry, my mistake - last time I fiddled with the limit I made a typo, and didn't notice it even when typing my previous note. > Running spamassassin with -D and looking at the time stamps does not > show which rules are taking the time. ... You should use profiling as > Theo described in comment 17. Nod. (but it is a quick and simple way to concentrate on trouble areas)
I think we've missed one easy-to-fix part of this: message/partial Content-Types should not be scanned using body or rawbody rules. let's get that part out of the way at least...
(In reply to comment #22) > I think we've missed one easy-to-fix part of this: message/partial Content-Types > should not be scanned using body or rawbody rules. let's get that part out of > the way at least... Is it possible to send/render a message that is "split" into a single message/partial section (one email)?
this is ameliorated in 3.3.0 due to the fix for bug 5717, which splits the 'rawbody' representation into chunks of sizes between 1-2KB. However we still need to skip scanning of message/partials.
upping pri
the message/partial case is now fixed in trunk:

: jm 84...; svn commit -m "bug 5041: do not render message bodies of MIME type 'message/partial'"
Sending        lib/Mail/SpamAssassin/Message.pm
Transmitting file data .
Committed revision 648864.

patch for 3.2.x to follow.
Created attachment 4304 [details]
fix for 3.2.x
I'm still curious about the question I posed in comment #23. Does skipping a message/partial part that actually contains the entire message open us up to an easy way to bypass body scanning?
(In reply to comment #28) > I'm still curious about the question I posed in comment #23. Does skipping a > message/partial part that actually contains the entire message open us up to an > easy way to bypass body scanning? It does -- I'll attach a sample to demo this -- but note that any use of message/partial will fire FRAGMENTED_MESSAGE, for 2.5 points. so for spammers, it'd be a question of how many body rules they could evade for a 2.5 point penalty -- and it has no cloaking effect on the higher-scoring header/network rules anyway...
Created attachment 4324 [details]
a single-part message/partial test mail

GMail renders this correctly; SA only sees the headers (but fires FRAGMENTED_MESSAGE).
This gets trickier. message/partial can also contain headers -- not just the message body. So a message/partial can override the To:, From: or Subject: header easily enough. Hmm... I'm starting to not like the current proposal :(

Should we change this algorithm?

- for the first chunk of a message/partial, decode and render it correctly.
- if the first chunk is less than some reasonable length threshold, fire an additional penalty rule. (This is to avoid spammers fragmenting a message into tiny chunks such that the first chunk contains nothing nasty.)
- if the first chunk contains just message headers but no body, fire another penalty. (This is to avoid spammers fragmenting so that the "real" body appears in later, ignored chunks.)
- for the second and later chunks, ignore them, but fire FRAGMENTED_MESSAGE.

Ugh, this is tricky. Suggestions?
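The four-step proposal above can be sketched as follows. Illustrative Python only: the length threshold is unspecified in the comment, so the constant is an assumption, and the two penalty rule names other than FRAGMENTED_MESSAGE are hypothetical.

```python
# Sketch of the proposed message/partial handling (not implemented in SA).
MIN_FIRST_CHUNK = 2048   # bytes; the "reasonable length threshold" is assumed

def handle_partial(part_number, body, headers_only):
    """Return (render_this_chunk, penalty_rules) for one message/partial chunk."""
    penalties = ["FRAGMENTED_MESSAGE"]       # any message/partial fires this
    if part_number > 1:
        return False, penalties              # later chunks: ignore entirely
    if len(body) < MIN_FIRST_CHUNK:
        penalties.append("PARTIAL_TINY_FIRST_CHUNK")   # hypothetical rule
    if headers_only:
        penalties.append("PARTIAL_HEADERS_ONLY")       # hypothetical rule
    return True, penalties                   # first chunk: decode and render
```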
I think we should push this off to 3.2.6 (if any 3.2.x release gets it), due to lack of a conclusive plan of action and possible dangers of opening a loophole...
Removing the Review and whiteboard status due to comment #32
I tried to think about this some more, but my brain hurt. ;) The more I think about it, the less I like message/partial. I think we should increase the score for FRAGMENTED_MESSAGE to something like 4 points, and add an additional rule that matches message/partial mails containing embedded override headers -- imo, they are just nasty and dangerous.
Close. Fix committed to trunk before 3.3 was branched. There were side issues being discussed, but no comments in 2.4 years.

(In reply to comment #26)
> the message/partial case is now fixed in trunk:
>
> : jm 84...; svn commit -m "bug 5041: do not render message bodies of MIME type
> 'message/partial'"
> Sending lib/Mail/SpamAssassin/Message.pm
> Transmitting file data .
> Committed revision 648864.
>
> patch for 3.2.x to follow.
fixed