|
SA Bugzilla – Full Text Bug Listing |
Summary: | fast_body, fast_rawbody, and tmethod support | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Dallas Engelken <dallas> |
Component: | Rules | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | ||
Severity: | enhancement | CC: | antispam, apache |
Priority: | P4 | ||
Version: | SVN Trunk (Latest Devel Version) | ||
Target Milestone: | Undefined | ||
Hardware: | All | ||
OS: | other | ||
URL: | http://www.engelken.net/download/range.patch | ||
Whiteboard: | |||
Attachments: |
fast_body, fast_rawbody, and tmethod
'range' patch |
Description
Dallas Engelken
2005-11-17 21:56:50 UTC
Created attachment 3256 [details]
fast_body, fast_rawbody, and tmethod
implements fast_body, fast_rawbody, and tmethod options.
*cough* Plugin! *cough* Very interesting ideas. However, this functionality really needs to be a plugin. Changing the body code itself to do this is a bad idea. It makes the code much harder to maintain and there may be other ways to speed up tests (for example, index()-based matching). i dont see any way to pluginze this and make it work properly. if you pluginize the 'tmethod' property, what good does that do if PMS::do_body_tests () cant read it? what happens to body rule priorities if you have to eval them through a custom plugin?? if I want to create some fast_body rules and want them to be the first thing that runs to increase efficiency, i cant because they are evals which occur after body and rawbody rules. $self->do_body_tests($priority, $decoded); $self->do_body_uri_tests($priority, @uris); $self->do_body_eval_tests($priority, $decoded); $self->do_rawbody_tests($priority, $bodytext); $self->do_rawbody_eval_tests($priority, $bodytext); try to look past the 'fast_body' and 'fast_rawbody' definitions, as those are just quick wrappers around the 'tmethod' code which is the meat and potatoes of the patch ;) "if I want to create some fast_body rules and want them to be the first thing that runs to increase efficiency" ---- realize i say this in advance of the "yet-to-come-short-circuit" testing... run the fastest rules first that you can short-circuit on to increase efficiency. body WANNA_FSCK /wanna fsck\?/ descript WANNA_FSCK 'wanna fsck' in first 1kb of data tmethod WANNA_FSCK stringtrim 0:1024 tflags WANNA_FSCK shortcircuit score WANNA_FSCK 100 ;) btw, I swapped a couple of mails with Dallas about this -- worth noting that it does provide a significant speed-up -- at least in the original patch. (However, I'm concerned that this version, where 'sub '.$rulename.'_body_test' now includes a lot of body array-manipulation code, may remove that speed improvement... some more benchmarking/moving code about may be necessary with this one.) I'm in favour of the proposed command syntax, although I'd prefer "range" instead of "tmethod" if possible. Normally I'm the biggest fan of pluginizing. However, I think this functionality is valuable in the core -- unless Daniel or Michael can propose a good, fast-enough plugin API for it. ;) I suspect a lot of the speed increases would be minimized if this were to be pluginized. Having said that, I think we need to see some performance numbers (DProf) to see what kind of a speed increase we are talking about (and what the performance hit is if we don't use any tmethod rules). i have yet to do performance testing on this, but i plan to. the performance testing i did previously, was when i implemented a 'tflag fast' method, which basically just took a config value for fast_body_max_bytes and fast_rawbody_max_bytes, did a substr(content,0,<max_bytes>) and performed a single regexp test against that block of content, instead of per line regexp. the speed up was quite good depending on the max_bytes values that were set. even when the max_bytes for body and rawbody were set to 256kb, it was slightly faster to use a single regexp test. i'll just paste those results here.. it was sorta crude testing, but gave me some hope to continue. :) * no tflags fast on body/rawbody rules --------------------------------------------------------------- total scantime: 570.591863 seconds number of files scanned: 740 avg scantime: 0.7707657 seconds --------------------------------------------------------------- And now with 'tflag fast' set on all body and rawbody rules.... * fast_body_max_bytes 262144 * fast_rawbody_max_bytes 262144 --------------------------------------------------------------- total scantime: 552.70254 seconds number of files scanned: 740 avg scantime: 0.7465846 seconds --------------------------------------------------------------- * fast_body_max_bytes 8192 * fast_rawbody_max_bytes 32768 --------------------------------------------------------------- total scantime: 470.123985 seconds number of files scanned: 740 avg scantime: 0.6349819 seconds --------------------------------------------------------------- * fast_body_max_bytes 4096 * fast_rawbody_max_bytes 16384 --------------------------------------------------------------- total scantime: 454.957886 seconds number of files scanned: 740 avg scantime: 0.6145058 seconds --------------------------------------------------------------- as you can see, a 4k body and 16kb rawbody resulted in about a 20% speedup... but this was against the entire ruleset running in tflag fast mode. in reality, thats not something that would ever be done because of loss in hit rates, but i think the point here is that it even proves faster at the 256kb level. although i'm sure there are people that run spamc -s with values larger than 256kb, and performance at some point will get worse over that. so, to justin's point, the increase in size of the function by rule does effect the memory footprint for one... hopefully it doesnt effect the speed. and that footprint will grow more with the addition of custom rules/rule sets. memory usage, stock svn 7658 59.0 1.7 22692 18480 ? Ss 17:39 0:00 /usr/bin/spamd -d memory usage, svn + tmethod patch 7593 0.4 1.8 24556 18956 ? Ss 17:37 0:00 /usr/bin/spamd -d to duncans point, if you "dont use any tmethod rules", then i guess there is no point in adding this to the core. unless the point is to extend the functionality for rule writers. Actually, extending rule writing functionality for SARE is the main reason i'm doing this.. most notably the multiline rawbody rules which are impossible to write efficiently now since full /s testing is just too expensive. dallas "I'm in favour of the proposed command syntax, although I'd prefer "range" instead of "tmethod" if possible." i chose tmethod (test method) to complement tflags (test flags) because they both alter the way the test actually works. if you use 'range', what range would that refer to?? byte range, line range? do you keep syntax range RULE_NAME <method> <args> or do you drop the methods param... range RULE_NAME <start_pos> <bytes> does the above only apply to block style single regexp tests and skip per-line testing altogether? do you use range RULE_NAME line <start> <stop> range RULE_NAME byte <start_pos> <bytes> how do you solve the problem of taking newline and extra whitespace out of rawbody content.. another param completely? there are many ways one could go here... The testing of the method should be done outside $evalstr2. The method won't change across different messages, so there's no need to put the either the test or the code for the unused methods inside the function defined by $evalstr2. With a large number of "fast_body" rules, SpamAssassin will be spending time repeatedly reconstructing the same string to be searched against. I thought the reason that body and rawbody were split into arrays was to protect against worst-case behavior of test regexes. By pasting those lines back together, this patch appears to forgo that protection. In my opinion, this patch adds way to many knobs to tweak. I don't find the less than 20% performance improvement that compelling, especially given it hands spammers an easy way to prevent SpamAssassin from firing rules on their messages. > I thought the reason that body and rawbody were split into arrays was to protect > against worst-case behavior of test regexes. By pasting those lines back > together, this patch appears to forgo that protection. That is the theory. The practice is that spammers have several very trivial patterns that are guaranteed to be impossible to detect in SA, short of writing a separate plugin for each thing you want to test. Essentially, any HTML-level pattern containing one or more line breaks is undetectable. The only current solution is to run 'full' rules over the entire email to test for those patterns. This of course will only work with non-encoded mail bodies. It is perhaps also worth noting that by refusing to allow multiline rawbody, and arbitrarily and unpredictably breaking 'body', you are forcing the use of 'full' rules for things that could be trivially parsed in connected body or rawbody rules. The only conclusion can be that the dev's believe that is is safer and faster to parse the entire email with a regex than it would be to parse a small but useful part of the email with the same regex. okay.. new patch. it now works properly with 'tflags multiple' no matter if you use 'line' or 'byte', and it changes the config from 'tmethod' to simply 'range' like justin requested. i've also moved most of the code out of the $evalstr2 definition into its own sub called get_range_data() which saves quite a bit on memory consumption. ------------------------------------- body __TEST /test/ range __TEST byte 0:512 tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26087] dbg: range_data: type=byte args=0:512 for rule __TEST [26087] dbg: rules: running range type byte [26087] dbg: rules: found "test" at pos 5 - hit #1 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: found "test" at pos 10 - hit #2 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: found "test" at pos 15 - hit #3 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: __TEST got 3 hits ------------------------------------- body __TEST /test/ range __TEST byte 0:10 tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26073] dbg: range_data: type=byte args=0:10 for rule __TEST [26073] dbg: rules: running range type byte [26073] dbg: rules: found "test" at pos 5 - hit #1 [26073] dbg: rules: ran body rule __TEST ======> got hit: "test" [26073] dbg: rules: found "test" at pos 10 - hit #2 [26073] dbg: rules: ran body rule __TEST ======> got hit: "test" [26073] dbg: rules: __TEST got 2 hits ------------------------------------- fast_body __TEST /test/ tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26096] dbg: range_data: type=bytetrim args=0:8192 for rule __TEST [26096] dbg: rules: running range type bytetrim [26096] dbg: rules: found "test" at pos 5 - hit #1 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: found "test" at pos 10 - hit #2 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: found "test" at pos 15 - hit #3 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: __TEST got 3 hits ------------------------------------- # echo -e "From: you\nContent-type: text/html\n\n<html>\n<body>\n\ntest<br />\ntest<br />\ntest<br />\n</body>\n</html>" | spamc fast_rawbody __TEST /test/ tflags __TEST multiple score __TEST 1.0 [26175] dbg: range_data: type=bytetrim args=0:32768 for rule __TEST [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" ------------------------------------- Created attachment 3273 [details]
'range' patch
implements 'range', 'fast_body' and 'fast_rawbody' methods.
well.. i've done my speed test with and without the patch. Average scan times were better with the patch than without... dont know why. memory usage goes up as expected.. starting spamd in both cases as.. # spamd -d -L -s mail ------------------------------------------------------------ Memory & Speed of Stock SVN (3.2.0-r322462) root 31237 54.2 2.4 31448 25236 ? Ss 11:21 0:02 /usr/bin/perl -T -w /usr/bin/spamd -d -D -L -s mail root 31240 0.0 2.4 31448 25228 ? S 11:21 0:00 spamd child root 31241 0.0 2.4 31448 25228 ? S 11:21 0:00 spamd child total scantime: 761.461411 seconds number of files scanned: 987 avg scantime: 0.7708705 seconds ------------------------------------------------------------ Memory & Speed after applying range patch... (3.2.0-r322462) root 890 33.8 2.7 33468 28612 ? Ss 12:01 0:02 /usr/bin/spamd -d -L -s mail root 892 0.0 2.7 33468 28612 ? S 12:01 0:00 spamd child root 893 0.0 2.7 33468 28612 ? S 12:01 0:00 spamd child total scantime: 756.437461 seconds number of files scanned: 987 avg scantime: 0.7657802 seconds ------------------------------------------------------------ other benchmarks welcome... thanks. followup to comment 14 no rules were applied using any range parameters... the test was just to see what overhead the extra code added. with this patch running, and testing various SARE rulesets, i'm seeing thousands of Malformed UTF-8 warn messages... [8498] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xcf) in pattern match (m//) at /usr/share/spamassassin/70_sare_adult.cf, rule SARE_OBFUPUSS, line 1, <GEN10> line 266. when i remove this regex test, they go away... while ($fb =~ '.$pat.'g) { is there a better way to do this?? i see the regex test in 'sub do_full_tests' uses while ($$fullmsgref =~ '.$pat.'g) { yet it never produces UTF-8 warnings when I change those body rules (ie SARE_OBFUPUSS) from type body to type full. is this because of the way the $evalstr is built up in do_full_tests()? any recommendations to avoid these UTF-8 warns? thanks... What are the chances that you are concatenating a bunch of separate strings, and somehow ending up with the null terminators in the compound string? Sounds like that might produce this kind of warning. More likely when you truncate $fb, you somehow do it so the last UTF-8 sequence ends mid-byte on some of your test case messages. i've commented out the substr of $fb, and changed the regexp to match a constant string, and still see the Malformed UTF-8 # $fb = substr($fb,$arg1,$arg2); '.$self->hash_line_for_rule($rulename).' pos $fb = 0; #while ($fb =~ '.$pat.'g) { while(q#blah# =~ '.$pat.'g) { even if i change it to while(q#blah# =~ /this is a test/g) { i'm still seeing malformed utf-8 in debug... so this must be coming from some where else. also, just saw a report on the SARE user list this morning about "hundreds of malformed utf-8 errors on svn r349275". John's theory sounds very plausible to me. can we get a list of perl versions that this occurs with? UTF-8 errors are very variable based on perl release. also, is it always 'unexpected non-continuation byte 0x00'? 0x00 would indicate a UTF-8 char chopped off at end of string. full debug logs may be enlightening. The SARE rule-related warnings are being tracked in bug 3787. They reportedly happen even without this patch. yes.. it was a premature diagnosis. the utf-8 warns are content specific... and happen even against the if ('.$pat.') code. ping! just wondering what people think about this patch. I think we got distracted by the utf-8 issue... I'm about 130% in favor of this. If nothing else it would finally make rawbody a usable type for rule writers and we can get away from writing 'full' rules that will still only catch half the spams. i've been running it for 2+ months on a uri scraper that i use for uribl.com. and for about 4-5 weeks on ~1000 mtas. since i'm scraping alot of html data, i have to use lots of rawbody rules and it comes in quite handy. Something that I couldnt catch before was this <html> www.myspamuri.com </html> and now i can... rawbody HTML_URI_ONLY m'<html> ?(<body> )?(www\.)?[a-z0-9\-] {5,64}\.(com|net|info|biz) ?(</body>)? ?</html>'i range HTML_URI_ONLY bytetrim 0:256 couple things i'll note from my previous patch. the first dbg() call in get_range_data() causes lots of debug because its called per rule.. so that should be removed or commented out. also negative offsets supplied on a byte range do not work due to this line.. + if ($args && $args =~ m/(\d+)(:(\d+))?/) { should be + if ($args && $args =~ m/(\-?\d+)(:(\d+))?/) { This makes a rule like this start to work. body __FREEBIE_FOOTER /Home.{1,5}Disclaimer.{1,5}Privacy Policy.{1,5}Unsubscribe/i range __FREEBIE_FOOTER byte -256:256 i can make a new patch if necessary, but nothing else has changed. adding range checking to full rule types should probably be added as well. d I'm concerned that this patch convolutes two different issues. One is wanting to test specific parts of the text. I'm not sure how often this crops up (I've only *really* wanted it once and ended up not needing it), but it could be a useful efficacy addition. The other is the need for a better rawbody test. This seems like a good idea, but I'd prefer to just change how rawbody works. I think that's something we could do regardless. Make a new test type for rawbody that is more useful in general. Anyhow, the one time I wanted to test a specific range, an eval did the job fine. In other words, I still think this needs to be a plugin. If plugins don't allow what is needed, then we should try to address that. I should mention that this bug resembles bug 1050... my concern is that pluginizing this will break rule priorities since body evals are handled after body rules, rawbody evals are handled after rawbody rules. and if the primary reason for this is efficiency, prioritizing range based rawbody rules higher because they are faster would not be possible. i'm not sure how a pluginzied check() method will alter this functionality, but it would be nice to prioritize rules regardless of their rule type, tflags, meta dependancies, etc etc. With a few exceptions (meta and DNS) it is currently possible to prioritize rules regardless of rule type. Just assign them the priorities you want. Dallas, think you could whip up a patch for just adding tflag multiple into current SVN? This feature alone *might* be worth some value to us all.. fred, i believe tflag multiple support is already in svn. =item tflags SYMBOLIC_TEST_NAME [ {net|nice|learn|userconf|noautolearn|multiple} ] =item multiple The test will be evaluated multiple times, for use with meta rules. Only affects header, body, rawbody, uri, and full tests. Is this patch still feasible? I'd like to resurrect this idea in a slightly different manner: default limits. Something like this (assuming "string 0:0" removes any scope from the query): default_tmethod string 0:512000 body TEST_WHOLE_MSG /\btest\b/ tmethod TEST_WHOLE_MSG string 0:0 body TEST_FROM_END /\bsincerely,/i tmethod TEST_FROM_END string -100:100 With uri, body, rawbody, etc. limited to just the first 500kB (the default max_size for spamc), we should theoretically then be able to increase the default max_size considerably without much overhead. Any rule (such as the two defined above) that wants to re-scopes itself can do so explicitly. This should significantly reduce scan time, especially if a script regularly drills into the timing logs and reduces the scope of the more resource-intensive rules (incrementally until it affects the hit stats). (In reply to comment #32) > > This should significantly reduce scan time, especially if a script regularly > drills into the timing logs and reduces the scope of the more > resource-intensive rules (incrementally until it affects the hit stats). A 512kB limit will not significantly reduce anything. After all, how many mails are bigger than that? Have you seen my tests? Even a 30kB limit was only 7% faster given enough corpus to test. https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582 |