SA Bugzilla – Bug 6633
new functionality: limiting bayes engine to mail not larger than...
Last modified: 2011-10-04 17:26:11 UTC
I'm getting a lot of spam. Part of it is sent via botnets, and those messages are rather small; they aren't a problem. I have a bigger problem with UCE mails sent via a real SMTP engine. They can be scored using the standard rules and my own additional rules. The problem appears when such an email contains a lot of words: the Bayes engine spends too much time on it. As a result, the whole scan is marked as timed out and the mail is delivered to the user without any filtering.
Maybe it wouldn't be hard to implement options such as:
- bayes_max_email_size - only emails smaller than ... would be scored by Bayes
or
- bayes_max_words - only emails with a word count lower than ... would be scanned
The second option is better for me, but the first is also sufficient (and probably easier to implement). Thanks.
While that's not implemented, would you get better results if you completely disabled bayes? "use_bayes 0"
Ooh, you could set a size threshold for spamassassin to skip all emails as big as the ones causing your timeouts, and then, perhaps via procmail, run emails over that size through spamassassin again with a flag to disable bayes (--cf='use_bayes 0').
A procmail rule that only runs on messages above a minimum size would look something like:
:0fw: spamassassin.lock
* > 65536
| /usr/bin/spamassassin
I find it really difficult to believe that Bayes would cause scanning timeouts absent other factors. Can you provide debugging traces that show the timeout and include whatever indicators lead you to believe that the number of words is the causative factor? A sample of a spam that SA timed out on would be helpful. Typically when things start timing out we first recommend looking at memory and number of child processes, as the most common cause for timeouts is overloading available memory and hitting swap. What are the resources available on the computer that's hosting SA? There's also the possibility of local rules that have runaway backtracking caused by certain spam content. I'd strongly suggest taking this to the users list and hashing out whether or not it really is the number of words in the messages that's causing the problem.
I've made some observations and I've got a sample of a problematic mail (this mail is ham). When I invoke the spam test using spamassassin<email it takes about 2 minutes to scan, which is acceptable to me. The problem appears when I run the test using spamc (this is also how exim communicates with spamd): spamc quits after 10 minutes without a report. Spamd loads the CPU for 2 minutes, then idles. An strace of spamc hangs at this point:
shutdown(3, 1 /* send */) = 0
rt_sigaction(SIGALRM, {0x19375d90, [], 0}, {SIG_DFL, [], 0}, 8) = 0
alarm(600) = 0
recv(3,
Darxus, I prefer not to use an additional, external tool to scan emails. IMHO it should be done at the MTA level.
John, it looks like the problem isn't in Bayes, but I'm not sure what the root cause of the timeout is; this is the reason I'm not changing the bug summary. I'll attach a sample email. I have to obfuscate some information in the headers and body; it shouldn't make a difference for spamassassin.
Created attachment 4962 [details] sample email Please try to scan this mail using spamc client.
(In reply to comment #4)
> Created attachment 4962 [details]
> sample email
>
> Please try to scan this mail using spamc client.
Either I've missed something in this bug or: You mean scan a 1.5 MB message? And you're surprised it takes ages?
http://spamassassin.apache.org/full/3.3.x/doc/spamc.txt
-s *max_size*, --max-size=*max_size*
Set the maximum message size which will be sent to spamd -- any bigger than this threshold and the message will be returned unprocessed (default: 500 KB). If spamc gets handed a message bigger than this, it won't be passed to spamd. The maximum message size is 256 MB.
What Bayes backend are you using?
Yes, I know this message is huge. I have to scan such big emails (a second option is to use a different spamd for big emails, but that's not always possible). I'm using postgresql-9.0 as the backend for Bayes. Please note: spamd scans this email for ~2 minutes, then idles, but spamc doesn't return the results of the scan. Spamc times out after 10 minutes. This is the main problem for me now, because exim connects to spamd and waits for the report, doesn't receive the results, and then times out.
(In reply to comment #6)
> Yes, I know this message is huge. I have to scan such big emails (a second option
> is to use a different spamd for big emails, but that's not always possible). I'm
> using postgresql-9.0 as the backend for Bayes.
ok.. *possible* bottleneck #1: SQL
If this is not a central Bayes DB for many machines, pls try a local SDBM as the Bayes backend, for example:
bayes_path /var/bayes/bayes # EXAMPLE ONLY!!!!#
bayes_file_mode 0777 ##
bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
Make sure you don't auto-expire Bayes - this can also help with the bottleneck.
> Please note: spamd scans this email for ~2 minutes, then idles, but spamc
> doesn't return the results of the scan. Spamc times out after 10 minutes. This is
> the main problem for me now, because exim connects to spamd and waits for the
> report, doesn't receive the results, and then times out.
spamc cannot timeout after ten minutes when the default is 5
http://spamassassin.apache.org/full/3.3.x/doc/spamc.txt
-t *timeout*, --timeout=*timeout*
Set the timeout for spamc-to-spamd communications (default: 600, 0 disables). If spamd takes longer than this many seconds to reply to a message, spamc will abort the connection and treat this as a failure to connect; in other words the message will be returned unprocessed.
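Going only by the two spamc documentation excerpts quoted in this thread (-s, default 500 KB; -t, default 600 seconds, 0 disables), the two ways spamc hands a message back unprocessed can be modelled roughly as below. This is a sketch, not spamc's actual code: the function and constant names are made up for illustration, and reading "500 KB" as 500 * 1024 bytes is an assumption.

```python
# Sketch only: models the two spamc limits as described in the quoted docs.
# Assumptions: "500 KB" is read as 500 * 1024 bytes; all names are hypothetical.
DEFAULT_MAX_SIZE = 500 * 1024   # -s / --max-size default, in bytes (assumed)
DEFAULT_TIMEOUT = 600           # -t / --timeout default, in seconds


def spamc_outcome(message_bytes, scan_seconds,
                  max_size=DEFAULT_MAX_SIZE, timeout=DEFAULT_TIMEOUT):
    """Return what the quoted docs say happens to a message."""
    # Messages bigger than the threshold are never passed to spamd.
    if message_bytes > max_size:
        return "unprocessed: over max size"
    # A timeout of 0 disables the limit; otherwise a slow reply is a failure.
    if timeout != 0 and scan_seconds > timeout:
        return "unprocessed: timeout"
    return "scanned"


if __name__ == "__main__":
    # The sample in this bug is about 1.5 MB and takes about 2 minutes to scan.
    print(spamc_outcome(1_500_000, 120))
    print(spamc_outcome(1_500_000, 120, max_size=2 * 1024 * 1024))
```

Under these assumptions a 1.5 MB message is over the default size limit, so a larger -s value would be needed for spamc to send it to spamd at all, and a 2 minute scan is well inside the 600 second default.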
> spamc cannot timeout after ten minutes when the default is 5
> -t *timeout*, --timeout=*timeout*
> Set the timeout for spamc-to-spamd communications (default: 600, 0
Are you sure that default is 5 minutes?:)
What about spamd idling after 2 minutes? Spamd and postgresql (both are on the same host) do nothing: no CPU usage, no I/O. There is no open transaction or lock in the database during this time.
(In reply to comment #8)
> > spamc cannot timeout after ten minutes when the default is 5
> > -t *timeout*, --timeout=*timeout*
> > Set the timeout for spamc-to-spamd communications (default: 600, 0
>
> Are you sure that default is 5 minutes?:)
See SA docs.
> What about spamd idling after 2 minutes? Spamd and postgresql (both are on the
> same host) do nothing: no CPU usage, no I/O. There is no open transaction or
> lock in the database during this time.
- what happens when you pipe the msgs directly to spamc, not via Exim?
- pls show us the Exim ACL you're using for this
(In reply to comment #9)
> (In reply to comment #8)
> > > spamc cannot timeout after ten minutes when the default is 5
> > > -t *timeout*, --timeout=*timeout*
> > > Set the timeout for spamc-to-spamd communications (default: 600, 0
> >
> > Are you sure that default is 5 minutes?:)
>
> See SA docs.
doh
> > What about spamd idling after 2 minutes? Spamd and postgresql (both are on the
> > same host) do nothing: no CPU usage, no I/O. There is no open transaction or
> > lock in the database during this time.
>
> - what happens when you pipe the msgs directly to spamc, not via Exim?
>
> - pls show us the Exim ACL you're using for this
We still do not have a confirmation that the culprit is Bayes. Marcin, please re-read what John Hardin has asked:
> I find it really difficult to believe that Bayes would cause scanning timeouts
> absent other factors. Can you provide debugging traces that show the timeout
> and include whatever indicators lead you to believe that the number of words
> is the causative factor?
We haven't seen the debug trace yet. The most important part of it (assuming you are using SA 3.3.2, you haven't indicated a version in the bug report ticket) is a timing breakdown report. You can restrict debugging to timing only for starters, if you like:
spamassassin -t -D timing <test.msg
(the '-D timing' is applicable to spamd too, but let's start with the basics, as you are saying that spamassassin takes about two minutes as well).
On our host the regular regexp rules tests on this sample message take 20 seconds and bayes takes 1 second.
(In reply to comment #6)
> Please note: spamd scans this email for ~2 minutes, then idles, but spamc
> doesn't return the results of the scan. Spamc times out after 10 minutes. This is
> the main problem for me now, because exim connects to spamd and waits for the
> report, doesn't receive the results, and then times out.
(1) Can you scan this message using spamc against the same SA configuration except with bayes _disabled_ and report the timing?
(2) Do you have autolearn turned on? If you do, does turning it off affect the behaviour noticeably?
AXB: I don't think the exim ACL brings any new insight, because I can reproduce this problem using spamc alone. IMHO the question is: why does spamd idle and not return any response to the client?
Mark: Now I can see the problem isn't with Bayes. By the way, here is the timing with Bayes:
timing: total 112630 ms - init: 12826 (11.4%), parse: 586 (0.5%), extract_message_metadata: 23647 (21.0%), get_uri_detail_list: 2087 (1.9%), tests_pri_-1000: 66 (0.1%), compile_gen: 1178 (1.0%), compile_eval: 112 (0.1%), tests_pri_-950: 49 (0.0%), tests_pri_-900: 54 (0.0%), tests_pri_-400: 15186 (13.5%), check_bayes: 15111 (13.4%), tests_pri_0: 41549 (36.9%), tests_pri_500: 515 (0.5%)
and without Bayes:
timing: total 101894 ms - init: 10177 (10.0%), parse: 478 (0.5%), extract_message_metadata: 24270 (23.8%), get_uri_detail_list: 2280 (2.2%), tests_pri_-1000: 66 (0.1%), compile_gen: 1288 (1.3%), compile_eval: 94 (0.1%), tests_pri_-950: 51 (0.0%), tests_pri_-900: 52 (0.1%), tests_pri_-400: 47 (0.0%), tests_pri_0: 48233 (47.3%), tests_pri_500: 445 (0.4%)
Both were invoked with this command:
spamassassin -t -D timing -x -L <test.email
It looks like Bayes isn't the bottleneck.
Next chapter: I've run strace and tcpdump, and it's going to be very interesting. Spamd sends:
SPAMD/1.1 0 EX_OK
Spam: False ; 1.9 / 5.5
but the client doesn't receive the response (the client is on the same host and I'm stracing spamc). This is strange to me. It happens only for specific emails (probably the ones that take very long to scan).
My summary: Bayes performance isn't the problem in my case. The root problem is a "dying" TCP connection between spamc and spamd. I'll try to investigate this. (I have no such problem when spamc connects to spamd using a socket.) I think the bug should be closed with something like "wontfix".
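For anyone comparing breakdowns like the two above, a small script can rank the phases of a "timing:" line. This is only a sketch: the line format is inferred from the output pasted in this comment rather than from any SpamAssassin specification, and parse_timing and slowest are made-up helper names.

```python
import re

# The "with Bayes" timing line from the comment above, used as sample input.
TIMING_WITH_BAYES = (
    "timing: total 112630 ms - init: 12826 (11.4%), parse: 586 (0.5%), "
    "extract_message_metadata: 23647 (21.0%), get_uri_detail_list: 2087 (1.9%), "
    "tests_pri_-1000: 66 (0.1%), compile_gen: 1178 (1.0%), compile_eval: 112 (0.1%), "
    "tests_pri_-950: 49 (0.0%), tests_pri_-900: 54 (0.0%), "
    "tests_pri_-400: 15186 (13.5%), check_bayes: 15111 (13.4%), "
    "tests_pri_0: 41549 (36.9%), tests_pri_500: 515 (0.5%)"
)


def parse_timing(line):
    """Return (total_ms, {phase: ms}) for one 'timing:' line (format assumed)."""
    total = int(re.search(r"total (\d+) ms", line).group(1))
    # Phase names may contain '-' (e.g. tests_pri_-400), so allow it after
    # the first character.
    pairs = re.findall(r"([A-Za-z_][\w-]*): (\d+) \(", line)
    return total, {name: int(ms) for name, ms in pairs}


def slowest(phases, count=3):
    """Return the `count` slowest phases as (name, ms), slowest first."""
    return sorted(phases.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    total, phases = parse_timing(TIMING_WITH_BAYES)
    for name, ms in slowest(phases):
        print(f"{name}: {ms} ms ({100 * ms / total:.1f}% of total)")
```

Note that some entries appear to be nested (check_bayes accounts for nearly all of tests_pri_-400 here), so the per-phase values should not be summed and compared against the total.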
> Mark: Now I can see the problem isn't with Bayes. By the way, here is the timing with Bayes:
> timing: total 112630 ms - init: 12826 (11.4%), parse: 586 (0.5%),
> extract_message_metadata: 23647 (21.0%), get_uri_detail_list: 2087
> (1.9%), tests_pri_-1000: 66 (0.1%), compile_gen: 1178 (1.0%), compile_eval: 112
> (0.1%), tests_pri_-950: 49 (0.0%), tests_pri_-900: 54 (0.0%), tests_pri_-400:
> 15186 (13.5%), check_bayes: 15111 (13.4%), tests_pri_0: 41549 (36.9%),
> tests_pri_500: 515 (0.5%)
Thanks. This looks about right for a slow or heavily loaded machine. My numbers for the same message are smaller by a factor of about 4 to 7 each, but the ratio is similar.
> It looks like Bayes isn't the bottleneck.
My bayes takes about 1.1 s. Given an average factor of 5, yours is still comparatively about 3 times slower, but it is still in the ballpark. Agreed, your bayes isn't the bottleneck.
> The root problem is a "dying" TCP connection between spamc and spamd.
> I'll try to investigate this.
Start by enabling -D on spamd and capturing the log of one incident in a file (directly or through syslog), before diving into strace and tcpdump.
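As a complement to the spamd -D log, a tiny stand-alone client can help tell whether the reply is lost on the connection or inside spamc. The sketch below is not spamc: it assumes the plain-text spamd protocol (a CHECK request with a Content-length header, and a reply shaped like the "SPAMD/1.1 0 EX_OK" / "Spam: False ; 1.9 / 5.5" lines quoted earlier in this thread), and the host, port 783 and all function names are assumptions to adjust locally.

```python
import re
import socket


def build_check_request(message: bytes) -> bytes:
    """Build a CHECK request (protocol details assumed, see lead-in)."""
    header = f"CHECK SPAMC/1.2\r\nContent-length: {len(message)}\r\n\r\n"
    return header.encode("ascii") + message


def parse_reply(reply: bytes) -> dict:
    """Parse a reply shaped like the one quoted in this thread."""
    text = reply.decode("ascii", errors="replace")
    status = re.match(r"SPAMD/(\S+) (\d+) (\S+)", text)
    verdict = re.search(
        r"^Spam:\s*(True|False)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)",
        text, re.MULTILINE)
    if not status or not verdict:
        raise ValueError("not a complete spamd CHECK reply")
    return {
        "code": int(status.group(2)),
        "message": status.group(3),
        "is_spam": verdict.group(1) == "True",
        "score": float(verdict.group(2)),
        "threshold": float(verdict.group(3)),
    }


def check(message: bytes, host="127.0.0.1", port=783, timeout=600.0) -> dict:
    """Send one message to spamd over TCP; host and port are assumptions."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_check_request(message))
        # Half-close the sending side, like the shutdown(3, 1) in the strace.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)  # raises a timeout error after `timeout` s
            if not chunk:
                break
            chunks.append(chunk)
    return parse_reply(b"".join(chunks))
```

Calling check() with the contents of the sample message against the TCP listener, and timing it, would show whether a second, independent client also never sees the reply.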
Closing as WONTFIX: Bayes is not the bottleneck, so there is no need for additional Bayes configurability based on message size.