Bug 6633 - new functionality: limiting bayes engine to mail not larger than...
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: unspecified
Hardware: PC Linux
Importance: P2 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2011-07-12 11:42 UTC by Marcin
Modified: 2011-10-04 17:26 UTC (History)

Attachments: sample email (application/x-bzip), submitted by Marcin [NoCLA]

Description Marcin 2011-07-12 11:42:21 UTC
I'm getting a lot of spam. Part of it is sent via botnets, and those messages are rather small; they aren't a problem. I have a bigger problem with UCE mail sent via a real SMTP engine. It can be scored using the standard rules and my own additional rules. The problem appears when such an email contains a lot of words: the Bayes engine spends too much time on it. As a result the whole scan is marked as timed out and the mail is delivered to the user without any filtering.
Maybe it wouldn't be hard to implement options such as:
- bayes_max_email_size - only emails smaller than ... would be scored by Bayes
or
- bayes_max_words - only emails with a word count lower than ... would be scanned

The second option is better for me, but the first is also sufficient (and probably easier to implement).
Thanks.
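
The behaviour requested above could be sketched roughly as follows. This is only an illustration of the proposed semantics, not SpamAssassin code: `should_run_bayes`, `max_size` and `max_words` are hypothetical names standing in for the proposed bayes_max_email_size and bayes_max_words options, and the word count is a naive regex split rather than the Bayes tokenizer.

```python
import re
from typing import Optional


def should_run_bayes(message: str,
                     max_size: Optional[int] = None,
                     max_words: Optional[int] = None) -> bool:
    """Decide whether a message is small enough to be scored by Bayes.

    Hypothetical helper: None means "no limit" for either option.
    """
    size = len(message.encode("utf-8"))
    # Only emails smaller than max_size would be scored.
    if max_size is not None and size >= max_size:
        return False
    if max_words is not None:
        # Naive word count; the real tokenizer works differently.
        word_count = len(re.findall(r"\w+", message))
        if word_count >= max_words:
            return False
    return True
```

With no limits set the function always returns True, which matches the current behaviour of scoring every message.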
Comment 1 Darxus 2011-07-12 15:05:46 UTC
While that's not implemented, would you get better results if you completely disabled Bayes?  "use_bayes 0"

Ooh, you could set a size threshold so that spamassassin skips all emails as big as the ones causing your timeouts, and then, perhaps via procmail, run emails over that size through spamassassin again with a flag to disable Bayes (--cf='use_bayes 0').

Procmail rule to only run on stuff with a minimum size would look something like:

:0fw: spamassassin.lock
* > 65536
| /usr/bin/spamassassin
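
The size-based routing described in this comment can be sketched outside procmail as well. The helper below is hypothetical and only builds a command line; the 65536-byte threshold and the /usr/bin/spamassassin path are taken from the recipe above, and `--cf` is the spamassassin option for passing a single configuration line.

```python
from typing import List

# Threshold borrowed from the procmail recipe above (64 KiB).
SIZE_THRESHOLD = 65536


def spamassassin_argv(message: bytes,
                      threshold: int = SIZE_THRESHOLD) -> List[str]:
    """Return the command line for scanning `message`.

    Messages larger than `threshold` bytes are scanned with Bayes disabled.
    """
    argv = ["/usr/bin/spamassassin"]
    if len(message) > threshold:
        # --cf passes one configuration line; use_bayes 0 turns Bayes off.
        argv.append("--cf=use_bayes 0")
    return argv
```

The returned list could be handed to `subprocess.run(argv, input=message)`; that call is left out here so the sketch does not depend on a local SpamAssassin install.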
Comment 2 John Hardin 2011-07-12 15:25:57 UTC
I find it really difficult to believe that Bayes would cause scanning timeouts absent other factors. Can you provide debugging traces that show the timeout and include whatever indicators lead you to believe that the number of words is the causative factor? A sample of a spam that SA timed out on would be helpful.

Typically when things start timing out we first recommend looking at memory and number of child processes, as the most common cause for timeouts is overloading available memory and hitting swap. What are the resources available on the computer that's hosting SA?

There's also the possibility of local rules that have runaway backtracking caused by certain spam content.

I'd strongly suggest taking this to the users list and hashing out whether or not it really is the number of words in the messages that's causing the problem.
Comment 3 Marcin 2011-09-20 21:40:15 UTC
I've made some observations and I've got a sample of a problematic mail (this mail is ham). When I run the spam test using spamassassin <email it takes about 2 minutes to scan, which is acceptable to me. The problem appears when I run the test using spamc (this is also how exim communicates with spamd): spamc quits after 10 minutes without a report. Spamd loads the CPU for 2 minutes and then idles. An strace of spamc shows it hanging at this point:
shutdown(3, 1 /* send */)               = 0
rt_sigaction(SIGALRM, {0x19375d90, [], 0}, {SIG_DFL, [], 0}, 8) = 0
alarm(600)                              = 0
recv(3,

Darxus, I prefer not to use an additional, external tool to scan emails. IMHO it should be done at the MTA level.
John, it looks like the problem isn't in Bayes, but I'm not sure what the root cause of the timeout is; this is the reason I haven't changed the summary of the bug. I'll attach a sample email. I have to obfuscate some information in the headers and body, but it shouldn't make a difference for spamassassin.
Comment 4 Marcin 2011-09-21 08:03:55 UTC
Created attachment 4962 [details]
sample email

Please try to scan this mail using spamc client.
Comment 5 AXB 2011-09-21 09:03:39 UTC
(In reply to comment #4)
> Created attachment 4962 [details]
> sample email
> 
> Please try to scan this mail using spamc client.

Either I've missed something in this bug or:

You mean scan a 1.5 MB message?  And you're surprised it takes ages?

http://spamassassin.apache.org/full/3.3.x/doc/spamc.txt

-s *max_size*, --max-size=*max_size*
        Set the maximum message size which will be sent to spamd -- any
        bigger than this threshold and the message will be returned
        unprocessed (default: 500 KB). If spamc gets handed a message bigger
        than this, it won't be passed to spamd. The maximum message size is
        256 MB.
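
The size check described in this excerpt amounts to a single comparison. A minimal sketch, assuming "500 KB" means 500 * 1024 bytes (the excerpt does not spell out the unit) and using a hypothetical function name:

```python
# Assumption: the documented "500 KB" default is 500 * 1024 bytes.
DEFAULT_MAX_SIZE = 500 * 1024


def passed_to_spamd(message_size: int,
                    max_size: int = DEFAULT_MAX_SIZE) -> bool:
    """True if a message of `message_size` bytes would be sent to spamd.

    Anything bigger than the threshold is returned unprocessed.
    """
    return message_size <= max_size
```

Under this reading a 1.5 MB message is only scanned if the limit has been raised with -s.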

What Bayes backend are you using?
Comment 6 Marcin 2011-09-21 09:29:30 UTC
Yes, I know this message is huge. I have to scan such big emails (the second option is to use a different spamd for big emails, but that's not always possible). I'm using postgresql-9.0 as the backend for Bayes.
Please notice: spamd scans this email for ~2 minutes and then idles, but spamc doesn't return the results of the scan. Spamc times out after 10 minutes. This is the main problem for me now, because exim connects to spamd and waits for the report, doesn't receive the results, and then times out.
Comment 7 AXB 2011-09-21 09:44:54 UTC
(In reply to comment #6)
> Yes, I know this message is huge. I have to scan such big emails (the second
> option is to use a different spamd for big emails, but that's not always
> possible). I'm using postgresql-9.0 as the backend for Bayes.

ok.. *possible* bottleneck #1 : SQL
if this is not a central Bayes DB for many machines, pls try local SDBM as the Bayes backend

for example:

bayes_path /var/bayes/bayes
# EXAMPLE ONLY!!!!#
bayes_file_mode 0777 
##
bayes_store_module           Mail::SpamAssassin::BayesStore::SDBM

make sure you don't auto expire bayes - this can also help the bottleneck.


> Please notice: spamd scans this email for ~2 minutes and then idles, but spamc
> doesn't return the results of the scan. Spamc times out after 10 minutes. This
> is the main problem for me now, because exim connects to spamd and waits for the
> report, doesn't receive the results, and then times out.

spamc cannot timeout after ten minutes when the default is 5

http://spamassassin.apache.org/full/3.3.x/doc/spamc.txt

-t *timeout*, --timeout=*timeout*
        Set the timeout for spamc-to-spamd communications (default: 600, 0
        disables). If spamd takes longer than this many seconds to reply to
        a message, spamc will abort the connection and treat this as a
        failure to connect; in other words the message will be returned
        unprocessed.
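
The timeout semantics quoted here can be illustrated with plain sockets. This is a sketch of the general pattern (wait for a reply, give up after N seconds), not spamc's actual implementation; `read_reply` is a hypothetical helper, and `timeout=None` stands in for the documented "0 disables".

```python
import socket
from typing import Optional


def read_reply(sock: socket.socket,
               timeout: Optional[float]) -> Optional[bytes]:
    """Wait for a reply on `sock`.

    Returns the received bytes, or None if nothing arrives within
    `timeout` seconds. timeout=None blocks until data arrives.
    """
    sock.settimeout(timeout)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # The peer stayed silent: the caller would treat this as a
        # failed scan and return the message unprocessed.
        return None
```

A quick way to try it is `client, server = socket.socketpair()`: as long as `server` sends nothing, `read_reply(client, 0.2)` returns None, which is the situation the reporter's spamc ends up in after 600 seconds.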
Comment 8 Marcin 2011-09-21 10:10:20 UTC
> spamc cannot timeout after ten minutes when the default is 5
> -t *timeout*, --timeout=*timeout*
>         Set the timeout for spamc-to-spamd communications (default: 600, 0

Are you sure that default is 5 minutes?:)
What about spamd idling after 2 minutes? Spamd and postgresql (both on the same host) do nothing: no CPU usage, no I/O. There is no open transaction or lock in the database during this time.
Comment 9 AXB 2011-09-21 10:14:31 UTC
(In reply to comment #8)
> > spamc cannot timeout after ten minutes when the default is 5
> > -t *timeout*, --timeout=*timeout*
> >         Set the timeout for spamc-to-spamd communications (default: 600, 0
> 
> Are you sure that default is 5 minutes?:)

See SA docs.

> What about spamd idling after 2 minutes? Spamd and postgresql (both on the
> same host) do nothing: no CPU usage, no I/O. There is no open transaction or
> lock in the database during this time.

- what happens when you pipe the msgs directly to spamc, not via Exim?

- pls show us the Exim ACL you're using for this
Comment 10 AXB 2011-09-21 10:25:19 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > > spamc cannot timeout after ten minutes when the default is 5
> > > -t *timeout*, --timeout=*timeout*
> > >         Set the timeout for spamc-to-spamd communications (default: 600, 0
> > 
> > Are you sure that default is 5 minutes?:)
> 
> See SA docs.

doh
> 
> > What about spamd idling after 2 minutes? Spamd and postgresql (both on the
> > same host) do nothing: no CPU usage, no I/O. There is no open transaction or
> > lock in the database during this time.
> 
> - what happens when you pipe the msgs directly to spamc, not via Exim?
> 
> - pls show us the Exim ACL you're using for this
Comment 11 Mark Martinec 2011-09-21 10:36:38 UTC
We still do not have a confirmation that the culprit is Bayes.
Marcin, please re-read what John Hardin has asked:

> I find it really difficult to believe that Bayes would cause scanning timeouts
> absent other factors. Can you provide debugging traces that show the timeout
> and include whatever indicators lead you to believe that the number of words
> is the causative factor?

We haven't seen the debug trace yet. The most important part of it
(assuming you are using SA 3.3.2, you haven't indicated a version in
the bug report ticket) is a timing breakdown report. You can restrict
debugging to timing only for starters, if you like:

  spamassassin -t -D timing <test.msg

(the '-D timing' is applicable to spamd too, but let's start with the basics,
as you are saying that spamassassin takes about two minutes as well).

On our host the regular regexp rules tests on this sample message take
20 seconds and bayes takes 1 second.
Comment 12 John Hardin 2011-09-21 13:17:01 UTC
(In reply to comment #6)
> Please notice: spamd scans this email for ~2 minutes and then idles, but spamc
> doesn't return the results of the scan. Spamc times out after 10 minutes. This
> is the main problem for me now, because exim connects to spamd and waits for the
> report, doesn't receive the results, and then times out.

(1) Can you scan this message using spamc against the same SA configuration except with bayes _disabled_ and report the timing?

(2) Do you have autolearn turned on? If you do, does turning it off affect the behaviour noticeably?
Comment 13 Marcin 2011-09-21 15:10:56 UTC
AXB: I think the exim ACL won't bring any new ideas, because I can reproduce this problem using spamc alone. IMHO the problem is: why does spamd idle and not return any response to the client?
Mark: Now I can see the problem isn't with Bayes. By the way, here is the timing with Bayes:
timing: total 112630 ms - init: 12826 (11.4%), parse: 586 (0.5%), extract_message_metadata: 23647 (21.0%), get_uri_detail_list: 2087 (1.9%), tests_pri_-1000: 66 (0.1%), compile_gen: 1178 (1.0%), compile_eval: 112 (0.1%), tests_pri_-950: 49 (0.0%), tests_pri_-900: 54 (0.0%), tests_pri_-400: 15186 (13.5%), check_bayes: 15111 (13.4%), tests_pri_0: 41549 (36.9%), tests_pri_500: 515 (0.5%)

and without Bayes:
timing: total 101894 ms - init: 10177 (10.0%), parse: 478 (0.5%), extract_message_metadata: 24270 (23.8%), get_uri_detail_list: 2280 (2.2%), tests_pri_-1000: 66 (0.1%), compile_gen: 1288 (1.3%), compile_eval: 94 (0.1%), tests_pri_-950: 51 (0.0%), tests_pri_-900: 52 (0.1%), tests_pri_-400: 47 (0.0%), tests_pri_0: 48233 (47.3%), tests_pri_500: 445 (0.4%)

Both were invoked with this command: spamassassin -t -D timing -x -L <test.email
It looks like Bayes isn't the bottleneck.
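
For anyone comparing such timing lines, a small parser makes the slowest phase easy to spot. This is a sketch that assumes each entry has the shape "name: milliseconds (percent%)" as in the lines above; `parse_timing` is a hypothetical helper, and the sample string is the without-Bayes line from this comment.

```python
import re
from typing import Dict

# One "name: milliseconds (percent%)" entry of a -D timing line.
PHASE_RE = re.compile(r"([\w-]+): (\d+) \(\d+(?:\.\d+)?%\)")


def parse_timing(line: str) -> Dict[str, int]:
    """Map each phase name in a timing line to its duration in ms."""
    return {name: int(ms) for name, ms in PHASE_RE.findall(line)}


WITHOUT_BAYES = (
    "timing: total 101894 ms - init: 10177 (10.0%), parse: 478 (0.5%), "
    "extract_message_metadata: 24270 (23.8%), get_uri_detail_list: 2280 (2.2%), "
    "tests_pri_-1000: 66 (0.1%), compile_gen: 1288 (1.3%), "
    "compile_eval: 94 (0.1%), tests_pri_-950: 51 (0.0%), "
    "tests_pri_-900: 52 (0.1%), tests_pri_-400: 47 (0.0%), "
    "tests_pri_0: 48233 (47.3%), tests_pri_500: 445 (0.4%)"
)

phases = parse_timing(WITHOUT_BAYES)
slowest = max(phases, key=phases.get)
print(slowest, phases[slowest])  # tests_pri_0 48233
```

In both runs the regular priority-0 rule tests dominate, followed by extract_message_metadata.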

Next chapter: I've run strace and tcpdump, and it's getting very interesting. Spamd sends:
SPAMD/1.1 0 EX_OK
Spam: False ; 1.9 / 5.5
but the client doesn't receive the response (the client is on the same host and I'm stracing spamc). This is strange to me. It happens only for specific emails (probably the ones that take very long to scan).
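
The two response lines shown above are easy to check mechanically when reading a tcpdump capture. The parser below is a minimal sketch that handles only this status line and "Spam:" header as they appear here; `SpamdReply` and `parse_reply` are hypothetical names, and other headers of the protocol are ignored.

```python
import re
from typing import NamedTuple


class SpamdReply(NamedTuple):
    code: int         # numeric status, 0 in the capture above
    message: str      # status text, e.g. EX_OK
    is_spam: bool
    score: float
    threshold: float


STATUS_RE = re.compile(r"SPAMD/\d+\.\d+ (\d+) (\S+)")
SPAM_RE = re.compile(
    r"Spam: (True|False) ; (-?\d+(?:\.\d+)?) / (-?\d+(?:\.\d+)?)")


def parse_reply(text: str) -> SpamdReply:
    """Parse the status line and the Spam: header of a spamd reply."""
    lines = text.splitlines() or [""]
    status = STATUS_RE.match(lines[0])
    if status is None:
        raise ValueError("not a spamd status line: %r" % lines[0])
    for line in lines[1:]:
        verdict = SPAM_RE.match(line)
        if verdict:
            return SpamdReply(int(status.group(1)), status.group(2),
                              verdict.group(1) == "True",
                              float(verdict.group(2)),
                              float(verdict.group(3)))
    raise ValueError("no Spam: header in reply")
```

Feeding it the captured reply gives code 0, EX_OK, not spam, score 1.9 against a threshold of 5.5, so spamd itself finished the scan normally.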

My summary: Bayes performance isn't the problem in my case. The root problem is a "dying" TCP connection between spamc and spamd. I'll try to investigate this. (I have no such problem when spamc connects to spamd using a socket.)
I think the bug should be closed with something like "wontfix".
Comment 14 Mark Martinec 2011-09-21 16:03:34 UTC
> Mark: Now I can see the problem isn't with Bayes. By the way, here is the
> timing with Bayes:
>
> timing: total 112630 ms - init: 12826 (11.4%), parse: 586 (0.5%),
> extract_message_metadata: 23647 (21.0%), get_uri_detail_list: 2087
> (1.9%), tests_pri_-1000: 66 (0.1%), compile_gen: 1178 (1.0%), compile_eval: 112
> (0.1%), tests_pri_-950: 49 (0.0%), tests_pri_-900: 54 (0.0%), tests_pri_-400:
> 15186 (13.5%), check_bayes: 15111 (13.4%), tests_pri_0: 41549 (36.9%),
> tests_pri_500: 515 (0.5%)

Thanks. This looks about right for a slow or heavily loaded machine.
My numbers for the same message are smaller by a factor of about
4 to 7 each, but the ratio is similar.

> It looks like Bayes isn't the bottleneck.

My Bayes takes about 1.1 s.  Assuming an average factor of 5, yours is
still comparatively about 3 times slower, but it is still in the ballpark.
Agreed, your Bayes isn't the bottleneck.
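
The "about 3 times slower" estimate can be reproduced from the numbers in the thread. A worked sketch, using check_bayes from the reporter's timing line, the 1.1 s figure above, and the assumed average machine factor of 5; the variable names are only for illustration.

```python
reporter_bayes_ms = 15111   # check_bayes from the reporter's timing line
reference_bayes_ms = 1100   # "about 1.1 s" on the faster host
machine_factor = 5          # assumed average slowdown of the reporter's host

# What Bayes would be expected to cost on the slower machine.
expected_ms = reference_bayes_ms * machine_factor

# How much slower the reporter's Bayes is than that expectation.
relative_slowdown = reporter_bayes_ms / expected_ms
print(expected_ms, round(relative_slowdown, 2))  # 5500 2.75
```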

> The root problem is a "dying" TCP connection between spamc and spamd.
> I'll try to investigate this.

Start by enabling -D on spamd and capturing the log of one incident
to a file (directly or through syslog), before diving into strace
and tcpdump.
Comment 15 Mark Martinec 2011-10-04 17:26:11 UTC
Closing as WONTFIX - Bayes is not the bottleneck, so there is no need
for additional Bayes configurability based on message size.