SA Bugzilla – Bug 6735
sa-learn and message max size setting
Last modified: 2015-08-02 21:03:52 UTC
I run a mail scanning infrastructure based on MailScanner ( http://mailscanner.info/ ). This bug is not MailScanner-specific, but MailScanner uses SpamAssassin directly via its Perl module/plugin, so spamc/spamd is not involved, and that is where the problem arises.

When you run sa-learn in such an environment, there is a hardcoded limit on the size of the spam/ham messages you can learn tokens from. I'm not sure what the limit is precisely (somebody on the mailing list said 256 KB). If I try to train my Bayes database with sa-learn on 1 MB or 2 MB spam mails, it doesn't work: no tokens are learned.

On the mailing list I was told that I could change the following lines in sa-learn:

my $iter = new Mail::SpamAssassin::ArchiveIterator(
  {
    'opt_all' => 0,    # skip messages over 250k
    'opt_want_date' => 0,
  }
);

If you change opt_all to 1 instead of the default 0, it works, and even big spam mails can be learned. So by changing that I got a temporary fix, and I can train my Bayes database properly even on the abnormally big spam mails I get from time to time.

But as was mentioned on the mailing list, that fix is very obscure and more or less impossible for a normal user to find. So I suggest that a parameter or config file option be added so the limit is configurable in a simple and straightforward way.

As far as I can tell the case is the same for all 3.x versions, so I'm marking the version number as unspecified. I hope the developers agree this should be changed.

Best regards
Jonas Akrouh Larsen
I would propose multiple choices, rather than the current 250KB/unlimited choice. Perhaps 250KB/1MB/4MB/unlimited?
(In reply to comment #1)
> I would propose multiple choices, rather than the current 250KB/unlimited
> choice. Perhaps 250KB/1MB/4MB/unlimited?
My thought is that it will simply be a configurable option in the API, defaulting to 256 KB as it does now, that can be disabled or set to anything. Then MailScanner can implement whatever limit they want.
amavisd-new will scan the first (x) bytes: rather than totally ignoring a large email, it will at least look at the top (x) bytes. That would be an option also.
Why not implement the same command-line parameter that spamc uses? -s max_size, --max-size=max_size
(In reply to comment #4)
> Why not implement the same command-line parameter that spamc uses?
>
> -s max_size, --max-size=max_size
That I'd +1: default to what it is now, with a switch for anything else.
spamc is not sa-learn ?
(In reply to comment #6)
> spamc is not sa-learn ?
Think consistency.
(In reply to comment #7)
> (In reply to comment #6)
> > spamc is not sa-learn ?
> think consistency
It's just that it should be at the API level, not on the command line, IMHO. I was not totally awake before :-)
(In reply to comment #3)
> amavisd-new will scan the first (x) bytes: rather than totally ignoring a large
> email, it will at least look at the top (x) bytes.
>
> That would be an option also.
Hi Michael, you are misunderstanding the question. It has nothing to do with scanning emails; MailScanner has settings for that, and it works fine with SpamAssassin. This is about using sa-learn, which is more or less separate from MailScanner. You simply run the command sa-learn various_parameters spam_mail and it learns tokens from that mail, except that if it's a bigger email, there is no way to tell it to learn from it besides the way I posted about originally. Hope that clears up the problem.
(In reply to comment #4)
> Why not implement the same command-line parameter that spamc uses?
>
> -s max_size, --max-size=max_size
That would work perfectly fine for me, and solve the problem entirely. At least as far as I can tell as a non-dev user.
(In reply to comment #10)
> (In reply to comment #4)
> > Why not implement the same command-line parameter that spamc uses?
> >
> > -s max_size, --max-size=max_size
>
> That would work perfectly fine for me, and solve the problem entirely. At least
> as far as I can tell as a non-dev user.
Is sa-learn called from a system/fork with MailScanner? Or is it using the API?
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #4)
> > > Why not implement the same command-line parameter that spamc uses?
> > >
> > > -s max_size, --max-size=max_size
> >
> > That would work perfectly fine for me, and solve the problem entirely. At least
> > as far as I can tell as a non-dev user.
>
> Is sa-learn called from a system/fork with MailScanner? Or is it using the
> API?
I do not believe MailScanner itself calls sa-learn in any way. But MailWatch (a web-based frontend for MailScanner) does call the sa-learn script from its directory (/usr/local/bin/ on my Debian boxes), and that's the only interaction as far as I know.
This has not been updated in several years; however, I came across the report when trying to deal with exactly the same issue. My version is:
sa-learn -V
SpamAssassin version 3.4.1
A review of the script shows that it already has a --max-size parameter. Therefore, I'd recommend that this bug be closed.
> This has not been updated in several years, however I came across the report
> when trying to deal with exactly the same issue. My version is:
> SpamAssassin version 3.4.1
> A review of the script shows that it already has a --max-size parameter.
>
> Therefore, I'd recommend that this bug be closed.
Indeed, the sa-learn option was implemented in version 3.4:
  --max-size <b>  Skip messages larger than b bytes; defaults to 256 KB, 0 implies no limit
This is a duplicate of Bug 6811, closing.
*** This bug has been marked as a duplicate of bug 6811 ***
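For anyone landing here with the same problem, the following is a minimal sketch of how the --max-size option described in this report might be used on 3.4 or later. The message path, the DRY_RUN switch and the run helper are assumptions made for illustration and are not part of sa-learn; only --spam and --max-size are sa-learn options.

```shell
#!/bin/sh
# Hypothetical path to an oversized spam sample; adjust for your system.
MSG="${MSG:-/tmp/large-spam.eml}"

# --max-size takes a byte count, so 4 MB is written out in bytes.
MAX_BYTES=$((4 * 1024 * 1024))

# Illustrative helper (not part of sa-learn): with DRY_RUN=1 (the default
# here) the command is only printed; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Learn the message as spam with the size limit raised to 4 MB.
run sa-learn --spam --max-size "$MAX_BYTES" "$MSG"

# Or disable the size limit entirely (0 implies no limit).
run sa-learn --spam --max-size 0 "$MSG"
```

On versions without --max-size, the only workaround mentioned in this report is the opt_all edit from the original description.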