Bug 6735 - sa-learn and message max size setting
Status: RESOLVED DUPLICATE of bug 6811
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2011-12-21 22:22 UTC by jonas@vrt.dk
Modified: 2015-08-02 21:03 UTC

Description jonas@vrt.dk 2011-12-21 22:22:18 UTC
I run a mail scanning infrastructure based on MailScanner ( http://mailscanner.info/ ). This bug, however, is not MailScanner-specific.

MailScanner uses SpamAssassin directly via its Perl module, so spamc/spamd is not involved, and this is where the problem arises.

When you run sa-learn in such an environment, there is a hardcoded limit on the size of spam/ham messages it will learn tokens from. I'm not sure precisely what the limit is (somebody on the mailing list said 256 KB).

If I try to train my Bayes database with sa-learn on 1 MB or 2 MB spam mails, it doesn't work: no tokens are learned.

On the mailing list I was told that I could change the following lines in sa-learn:

my $iter = new Mail::SpamAssassin::ArchiveIterator(
  {
    'opt_all'       => 0,    # skip messages over 250k
    'opt_want_date' => 0,
  }
);

If you change opt_all to 1 instead of the default 0, it works, and even big spam mails can be learned. With that change I have a temporary fix and can train my Bayes database properly, even on the abnormally big spam mails I get from time to time.
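
Before editing sa-learn, it can help to know which messages would be affected. The following is a minimal sketch, not part of SpamAssassin: the helper name `oversized_messages` is hypothetical, and the cutoff is assumed to be 256 KB (256 * 1024 bytes), which is only approximate since the thread quotes both "250k" and "256kbyte".

```python
import os

# Assumption: the cutoff is 256 KB, taken here as 256 * 1024 bytes.
# The thread quotes both "250k" and "256kbyte", so treat this as approximate.
ASSUMED_LIMIT_BYTES = 256 * 1024


def oversized_messages(directory, limit=ASSUMED_LIMIT_BYTES):
    """Return sorted paths of regular files in `directory` larger than `limit` bytes."""
    found = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Only plain files count; subdirectories are ignored in this sketch.
        if os.path.isfile(path) and os.path.getsize(path) > limit:
            found.append(path)
    return sorted(found)
```

Any file this reports is a candidate for being silently skipped during training while the size limit is in effect.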

But as was mentioned on the mailing list, that fix is very obscure and more or less impossible for a normal user to find.

So I suggest adding a command-line parameter or config file option so the limit is configurable in a simple and straightforward way.

As far as I can tell the case is the same for all 3.x versions, so I'm marking the version number as unspecified.

I hope the developers agree this should be changed.

Best regards

Jonas Akrouh Larsen
Comment 1 Dave Pooser 2011-12-21 22:37:23 UTC
I would propose multiple choices, rather than the current 250KB/unlimited choice. Perhaps 250KB/1MB/4MB/unlimited?
Comment 2 Kevin A. McGrail 2011-12-21 22:40:10 UTC
(In reply to comment #1)
> I would propose multiple choices, rather than the current 250KB/unlimited
> choice. Perhaps 250KB/1MB/4MB/unlimited?

My thought is that it will simply be a configurable option in the API, defaulting to 256 KB as it does now, that can be disabled or set to anything.

Then MailScanner can implement whatever limit they want.
Comment 3 Michael Scheidell 2011-12-21 22:59:08 UTC
amavisd-new will scan the first (x) bytes. Rather than totally ignoring a large email, it will at least look at the top (x) bytes.

that would be an option also.
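
The "look at the top (x) bytes" idea can be sketched as below. This is only an illustration of truncating instead of skipping; it is not how amavisd-new or SpamAssassin implement it, and the helper name `read_head` is hypothetical.

```python
def read_head(path, max_bytes):
    """Return at most the first `max_bytes` bytes of the file at `path`."""
    # Binary mode, so the byte count is exact regardless of the message encoding.
    with open(path, "rb") as handle:
        return handle.read(max_bytes)
```

A message smaller than `max_bytes` comes back whole, so small mail is handled exactly as before.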
Comment 4 John Hardin 2011-12-22 01:14:55 UTC
Why not implement the same command-line parameter that spamc uses?

   -s max_size, --max-size=max_size
Comment 5 AXB 2011-12-22 07:35:55 UTC
(In reply to comment #4)
> Why not implement the same command-line parameter that spamc uses?
> 
>    -s max_size, --max-size=max_size

That I'd +1
default to what it is, switch for anything else
Comment 6 Benny Pedersen 2011-12-22 11:41:43 UTC
spamc is not sa-learn ?
Comment 7 AXB 2011-12-22 11:44:50 UTC
(In reply to comment #6)
> spamc is not sa-learn ?

think consistency
Comment 8 Benny Pedersen 2011-12-22 11:50:30 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > spamc is not sa-learn ?
> think consistency

It's just that it should be at the API level, not on the command line, IMHO. I was not totally awake before :-)
Comment 9 jonas@vrt.dk 2011-12-22 13:16:19 UTC
(In reply to comment #3)
> amavisd-new will scan the first (x).  rather than totally ignoring a large
> email, it will at least look at the top (x) bytes.
> 
> that would be an option also.

Hi Michael

You are misunderstanding the question:

It has nothing to do with scanning emails; MailScanner has settings for that and it works fine with SpamAssassin.

This is about using sa-learn, which is more or less separate from MailScanner. You simply run the command sa-learn various_parameters spam_mail and it learns tokens from that mail, except that for a bigger email there is no way to tell it to learn from it, besides the way I posted about originally.

Hope that clears up the problem.
Comment 10 jonas@vrt.dk 2011-12-22 13:17:09 UTC
(In reply to comment #4)
> Why not implement the same command-line parameter that spamc uses?
> 
>    -s max_size, --max-size=max_size

That would work perfectly fine for me, and solve the problem entirely. At least as far as I can tell as a non-dev user.
Comment 11 Kevin A. McGrail 2011-12-22 13:56:57 UTC
(In reply to comment #10)
> (In reply to comment #4)
> > Why not implement the same command-line parameter that spamc uses?
> > 
> >    -s max_size, --max-size=max_size
> 
> That would work perfectly fine for me, and solve the problem entirely. At least
> as far as I can tell as a non-dev user.

Is sa-learn called from a system/fork with MailScanner?  Or is it using the API?
Comment 12 jonas@vrt.dk 2011-12-23 13:29:10 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #4)
> > > Why not implement the same command-line parameter that spamc uses?
> > > 
> > >    -s max_size, --max-size=max_size
> > 
> > That would work perfectly fine for me, and solve the problem entirely. At least
> > as far as I can tell as a non-dev user.
> 
> Is sa-learn called from a system/fork with MailScanner?  Or is it using the
> API?

I do not believe MailScanner itself calls sa-learn in any way. But MailWatch (a web-based frontend for MailScanner) does: it calls the sa-learn script from its directory (/usr/local/bin/ on my Debian boxes), and that's the only interaction as far as I know.
Comment 13 Dan 2015-08-02 15:22:58 UTC
This has not been updated in several years, however I came across the report when trying to deal with exactly the same issue.  My version is:

sa-learn -V
SpamAssassin version 3.4.1

A review of the script shows that it already has a --max-size parameter.

Therefore, I'd recommend that this bug be closed.
Comment 14 Mark Martinec 2015-08-02 21:03:52 UTC
> This has not been updated in several years, however I came across the report
> when trying to deal with exactly the same issue.  My version is:
> SpamAssassin version 3.4.1
> A review of the script shows that it already has a --max-size parameter.
> 
> Therefore, I'd recommend that this bug be closed.

Indeed the sa-learn option was implemented in version 3.4:

  --max-size <b>  Skip messages larger than b bytes;
                  defaults to 256 KB, 0 implies no limit
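
The semantics described by that help text can be modelled roughly as follows. This is a sketch of the documented behaviour only, not the sa-learn source; the function name `should_skip` is hypothetical, and "256 KB" is assumed to mean 256 * 1024 bytes.

```python
# Assumption: "256 KB" in the help text means 256 * 1024 = 262144 bytes.
DEFAULT_MAX_SIZE = 256 * 1024


def should_skip(message_size, max_size=None):
    """Return True if a message of `message_size` bytes would be skipped.

    max_size=None models the option being omitted (the default applies);
    max_size=0 models "--max-size 0", meaning no limit.
    """
    if max_size is None:
        max_size = DEFAULT_MAX_SIZE
    if max_size == 0:
        return False
    # "Skip messages larger than b bytes": strictly greater than the limit.
    return message_size > max_size
```

Under this reading, a 2 MB spam mail is skipped by default but is learned when sa-learn is invoked with --max-size 0 or with a limit above the message size.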

This is a duplicate of Bug 6811, closing.

*** This bug has been marked as a duplicate of bug 6811 ***