Bug 5497 - Bayes has become unusable
Summary: Bayes has become unusable
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.2.0
Hardware: Other other
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-05 13:44 UTC by Ben Lentz
Modified: 2007-06-15 07:03 UTC (History)
Description Ben Lentz 2007-06-05 13:44:01 UTC
I've had great success with Bayes until very recently, coinciding with the
upgrade to 3.2.0. Our users have begun to see a completely unacceptable false
positive rate. No other changes have really been made, except for the upgrade to
3.2.0.

I have attempted to truncate my bayes database and re-learn it (it's being
stored in a MySQL database). Over only a weekend's time (and with no end-user
input, only auto-learning input), the false positives were back, nearly always
caused by a large amount of ham being tagged with BAYES_99.

I also dropped the tables and recreated them from the provided .sql script, but
it doesn't appear to have made a difference.

Because of the end-user impact at my site, I've resorted to doing:
score BAYES_50 0
score BAYES_60 0
score BAYES_80 0
score BAYES_95 0
score BAYES_99 0
to correct the false positive rate.

I'm finding that several of the false positives are from Outlook 200x clients
using Microsoft Word as a (crappy) HTML generator... and because both ham and
spam share the same MSWord body content, many autolearned tokens are being input
as spam. I'm not sure whether this specific platform has been a substantial cause
of my recent issues, just something I thought I'd mention.

Has any of the bayes code been modified with 3.2.0? Has the learning grace
period been shortened? Has the autolearn code been modified? Has the Internet
trended in such a way that bayes poisoning is more common than it was a few
weeks ago? If so, would you consider lowering the point values for these rules
since they're growing less effective?

Or am I totally nuts and it's only my site that's having a substantially harder
time with bayes accuracy?

Thanks.

P.S. I did mail the users mailing list before reporting this problem here.
Comment 1 Ben Lentz 2007-06-05 13:51:57 UTC
Also, I have a copy of my "poisoned" database that was generating many FPs after
this past weekend, if it's helpful. It's 2.2MB compressed.
Comment 2 Graham Murray 2007-06-05 14:31:58 UTC
I suspect that at least part of the reason is the change in the auto-learn ham
threshold, which means that 3.2 auto-learns a lot fewer messages as ham than
previous versions did. This means that if you are just auto-learning, and not
calling sa-learn to learn ham, a disproportionate number of spam tokens will be
learnt.
Comment 3 Rob Janssen 2007-06-06 14:09:47 UTC
It looks like auto-learn is broken.  Spamassassin 3.2.0 at our site has not
auto-learned any message as ham since May 21st, and continues to auto-learn spam.
Strange thing is that I have installed 3.2.0 on May 18th and it has apparently
continued to work OK for 3 days.

bayes_auto_learn_threshold_nonspam which defaults to 0.1 in the
AutoLearnThreshold plugin is set to -1.0 in 10_default_prefs.cf
I have set it back to 0.1 in our local conf but so far it does not seem to help.

This is destroying the bayes database!  What has happened?
Comment 4 Ben Lentz 2007-06-06 14:31:05 UTC
> I suspect that at least of the reason is the change in the auto-learn ham
> threshold which means that 3.2 auto-learns a lot fewer messages as ham than
> previous versions did. Which means that if you are just auto-learning, and not
> calling sa-learn to learn ham, a disproportionate number of spam tokens will be
> learnt.

I can definitely say that a vast majority (95%+) of our learning is from the
auto-learning system. It has been difficult to get our users to feed significant
amounts of both ham and spam to our sa-learn feedback mechanism. Any change in
the auto-learn thresholds is likely affecting the accuracy of our database.

Is this something simple I can change to revert back to the pre-3.2.0 setting?

What was the previous value for bayes_auto_learn_threshold_nonspam?

> USER OPTIONS
>        The following configuration settings are used to control auto-learn-
>        ing:
> 
>        bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
>            The score threshold below which a mail has to score, to be fed
>            into SpamAssassin's learning systems automatically as a non-spam
>            message.
> 
>        bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
>            The score threshold above which a mail has to score, to be fed
>            into SpamAssassin's learning systems automatically as a spam mes-
>            sage.
> 
>            Note: SpamAssassin requires at least 3 points from the header, and
>            3 points from the body to auto-learn as spam.  Therefore, the min-
>            imum working value for this option is 6.
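To answer the "something simple I can change" question, restoring the pre-3.2.0 value site-locally would be a local.cf fragment along these lines (a sketch only; 0.1 and 12.0 are the defaults quoted in the man-page excerpt above, not necessarily right for every site):

```
# local.cf -- restore the documented AutoLearnThreshold defaults
bayes_auto_learn                     1
bayes_auto_learn_threshold_nonspam   0.1
bayes_auto_learn_threshold_spam      12.0
```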
Comment 5 Rob Janssen 2007-06-06 14:43:36 UTC
For info: I have backed out the sa-update by running rm -r
/var/lib/spamassassin/3.002000 and restarting spamd, and now it auto-learns again.
So the problem probably is in the updated cf files loaded by sa-update.
Comment 6 Sidney Markowitz 2007-06-06 14:58:53 UTC
I can see the change in the ham autolearn threshold having this effect in
interaction with any rule that hits, for example, MS Word pasted into Outlook
Express mail, such as EXTRA_MPART_TYPE.

We tolerate a rule with as low a score as 1.0 even if it FPs with the
justification that it is not scored high enough to cause ham to FP as spam.

But here is an example of an unintended side effect. If the result is to cause
all of a certain class of mail to have its spam instances learned but none of
its ham instances learned, then to the degree that Bayes finds tokens that
indicate that class, they will be incorrectly learned as spam signs.

I don't have an immediate answer to this, but clearly a result that causes all
MS Word pasted into OE mail to be labeled as spam is incorrect.
Comment 7 Daryl C. W. O'Shea 2007-06-06 15:09:00 UTC
(In reply to comment #5)
> For info: I have backed out the sa-update by rm -r
> /var/lib/spamassassin/3.002000 and restarting spamd, and now it auto-learns again.
> So the problem probably is in the updated cf files loaded by sa-update.

The 3.2.0 release and all 3.2.0 updates have identical bayes_auto_learn* values.
 Can you identify what exactly you are seeing different?  Any useful debug output?

10_default_prefs.cf:bayes_auto_learn_threshold_nonspam  -1.0
10_default_prefs.cf:bayes_auto_learn_threshold_spam             12.0
10_default_prefs.cf:bayes_auto_learn                    1


Comment 8 Rob Janssen 2007-06-06 15:28:18 UTC
We have spamd scanning all incoming messages with a single bayes db.
What I see is not a single message after May 21st caused an autolearn=ham entry
in the mail log.  All have either autolearn=no or autolearn=spam, even when the
score is -2.5 or so.
No wonder that the bayes db only knows about spam and not about ham after some
time, and the bayes score creeps upward.

After I removed the update (as described), the first ham message that came in,
with a score of -2 due to AWL, immediately logged autolearn=ham.

It may well be that the thresholds have not changed, but it looks like the special
marking of scores to affect the bayes autolearn has changed in such a way that
no message is below the nonspam threshold anymore.
Comment 9 Sidney Markowitz 2007-06-06 15:35:40 UTC
Isn't it the case that AWL is not included in computing the score for the
autolearning threshold?

Can you run a few ham mails through spamassassin -t with the updates included
and see if there is some rule that consistently FPs with enough points to
preclude autolearning as ham?

Actually, I'm confused as to how an autolearn ham threshold of -1 can work at
all.  I would think that we don't have enough negative scoring rules to catch a
significant amount of ham at that threshold no matter what.
Comment 10 Rob Janssen 2007-06-06 15:47:52 UTC
About 5000 messages from very diverse sources were scanned during that interval,
and NONE of them were marked as autolearn=ham.  The first one after I undid the
update was OK.  I'd say it is not something common among the messages, but
something wrong in the update.

The value of the auto_learn level, that is compared with the thresholds
specified in the config, does not seem to be logged or displayed anywhere so it
is difficult to see what is really happening.
I tried to do a diff between base 3.2.0 and the update.  It looks like a big
change was made to the scoring so it is difficult to pinpoint where the problem is.
Comment 11 Sidney Markowitz 2007-06-06 15:57:37 UTC
> I'd say it is not something common among the messages,
> but something wrong in the update

What I am suggesting is that to find out what about the update is wrong, use
spamassassin with the -t option with the updates included to run a sample of the
ham messages that you would have expected to have been learnt. The -t option
shows you exactly which rules fired and how many points they added to or
subtracted from the score.

If you do that you may be able to see if some rule or rules from the update
consistently fire, preventing autolearning of ham.
Comment 12 Ben Lentz 2007-06-06 18:16:54 UTC
Sorry I missed this in a previous response:

> bayes_auto_learn_threshold_nonspam which defaults to 0.1 in the
> AutoLearnThreshold plugin is set to -1.0 in 10_default_prefs.cf
> I have set it back to 0.1 in our local conf but sofar it does not seem to help.

And I agree... when would a ham email ever score -1.0 under normal
conditions? Holy smokes! Even an SPF/DK/DKIM email is -0.0 for each test.
Without a BAYES_00 or HABEAS/HASHCASH rule, things are rarely below 0.0.

I can also confirm that I've been updating my 3.2.0 system regularly with
sa-update. So if this is indeed an issue introduced in a sa-update of re-scoring
rules (it seems it's been confirmed that the bayes stuff itself, at least on the
surface, has remained unchanged), then it's possible that I am exhibiting it.

Should I set bayes_auto_learn_threshold_nonspam "back" to 0.1, remove the
sa-update config files, disable my regular sa-updating, destroy my current bayes
database and start over?
Comment 13 Sidney Markowitz 2007-06-06 19:11:22 UTC
> Should I set bayes_auto_learn_threshold_nonspam "back" to 0.1,
> remove the sa-update config files, disable my regular sa-updating,
> destroy my current bayes database and start over?

I don't think that the auto_learn_threshold should ever be blindly set to an
arbitrary number. I might even argue that we should not try to provide a default
value for the autolearn threshold.

You can look over a large sample of ham and spam to see what makes sense as
thresholds at your site. Without having to carefully verify every mail you
probably can come up with a pretty good number for a score below which you can
be pretty certain that you hardly ever see spam and another number for a score
above which you never see ham. Then use those numbers.
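That threshold-picking procedure can be sketched in a few lines (a hypothetical helper for illustration, not part of SpamAssassin; the tolerance value and the "just above the highest ham" rule are assumptions):

```python
# Sketch: given scores from a labelled local sample, pick a ham
# threshold that almost no sampled spam scores under, and a spam
# threshold that no sampled ham scores over.

def pick_thresholds(ham_scores, spam_scores, spam_tolerance=0.001):
    spam_sorted = sorted(spam_scores)
    # ham threshold: the score at the spam_tolerance quantile of spam,
    # i.e. at most ~0.1% of sampled spam falls below it
    idx = max(0, int(len(spam_sorted) * spam_tolerance) - 1)
    ham_threshold = spam_sorted[idx] if spam_sorted else 0.1
    # spam threshold: just above the highest-scoring sampled ham
    spam_threshold = max(ham_scores) + 0.001 if ham_scores else 12.0
    return ham_threshold, spam_threshold
```

With a large enough sample the two numbers can be read off directly, without carefully verifying every mail, as the comment suggests.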

If the bug here is that an auto_learn_ham threshold of -1 makes no sense, then
the fix is to change the threshold to something that makes sense for your
installation, not to eliminate sa-update.

On the other hand, if this problem appears after an sa-update, I really would
like to see what you can learn about the effect of the sa-updated rules by
looking at the output of spamassassin -t as I mentioned in comment #11. There
could be something not quite right about some updated rule even if the overall
problem can be characterised as a bad auto_learn_ham threshold.

Comment 14 Daryl C. W. O'Shea 2007-06-06 19:22:36 UTC
The only changes in the current updates are (unless someone has changed a
sandbox rule that ships in the base release):

 - URIBL_BLACK and URIBL_GREY now work again
 - the new sandbox rules added, with scores in 72_scores.cf
 - some fixes to the comment blocks in some files
Comment 15 Ben Lentz 2007-06-06 20:06:11 UTC
I can say, with great certainty, that the default values provided in all
versions prior to 3.2.0 seemed to work great at my site without any tweaking or
heavy analysis.

I don't wish to change anything blindly, instead, I very specifically want the
functionality I had before the upgrade - back.

I am not asserting that sa-update was the cause, I was only offering it as a
relevant piece to my SA configuration.
Comment 16 Sidney Markowitz 2007-06-06 21:20:34 UTC
Ben, please, I understand that you would just like to revert to a time that this
problem did not exist, but it would help in moving forward to understand exactly
how things are going wrong.

One apparent problem is that as of version 3.2 the threshold for ham was set in
10_default_prefs.cf as -1.0, which is hard to imagine would trigger on much if
any ham.

Another apparent problem seems to have something to do with a change that was
installed by sa-update, as we see from comment #5, and that seems to happen even
with the threshold set to 0.1.

Rob Janssen and Ben Lentz, to figure out exactly what is going on I would like
to see the results of running spamassassin -t -D on a message that should be
autolearned as ham but is not, in a system that has been updated by sa-update.
Since it would have to be ham, pick something that does not have any private
information in it to make it as simple as possible to clean it of anything you do
not want published. Anything that is long should be attached to this bug as an
attachment in Bugzilla using the Create a New Attachment link, not pasted into a
comment.

Whether or not it makes sense to increase the ham autolearn threshold, comment
#5 hints at there being something else wrong that we need to fix, and to do that
we need some information.
Comment 17 Sidney Markowitz 2007-06-06 21:32:26 UTC
I want to reference bug 5257 here as related to this so we have it documented.
That's the bug that was the justification for changing the default ham
autolearning threshold to -1.0
Comment 18 Ben Lentz 2007-06-06 22:12:19 UTC
> I would like
> to see the results of running spamassassin -t -D on a message that should be
> autolearned as ham but is not, in a system that has been updated by sa-update.

Hey Sidney,
I completely understand that, too. Can you define for me what you mean by
"should be" autolearned as ham? By _current_ definition of the default SA 3.2
config, I would interpret "should be" as a message that otherwise scores at <=
-1.0 without bayes, which, even Justin admits in 5257, is 1.21% of ham. It might
take me a while to locate such a highly rare message...

Or, do you mean "should be" as in the _old_ definition of a ham? e.g. <= 0.1?
That, I imagine, would be much easier.

Thanks for the help thus far...
Comment 19 Sidney Markowitz 2007-06-06 22:37:57 UTC
> Can you define for me what you mean by
> "should be" autolearned as ham?

I think I'm guilty of conflating your comments with Rob Janssen's comments.

If the only thing that is the matter with your setup is that an autolearn
threshold for ham of -1.0 is too low to ever trigger, then all you need to do is
set it to a number that does work for you. Whether you determine that
number from past experience with version 3.1 and make it 0.1, or you actually
look at the range of scores in your ham and pick a number that will capture most
of the true ham with almost no false hits, either way will be better than what
you have now.

But if, like Rob, you find that after running sa-update even a ham autolearn
threshold of 0.1 never learns, then we need to find out why _that_ happens, and
that's what the test of spamassassin -t -D would be for. So for that test I
would want to see the results for something that should be learned as ham with
the higher threshold of 0.1 but is not.

I guess your next steps are simple. Change the threshold in your local.cf to
0.1, clear your Bayes database, and see if you start autolearning any ham. If it
works, you are done with this problem. If, like Rob, you don't start
autolearning ham, then run spamassassin -t -D on some ham that you are pretty
sure should have been autolearned and we'll try to figure out why.
Comment 20 Rob Janssen 2007-06-07 00:40:52 UTC
Maybe my conclusion that it is ONLY related to the update was a bit quick, but I
still cannot explain the course of events...:
- before installing 3.2.0 it worked fine.  messages were regularly autolearned
as ham, and most spam was autolearned as such.  of course there were also a lot
of autolearn=no messages.
- after installing 3.2.0 (and without touching the threshold) it looks like it
has worked for 3 days and then no more ham was learned.  it may be that after 3
days the sa-update was installed, but I think I ran that manually immediately
after install.  it could also be that there was a new update around the 21st.
- yesterday I backed out the update and it seemed to work, but now I see that
it has only autolearned two out of 100 messages, so there is probably still a problem.

Now I manually set the threshold to 0.1 and will watch it again for a day before
doing the sa-update again and see if that changes anything.

Sure, I tried some "spamassassin -t" but there is little that can be learned from
its output, because there is no indication which of the scores are actually
counted for the auto_learn and what the auto_learn value finally is.  There
seems to be a lot of magic in the code that decides whether to auto-learn or
not, and only the final decision is logged.
Comment 21 Sidney Markowitz 2007-06-07 00:53:16 UTC
> I tried some "spamassasin -t" but there is little that
> can be learned from its output

That's why you would also use -D, but that produces so much output that you
would not use that until you have determined that there is a bug to be found. So
first see what happens when you test carefully with different thresholds and
with and without the updates. If you can show that it is the update that makes
the difference even with the higher threshold, then running spamassassin -t -D
can be used to produce heaps of information that we can pore through to figure
out what is going on.
Comment 22 Justin Mason 2007-06-07 01:32:49 UTC
there's an assumption here that the Bayes subsystem cannot deal with training
databases heavily biased in one direction over the other.  However, I'm pretty
sure this is not correct.  The bayes algorithms take this into account, so it
should be compensated for just fine.  

Hence, the small number of ham hit with a -1.0 threshold (about 1.21% of ham
mail according to mass-checks, see bug 5257) should be no problem for it.  If
your site sees a different ham/spam ratio, however, you may want to make a
site-specific customisation to increase your autolearn-ham threshold...
Comment 23 Justin Mason 2007-06-07 01:36:59 UTC
> I don't think that the auto_learn_threshold should ever be blindly set to an
> arbitrary number. I might even argue that we should not try to provide a default
> value for the autolearn threshold.
> 
> You can look over a large sample of ham and spam to see what makes sense as
> thresholds at your site. Without having to carefully verify every mail you
> probably can come up with a pretty good number for a score below which you can
> be pretty certain that you hardly ever see spam and another number for a score
> above which you never see ham. Then use those numbers.

in passing, this was suggested ages ago in bug 1829 as an alternative
autolearning algorithm; unfortunately that bug got derailed by random
discussion, and was never implemented :(  it's a good idea.
Comment 24 Rob Janssen 2007-06-07 01:45:26 UTC
Here is an example of spamassassin -t -D output:

[12770] dbg: plugin:
Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH(0x8ee5b7c) implements
'autolearn_discriminator', priority 0
[12770] dbg: learn: auto-learn: currently using scoreset 3, recomputing score
based on scoreset 1
[12770] dbg: learn: auto-learn: message score: -2.43672972972973, computed score
for autolearn: 0.001
[12770] dbg: learn: auto-learn? ham=0.1, spam=6, body-points=0.001,
head-points=0.001, learned-points=-2.599
[12770] dbg: learn: auto-learn? yes, ham (0.001 < 0.1)
[12770] dbg: learn: initializing learner
[12770] dbg: learn: learning ham

Now it auto-learns, but clearly only because I now have changed the threshold.
The rules hit by this message are: AWL,BAYES_00,HTML_MESSAGE
This is the most common combination for our incoming ham mail.  HTML_MESSAGE
scores 0.001 and the other two are negative but apparently aren't counted.

It looks like the main cause of trouble is the negative nonspam threshold.  It
may be that the problem gets worse when I install the update again, because
there were a couple of new scores in there.  But first I want to observe this
setting for a day.
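The decision shown in that debug output can be paraphrased as a small sketch (a simplification of the AutoLearnThreshold behaviour as described in this thread and the quoted man page, not the plugin's actual code):

```python
def autolearn_decision(points, body_points, head_points,
                       ham_threshold=0.1, spam_threshold=12.0):
    """'points' is the score recomputed without Bayes/AWL, i.e. the
    'computed score for autolearn' in the debug output above."""
    if points < ham_threshold:
        return "ham"
    # per the quoted docs, learning as spam also needs at least
    # 3 header points and 3 body points
    if (points >= spam_threshold
            and body_points >= 3 and head_points >= 3):
        return "spam"
    return "no"
```

With the logged values (0.001 against a ham threshold of 0.1) this returns "ham", matching "auto-learn? yes, ham (0.001 < 0.1)"; with the shipped -1.0 threshold the same message would return "no".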
Comment 25 Sidney Markowitz 2007-06-07 01:47:37 UTC
> The bayes algorithms take this into account, so it
> should be compensated for just fine

It should compensate for different absolute numbers of ham vs spam in the
collection, but it can't compensate for a collection process that biases against
some class of ham. For example, consider that all Outlook Express mail that
contains embedded graphics in HTML as cid MIME objects triggers the
EXTRA_MPART_TYPE rule for 1.0 point. No ham that has that will be autolearned
and all high scoring spam that has that will. If there are any tokens that are
characteristic of that kind of mail, the effect will be to amplify the
EXTRA_MPART_TYPE FP from producing just 1 extra point to producing 1 plus a high
score from Bayes.

That's how I interpreted what is going on here. The summary describes that kind
of amplification of FPs on tokens found in MS Word generated HTML.
Comment 26 Rob Janssen 2007-06-07 01:58:10 UTC
(In reply to comment #13)

> You can look over a large sample of ham and spam to see what makes sense as
> thresholds at your site. Without having to carefully verify every mail you
> probably can come up with a pretty good number for a score below which you can
> be pretty certain that you hardly ever see spam and another number for a score
> above which you never see ham. Then use those numbers.

Is there a flag setting that allows you to do this?  I.e. something that I can
invoke with a directory full of ham and spam messages and prints a list of
auto_learn values of those messages?
I know how to list the final spamassassin scores of a set of messages, but that
is of little or no value in this case because the main negative scorers are not
counted for auto_learn.
Comment 27 Sidney Markowitz 2007-06-07 02:24:11 UTC
> that is of little or no value in this case because the main
> negative scorers are not counted for auto_learn

That's a good point. Perhaps we should add another word to the header template
for showing the autolearn score. That would be very easy to do.
Comment 28 Sidney Markowitz 2007-06-07 02:55:35 UTC
I created an RFE, bug 5502, and uploaded a patch there that adds a
_AUTOLEARNSCORE_ tag you can put in a header to see what score autolearn would
see in messages. The patch should apply cleanly to either trunk or the 3.2 branch.
Comment 29 Mark Martinec 2007-06-07 05:16:27 UTC
(In reply to comment #22)
> the small number of ham hit with a -1.0 threshold (about 1.21% of ham
> mail according to mass-checks, see bug 5257) should be no problem to it.

If the change of the threshold to -1.0 was intended to reduce the number
of ham messages fed into bayes, in my view it is the wrong solution to
the problem. Moving a threshold into the Gaussian periphery results in a
skewed/distorted view being presented to autolearning.

If the only intention is to reduce the amount of ham fed to autolearner,
some decimation can be used, like comparing a random generator value
to some threshold: say, 0.2 would keep every fifth ham on average
for autolearning and ignore the rest.
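That decimation idea is simple to sketch (illustrative only, not SpamAssassin code; a keep fraction of 0.2 learns roughly every fifth otherwise-eligible ham):

```python
import random

def should_learn_ham(keep_fraction=0.2, rng=random.random):
    # draw a uniform value in [0, 1); learn this ham only when the
    # draw falls below the keep fraction, so on average only that
    # fraction of otherwise-eligible ham is fed to the learner
    return rng() < keep_fraction
```

Unlike moving the threshold into the tail of the distribution, this reduces volume while keeping the learned ham an unbiased sample of all eligible ham.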
Comment 30 Ben Lentz 2007-06-07 05:18:28 UTC
> The summary describes that kind of amplification of FPs on tokens found in MS
Word generated HTML.

This was simply an observation that ham messages would contain similar tokens
to spam messages (I verified this by hand), and it should not be assumed that
this is the main cause of the FP amplification.

There were certainly hams scoring BAYES_99 encoded entirely in plain text, too.
Comment 31 Ben Lentz 2007-06-07 05:20:50 UTC
Just a newbie comment, but:

Shouldn't the goal be to autolearn ~80% of spam and also to autolearn ~80% of
ham, too? Why is it a good thing(tm) to target autolearning ~1% of ham? Just in
general?
Comment 32 Sidney Markowitz 2007-06-07 05:54:21 UTC
My interpretation of the comments in bug 5257 is that people were reporting
problems with a threshold of 0.1 because of too many low scoring spam being
incorrectly learned so the ham threshold was lowered to -1.0.

The comment there about it getting 1.21% of the ham was not to say that such a
small percentage is desirable, it meant that in our mass check corpus a
threshold of -1.0 did find 1.21% and that is enough for Bayes to function.

It sounds like at your site a -1.0 threshold gets much less than 1.21% of the
ham and it isn't enough for bayes to function.
Comment 33 Ben Lentz 2007-06-07 07:12:42 UTC
Don't these autolearn thresholds need to be _proportional_? Using your corpus as
a metric, if you're only autolearning 1.21% of ham, don't you want to adjust the
bayes_auto_learn_threshold_spam value to only autolearn 1.21% of spam, too?

Otherwise, what I'm seeing at my site (regardless of size, I believe) is that
the bayes data is much less accurate and is more easily influenced by bayes
poisoning (image spam + normal dictionary words, sentences, and paragraphs
hidden behind or below the image)... without a proportional ham differential.
Comment 34 Justin Mason 2007-06-07 07:25:02 UTC
perhaps the best fix, then, is to make autolearning as *spam* only happen if by
doing so, the database counts aren't rendered unbalanced?

e.g. if autolearn would cause one type to include 2x as many mails as the other
type, then skip it.
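A minimal sketch of that guard (hypothetical, not actual SpamAssassin code; it would consult the database's nspam/nham counts before learning and skip a spam autolearn that pushes the imbalance past 2x):

```python
def may_autolearn_spam(nspam, nham, max_ratio=2.0):
    """Allow learning one more spam only if the resulting spam count
    stays within max_ratio times the ham count (ham learning is left
    unconditional, per the suggestion above)."""
    # treat an empty ham class as 1 so a fresh database can bootstrap
    return (nspam + 1) <= max_ratio * max(nham, 1)
```

At the 16:1 spam-to-ham ratio reported later in this thread, such a guard would refuse further spam learns until more ham had been learned.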
Comment 35 Ben Lentz 2007-06-07 07:29:46 UTC
> perhaps the best fix, then, is to make autolearning as *spam* only happen if by
> doing so, the database counts aren't rendered unbalanced?

So, before doing an autolearn, it would check the counts in the bayes
database beforehand to make sure the additional learn wouldn't unbalance the
database? Sounds like a reasonable solution, although it'd be quite a
chunk of work...

Currently, my ratio is approximately 16:1 spam learned to ham learned.
Comment 36 Sidney Markowitz 2007-06-07 07:41:08 UTC
No, I think that the principle that the Bayes learner is relatively insensitive
to the absolute numbers of ham and spam is correct. The problem is that it is
sensitive to getting non-representative samples. To the degree that tokens in
the very lowest scorers are not representative of the tokens in all ham, the
learner will not be accurate.

Autolearning is a substitute for a mechanism in which all ham and all spam are
correctly learnt. The more you get away from that ideal by being conservative
with the threshold, the weaker it will make Bayes.
Comment 37 Mark Martinec 2007-06-07 07:46:57 UTC
> perhaps the best fix, then, is to make autolearning as *spam* only happen
> if by doing so, the database counts aren't rendered unbalanced?

I don't think this is a good idea, I hope there is a better solution,
like dropping old spam.

In my experience it is important that new spam samples are incorporated
into a bayes db soon, so that the whole system is able to react to
new spam profiles and techniques. If a slow trickle of ham will be
able to block learning of new spam, it would block such swift response.
Comment 38 Ben Lentz 2007-06-07 08:13:21 UTC
> No, I think that the principle that the Bayes learner is relatively
> insensitive to the absolute numbers of ham and spam is correct. The problem 
> is that it is sensitive to getting non-representative samples.

I think my experience is to the contrary. I believe my bayes database has taken 
a serious accuracy hit as the result of a lower number of autolearned hams from 
the threshold setting change introduced in 3.2.0. I'd say it's very sensitive, 
given that the characteristics of what I'm learning and what I'm not have
otherwise remained unchanged.
Comment 39 Rob Janssen 2007-06-07 10:03:50 UTC
(In reply to comment #32)
> My interpretation of the comments in bug 5257 is that people were reporting
> problems with a threshold of 0.1 because of too many low scoring spam being
> incorrectly learned so the ham threshold was lowered to -1.0.

It is not clear if the problem is that spam is incorrectly learned as ham, or
that there are just too many learning operations going on on a heavily loaded
system and it would be desirable to cut it down a bit.  When the latter is the
actual problem, I suggest the method from comment #29 to be used, not a change
of the threshold.

Problem is that in our case we apparently get no messages below auto_learn -1.0
at all.  Even at 0.1 there are many ham messages not handed to auto_learn
because it is so easy to get above 0.1 when AWL and BAYES_00 are not counted.
Rules like RDNS_NONE are firing half of the time here, HTML_MESSAGE almost
always.  This means that even a slight change in the scoring will easily disable
the autolearn=ham.  This probably explains the change in behavior when
installing the update to 3.2.0.
I will try the mentioned patch and see what a more reasonable value for the
threshold is in our case.

I can understand that you want to avoid feed-forward lockups by excluding the
score of BAYES_xx in the calculation, and to a lesser extent I can understand
the exclusion of AWL, but all together it makes the auto_learn quite fragile.

Something that also affects our Bayes DB is that we are a locally operating
company where 99+ % of all mail is in Dutch.  So the Bayes engine has learned
over time that Dutch=HAM and English=SPAM.  This normally works well, but when
someone sends a message from freemail providers that tag an English commercial
under each mail, and they send only an attachment with little body text, it is
scored at Bayes_80 or more, and lifted over our spam threshold by simple things
like omitting the subject.
And those messages are never learned as ham because those freemail providers
invariably score points in the "ignorance" and "HTML" categories.  So our Bayes
DB never learns that "Choose the right car based on your needs.  Check out
Yahoo! Autos new Car Finder tool." does not really mean the message is SPAM.
Comment 40 Mark Martinec 2007-06-07 10:56:17 UTC
> Problem is that in our case we apparently get no messages below
> auto_learn -1.0 at all.  Even at 0.1 there are many ham messages
> not handed to auto_learn because it is so easy to get above 0.1...

You are not alone. See recent topics on the mailing list:
- bayes autolearn - nonspam threshold
- Bayes problem: very large spam/ham ratio

From Fletcher Mattox:
  After years of stability, my bayes db is doing poorly.  When I first
noticed it, it was classifying lots of ham as BAYES_99; I cleared the db
and started over.  Now it finds *very* few ham. [...]
  The first three lines are the only autolearned ham. That's it. All day.
There were thousands of hams which were not learned. Notice the quantum
leap between -4.299 and 12 in the score used by auto-learn  [...]
  I have restored the original 0.1 and bayes has started working again
for me. I am now auto-learning much more ham than I was yesterday. [...]

From Duane Hill:
  Therefore, anyone using SA 3.2 who was on a prior version without
the bayes_auto_learn_threshold_nonspam setting will ultimately
have to set the value now to 0.1.

I agree the -1 is unrealistically low. I had to force it back to 0.1 too.
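
For anyone bitten by this before 3.2.1 ships, the workaround described in this
thread is a one-line override in local.cf (0.1 was the pre-3.2.0 default):

```
# local.cf -- restore the pre-3.2.0 ham autolearn threshold
# (3.2.0 shipped with -1, which this thread found far too strict)
bayes_auto_learn_threshold_nonspam 0.1
```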
Comment 41 Rob Janssen 2007-06-07 13:07:58 UTC
(In reply to comment #28)
> I created an RFE, bug 5502, and uploaded a patch there that adds a
> _AUTOLEARNSCORE_ tag you can put in a header to see what score autolearn would
> see in messages.

I applied this patch, then tried to look up how the log line in spamd can be
redefined.  Looking in the code I found that it is hardwired (line 1623 in spamd).
Is there any reason why the log line is not configured using those tags?
For now I patched spamd to add the autolearn score in parentheses after the action:

push(@extra, "autolearn=" . $status->get_autolearn_status() .
             '(' . $status->get_autolearn_points() . ')');


Comment 42 Sidney Markowitz 2007-06-07 13:21:24 UTC
Rob, adding template tags to the spamd log would be a good enhancement. I'll
open an RFE to that effect, but I don't think it will make it into 3.2.1. Your
patch certainly makes it more convenient to track the autolearn scores than what
I was thinking of -- adding it to an X-header and then having to collect and
scan the headers of the actual emails that have been processed.
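
For reference, that X-header approach could be wired up with the standard
add_header directive, assuming the _AUTOLEARNSCORE_ tag from the bug 5502
patch is applied:

```
# local.cf -- expose the autolearn score in every scanned message
# (requires the _AUTOLEARNSCORE_ tag patch from bug 5502)
add_header all Autolearn-Score _AUTOLEARNSCORE_
```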
Comment 43 Loren Wilton 2007-06-07 17:56:44 UTC
Re comment #40:
> I agree the -1 is unrealistically low. I had to force it back to 0.1 too.

The subject of people having problems with Bayes because it learns too much 
spam as ham has been around for years, and the change of the threshold was an 
attempt to fix this.

However, it is my (perhaps faulty) recollection that most people who had to 
change the threshold to avoid learning spam as ham changed it to -0.1 
rather than -1.

It might be interesting to see if threshold values of 0 or -0.1 would still 
work for some of the people that are having problems with the -1 value.

(Alternately stated: I think the concept of lowering the threshold was 
probably reasonable.  It is just the value picked for the lowered threshold 
that seems unreasonable.  A value closer to 0, or only slightly negative, 
might solve the problems for the people with spam-as-ham problems while not 
hurting people who were fine with the old threshold.)
Comment 44 Matt Kettler 2007-06-07 21:46:20 UTC
Changing from -1.0 to -0.1 might be useful; however, the stock SpamAssassin
ruleset contains very few rules that score negatively to begin with, and
even fewer that fall between these two levels.

Ignoring Bayes (which autolearn already ignores), the only extra rules you're
adding that can cause autolearning are:

score RCVD_IN_IADB_OPTIN_GT50 0 -0.499 0 -0.245
score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
score HABEAS_CHECKED 0 -0.2 0 -0.2

And if you've got hashcash on:
score HASHCASH_20 -0.500
score HASHCASH_21 -0.700


It's a start, but I think a better long-term solution would be to do what I've
been doing on my server for quite a while. Use a very small negative score as a
threshold (-0.001) and introduce several "nice" rules with these small negative
scores.

Since the rule scores are small, you avoid the historical problem that caused
most of the negative-scoring compensation rules to be wiped out: back then,
spammers crafted their emails to rack up large numbers of these rules and
effectively whitelist the message. Here the scores are too small; you could
rack up 20 of them and only get -0.02 for your efforts.

Spammers could still do the same thing to make their messages "qualify" for
autolearning, but they'd also have to avoid all the spam rules.

The old system of having a small positive score had the problem that nonspam
autolearning was more or less "by default", as long as you didn't hit any spam
rules. This meant that some new variant spams wound up being autolearned as nonspam.

My suggestion here still has the same basic problem, but it at least adds some
hoops to jump through in order to qualify.

The biggest problem here would be crafting rules that would at least be
somewhat difficult for spammers to arbitrarily add to their messages. My rules
don't meet this criterion, as they rely on being "secret" to avoid detection,
and are largely based on "industry keywords" for my company.

Actually, that inspired me: why not make it a user-configured "goodwords" file?
Then users could add words related to *their* company and/or personal interests.
A plugin could scan for any of these words and trigger a single -0.001-scoring
rule. We'd just have to pick a good name to avoid people
thinking it was a whitelist system :)
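
A minimal sketch of what such low-scoring "nice" rules could look like in
local.cf today; the rule names and phrases here are made up for illustration:

```
# Hypothetical site-specific "goodwords" rules -- each hit is worth so
# little that spammers gain almost nothing by stuffing these phrases in.
body   __GOODWORD_PRODUCT   /\bwidget frobnicator\b/i
body   __GOODWORD_INTERNAL  /\bquarterly maintenance window\b/i
meta   LOCAL_GOODWORDS      (__GOODWORD_PRODUCT || __GOODWORD_INTERNAL)
score  LOCAL_GOODWORDS      -0.001
tflags LOCAL_GOODWORDS      nice
```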

Comment 45 Loren Wilton 2007-06-07 23:30:19 UTC
> and trigger a single -0.001 scoring rule.

To extend this slightly, along the lines of your "several small-scoring rules" 
this might be something that allowed multiple words or phrases and scored each 
one at -.001, with a cumulative limit on the score of -.05 or some such.  This 
would allow users to specify multiple things that seemed unique to them.

Of course this could be done today with simple body rules for the most part.
Comment 46 Rob Janssen 2007-06-08 00:38:42 UTC
(In reply to comment #44)

I think it is questionable practice to put a default setting in the distribution
and then expect all the users to add custom rules to keep it from failing.
When users want to do what you want, they can easily put their custom
bayes_auto_learn_threshold_nonspam value into local.cf and add their custom rules.
The default setting for bayes_auto_learn_threshold_nonspam should, IMHO, be such
that the system at least does some ham auto_learning.
My suggestion is to set it back to the old default 0.1 and leave the tinkering
with negative values to those that have an actual problem and want to spend
effort on it.
I don't see any problem at level 0.1, and have now even increased it to 0.2
because we have so many mails from hotmail that all score
AWL,BAYES_00,HTML_MESSAGE,RDNS_NONE,SPF_PASS for an autolearn score of 0.101
(and a final score of -3 or so).  I want their taglines learned as ham, not
spam, because otherwise they cause Bayes false positives when a file is sent
with little or no message body.
Comment 47 Sidney Markowitz 2007-06-08 00:58:19 UTC
> My suggestion is to set it back to the old default 0.1

We (the developers) agree. We rolled it back yesterday. See bug 5257. It will be
back to 0.1 when 3.2.1 rolls out soon.
Comment 48 Sidney Markowitz 2007-06-08 17:32:47 UTC
> It will be back to 0.1 when 3.2.1 rolls out soon.

And Daryl thought to check it into the sa-update channel for 3.2, so you'll be
getting it through there right away.
Comment 49 Rob Janssen 2007-06-09 04:01:44 UTC
Thanks. After monitoring the logs (with my spamd patch) for a day, I changed
the threshold to 0.11 on our system and it looks like it is learning OK:
149 autolearn=ham, 45 autolearn=spam, 105 autolearn=no.  No message that was
finally classified as spam was autolearned as ham.
0.1 should do fine as well, except for those hotmail messages (which are
probably a local problem).
I have now run sa-update again, seen that it indeed changed the default to
0.1, and I will monitor again for a while to check that no changes suddenly
push the auto_learn value well above 0.1 for most ham.
Comment 50 Sidney Markowitz 2007-06-09 11:32:23 UTC
Ok, I'm going to close this as FIXED even though the overall issue of how to
best handle autolearning is still something to think about. I think I will
reopen bug 5257 so we don't forget to come up with a better solution before the
next mass check scoring.
Comment 51 Ben Lentz 2007-06-15 07:03:45 UTC
After several days of burn-in, I'm happy to report that Bayes appears to be
autolearning and scoring correctly; at least, as accurately as we have become
accustomed to in the past. My FP and FN rates are back on par.

Thanks for your help!