SA Bugzilla – Bug 5497
Bayes has become unusable
Last modified: 2007-06-15 07:03:45 UTC
I'd had great success with Bayes until very recently, coinciding with the upgrade to 3.2.0. Our users have begun to see a completely unacceptable false positive rate. No other changes have been made apart from the upgrade to 3.2.0.

I have attempted to truncate my bayes database and re-learn it (it's stored in a MySQL database). Over only a weekend's time (and with no end-user input, only auto-learning input), the false positives were back, nearly always caused by a large amount of ham being tagged with BAYES_99. I also dropped the tables and recreated them from the provided .sql script, but it doesn't appear to have made a difference. Because of the end-user impact at my site, I've resorted to doing:

score BAYES_50 0
score BAYES_60 0
score BAYES_80 0
score BAYES_95 0
score BAYES_99 0

to correct the false positive rate.

I'm finding that several of the false positives are from Outlook 200x clients using Microsoft Word as a (crappy) HTML generator... and because both ham and spam share the same MS Word body content, many autolearned tokens are being recorded as spam. I'm not sure whether this specific platform has been a substantial cause of my recent issues; it's just something I thought I'd mention.

Has any of the bayes code been modified in 3.2.0? Has the learning grace period been shortened? Has the autolearn code been modified? Has the Internet trended in such a way that bayes poisoning is more common than it was a few weeks ago? If so, would you consider lowering the point values for these rules since they're growing less effective? Or am I totally nuts and it's only my site that's having a substantially harder time with bayes accuracy?

Thanks. P.S. I did mail the users mailing list before reporting this problem here.
Also, I have a copy of my "poisoned" database that was generating many FPs after this past weekend, if it's helpful. It's 2.2MB compressed.
I suspect that at least part of the reason is the change in the auto-learn ham threshold, which means that 3.2 auto-learns a lot fewer messages as ham than previous versions did. That means that if you are just auto-learning, and not calling sa-learn to learn ham, a disproportionate number of spam tokens will be learnt.
It looks like auto-learn is broken. SpamAssassin 3.2.0 at our site has not auto-learned any message as ham since May 21st, and continues to auto-learn spam. The strange thing is that I installed 3.2.0 on May 18th and it apparently continued to work OK for 3 days. bayes_auto_learn_threshold_nonspam, which defaults to 0.1 in the AutoLearnThreshold plugin, is set to -1.0 in 10_default_prefs.cf. I have set it back to 0.1 in our local conf, but so far it does not seem to help. This is destroying the bayes database! What has happened?
> I suspect that at least part of the reason is the change in the auto-learn ham
> threshold which means that 3.2 auto-learns a lot fewer messages as ham than
> previous versions did. Which means that if you are just auto-learning, and not
> calling sa-learn to learn ham, a disproportionate number of spam tokens will be
> learnt.

I can definitely say that a vast majority (95%+) of our learning is from the auto-learning system. It has been difficult to get our users to feed significant amounts of both ham and spam to our sa-learn feedback mechanism. Any change in the auto-learn thresholds is likely affecting the accuracy of our database. Is this something simple I can change to revert back to the pre-3.2.0 setting? What was the previous value for bayes_auto_learn_threshold_nonspam?

> USER OPTIONS
> The following configuration settings are used to control auto-learning:
>
> bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
>     The score threshold below which a mail has to score to be fed
>     into SpamAssassin's learning systems automatically as a non-spam
>     message.
>
> bayes_auto_learn_threshold_spam n.nn (default: 12.0)
>     The score threshold above which a mail has to score to be fed
>     into SpamAssassin's learning systems automatically as a spam
>     message.
>
>     Note: SpamAssassin requires at least 3 points from the header, and
>     3 points from the body, to auto-learn as spam. Therefore, the
>     minimum working value for this option is 6.
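For reference, restoring the pre-3.2.0 value only needs an override in local.cf; a minimal sketch using the settings quoted in the man page excerpt above (0.1 being the old default):

  bayes_auto_learn 1
  bayes_auto_learn_threshold_nonspam 0.1
  bayes_auto_learn_threshold_spam 12.0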
For info: I have backed out the sa-update by running rm -r /var/lib/spamassassin/3.002000 and restarting spamd, and now it auto-learns again. So the problem is probably in the updated cf files loaded by sa-update.
I can see the change in the ham autolearn threshold having this effect in interaction with any rule that hits, for example, MS Word content pasted into Outlook Express mail, such as EXTRA_MPART_TYPE. We tolerate a rule with a score as low as 1.0 even if it FPs, with the justification that it is not scored high enough to push ham over the spam threshold. But here is an example of an unintended side effect: if the result is to cause all of a certain class of mail to have its spam instances learned but none of its ham instances learned, then to the degree that Bayes finds tokens that indicate that class, they will be incorrectly learned as spam signs. I don't have an immediate answer to this, but clearly a result that causes all MS Word content pasted into OE mail to be labeled as spam is incorrect.
(In reply to comment #5)
> For info: I have backed out the sa-update by
> rm -r /var/lib/spamassassin/3.002000 and restarting spamd, and now it
> auto-learns again. So the problem probably is in the updated cf files
> loaded by sa-update.

The 3.2.0 release and all 3.2.0 updates have identical bayes_auto_learn* values. Can you identify what exactly you are seeing different? Any useful debug output?

10_default_prefs.cf:bayes_auto_learn_threshold_nonspam -1.0
10_default_prefs.cf:bayes_auto_learn_threshold_spam 12.0
10_default_prefs.cf:bayes_auto_learn 1
We have spamd scanning all incoming messages with a single bayes db. What I see is that not a single message after May 21st caused an autolearn=ham entry in the mail log. All have either autolearn=no or autolearn=spam, even when the score is -2.5 or so. No wonder that the bayes db only knows about spam and not about ham after some time, and the bayes score creeps upward. After I removed the update (as described), the first ham message that came in, with a score of -2 due to AWL, immediately logged autolearn=ham. It may well be that the thresholds have not changed, but it looks like the special handling of scores for the bayes autolearn decision has changed in such a way that no message falls below the nonspam threshold anymore.
Isn't it the case that AWL is not included in computing the score for the autolearning threshold? Can you run a few ham mails through spamassassin -t with the updates included and see if there is some rule that consistently FPs with enough points to preclude autolearning as ham? Actually, I'm confused as to how an autolearn ham threshold of -1 can work at all. I would think that we don't have enough negative scoring rules to catch a significant amount of ham at that threshold no matter what.
About 5000 messages from very diverse sources were scanned during that interval, and NONE of them were marked as autolearn=ham. The first one after I undid the update was OK. I'd say it is not something common among the messages, but something wrong in the update. The auto_learn score, which is compared against the thresholds specified in the config, does not seem to be logged or displayed anywhere, so it is difficult to see what is really happening. I tried to do a diff between base 3.2.0 and the update. It looks like a big change was made to the scoring, so it is difficult to pinpoint where the problem is.
> I'd say it is not something common among the messages,
> but something wrong in the update

What I am suggesting is that, to find out what about the update is wrong, you use spamassassin with the -t option, with the updates included, to run a sample of the ham messages that you would have expected to have been learnt. The -t option shows you exactly which rules fired and how many points they added to or subtracted from the score. If you do that you may be able to see if some rule or rules from the update consistently fire, preventing autolearning of ham.
Sorry, I missed this in a previous response:

> bayes_auto_learn_threshold_nonspam which defaults to 0.1 in the
> AutoLearnThreshold plugin is set to -1.0 in 10_default_prefs.cf
> I have set it back to 0.1 in our local conf but so far it does not seem to help.

And I agree... when would a ham email ever score -1.0 under normal conditions? Holy smokes! Even an SPF/DK/DKIM email gets -0.0 for each test. Without a BAYES_00 or HABEAS/HASHCASH rule, things are rarely below 0.0. I can also confirm that I've been updating my 3.2.0 system regularly with sa-update. So if this is indeed an issue introduced by an sa-update that re-scored rules (it seems it's been confirmed that the bayes code itself, at least on the surface, has remained unchanged), then it's possible that I am exhibiting it. Should I set bayes_auto_learn_threshold_nonspam "back" to 0.1, remove the sa-update config files, disable my regular sa-updating, destroy my current bayes database and start over?
> Should I set bayes_auto_learn_threshold_nonspam "back" to 0.1,
> remove the sa-update config files, disable my regular sa-updating,
> destroy my current bayes database and start over?

I don't think that the auto_learn threshold should ever be blindly set to an arbitrary number. I might even argue that we should not try to provide a default value for the autolearn threshold.

You can look over a large sample of ham and spam to see what makes sense as thresholds at your site. Without having to carefully verify every mail, you can probably come up with a pretty good number for a score below which you can be pretty certain that you hardly ever see spam, and another number for a score above which you never see ham. Then use those numbers.

If the bug here is that an auto_learn ham threshold of -1 makes no sense, then the fix is to change the threshold to something that makes sense for your installation, not to eliminate sa-update. On the other hand, if this problem appears after an sa-update, I really would like to see what you can learn about the effect of the sa-updated rules by looking at the output of spamassassin -t, as I mentioned in comment #11. There could be something not quite right about some updated rule even if the overall problem can be characterised as a bad auto_learn ham threshold.
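For illustration, such a site-tuned override in local.cf might look like the following; the two numbers are placeholders and should come from your own survey of local scores, not from this example:

  # hypothetical values derived from a local score survey
  bayes_auto_learn_threshold_nonspam 0.2
  bayes_auto_learn_threshold_spam 12.0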
The only changes in the current updates are (unless someone has changed a sandbox rule that ships in the base release):

- URIBL_BLACK and URIBL_GREY now work again
- the new sandbox rules added, with scores in 72_scores.cf
- some fixes to the comment blocks in some files
I can say, with great certainty, that the default values provided in all versions prior to 3.2.0 seemed to work great at my site without any tweaking or heavy analysis. I don't wish to change anything blindly; instead, I very specifically want back the functionality I had before the upgrade. I am not asserting that sa-update was the cause, I was only offering it as a relevant piece of my SA configuration.
Ben, please, I understand that you would just like to revert to a time when this problem did not exist, but it would help in moving forward to understand exactly how things are going wrong. One apparent problem is that as of version 3.2 the threshold for ham was set in 10_default_prefs.cf as -1.0, which it is hard to imagine would trigger on much, if any, ham. Another apparent problem seems to have something to do with a change that was installed by sa-update, as we see from comment #5, and that seems to happen even with the threshold set to 0.1.

Rob Janssen and Ben Lentz, to figure out exactly what is going on I would like to see the results of running spamassassin -t -D on a message that should be autolearned as ham but is not, in a system that has been updated by sa-update. Since it would have to be ham, pick something that does not have any private information in it, to make it as simple as possible to clean it of anything you do not want published. Anything long should be attached to this bug using the Create a New Attachment link, not pasted into a comment.

Whether or not it makes sense to increase the ham autolearn threshold, comment #5 hints at there being something else wrong that we need to fix, and to do that we need some information.
I want to reference bug 5257 here as related to this, so we have it documented. That's the bug that was the justification for changing the default ham autolearning threshold to -1.0.
> I would like to see the results of running spamassassin -t -D on a message
> that should be autolearned as ham but is not, in a system that has been
> updated by sa-update.

Hey Sidney, I completely understand that, too. Can you define for me what you mean by "should be" autolearned as ham? By the _current_ definition of the default SA 3.2 config, I would interpret "should be" as a message that otherwise scores <= -1.0 without bayes, which even Justin admits in bug 5257 is 1.21% of ham. It might take me a while to locate such a rare message... Or do you mean "should be" as in the _old_ definition of ham, e.g. <= 0.1? That, I imagine, would be much easier. Thanks for the help thus far...
> Can you define for me what you mean by
> "should be" autolearned as ham?

I think I'm guilty of conflating your comments with Rob Janssen's comments. If the only thing that is the matter with your setup is that an autolearn threshold for ham of -1.0 is too low to ever trigger, then all you need to do is set it to a number that does work for you. Whether you determine that number from past experience with version 3.1 and make it 0.1, or you actually look at the range of scores in your ham and pick a number that will capture most of the true ham with almost no false hits, either way will be better than what you have now.

But if, like Rob, you find that after running sa-update even a ham autolearn threshold of 0.1 never learns, then we need to find out why _that_ happens, and that's what the test with spamassassin -t -D would be for. So for that test I would want to see the results for something that should be learned as ham with the higher threshold of 0.1 but is not.

I guess your next steps are simple. Change the threshold in your local.cf to 0.1, clear your Bayes database, and see if you start autolearning any ham. If it works, you are done with this problem. If, like Rob, you don't start autolearning ham, then run spamassassin -t -D on some ham that you are pretty sure should have been autolearned and we'll try to figure out why.
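Spelled out, the suggested test sequence would be roughly the following (the message file name is a placeholder; the -D debug output goes to stderr, hence the redirect):

  # in local.cf
  bayes_auto_learn_threshold_nonspam 0.1

  # then clear the existing Bayes data and re-test a known-ham message
  sa-learn --clear
  spamassassin -t -D < some-ham-message.eml 2>&1 | grep 'learn:'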
Maybe my conclusion that it is ONLY related to the update was a bit quick, but I still cannot explain the course of events...:

- before installing 3.2.0 it worked fine. messages were regularly autolearned as ham, and most spam was autolearned as such. of course there were also a lot of autolearn=no messages.
- after installing 3.2.0 (and without touching the threshold) it looks like it worked for 3 days and then no more ham was learned. it may be that after 3 days the sa-update was installed, but I think I ran that manually immediately after install. it could also be that there was a new update around the 21st.
- yesterday I backed out the update and it seemed to work, but now I see that it has only autolearned two out of 100 messages, so probably there is still a problem.

Now I have manually set the threshold to 0.1 and will watch it again for a day before doing the sa-update again, to see if that changes anything.

Sure, I tried some "spamassassin -t", but there is little that can be learned from its output, because there is no indication of which scores are actually counted for the auto_learn and what the auto_learn value finally is. There seems to be a lot of magic in the code that decides whether to auto-learn or not, and only the final decision is logged.
> I tried some "spamassassin -t" but there is little that
> can be learned from its output

That's why you would also use -D, but that produces so much output that you would not use it until you have determined that there is a bug to be found. So first see what happens when you test carefully with different thresholds, with and without the updates. If you can show that it is the update that makes the difference even with the higher threshold, then running spamassassin -t -D can produce heaps of information that we can pore through to figure out what is going on.
There's an assumption here that the Bayes subsystem cannot deal with training databases heavily biased in one direction over the other. However, I'm pretty sure this is not correct. The bayes algorithms take this into account, so it should be compensated for just fine. Hence, the small number of ham hit with a -1.0 threshold (about 1.21% of ham mail according to mass-checks, see bug 5257) should be no problem for it.

If your site sees a different ham/spam ratio, however, you may want to make a site-specific customisation to increase your autolearn-ham threshold...
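Roughly speaking, the per-token estimate is normalised by the total learned-message counts, which is why an absolute imbalance in the database mostly washes out. A simplified sketch of that calculation (not the exact code in SpamAssassin's Bayes module):

  P(spam|token) ~= (s/S) / (s/S + h/H)

where s and h are the numbers of spam and ham messages the token was seen in, and S and H are the total numbers of spam and ham messages learned.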
> I don't think that the auto_learn_threshold should ever be blindly set to an
> arbitrary number. I might even argue that we should not try to provide a
> default value for the autolearn threshold.
>
> You can look over a large sample of ham and spam to see what makes sense as
> thresholds at your site. Without having to carefully verify every mail you
> probably can come up with a pretty good number for a score below which you can
> be pretty certain that you hardly ever see spam and another number for a score
> above which you never see ham. Then use those numbers.

In passing, this was suggested ages ago in bug 1829 as an alternative autolearning algorithm; unfortunately that bug got derailed by random discussion and was never implemented :( It's a good idea.
Here is an example of spamassassin -t -D output:

[12770] dbg: plugin: Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH(0x8ee5b7c) implements 'autolearn_discriminator', priority 0
[12770] dbg: learn: auto-learn: currently using scoreset 3, recomputing score based on scoreset 1
[12770] dbg: learn: auto-learn: message score: -2.43672972972973, computed score for autolearn: 0.001
[12770] dbg: learn: auto-learn? ham=0.1, spam=6, body-points=0.001, head-points=0.001, learned-points=-2.599
[12770] dbg: learn: auto-learn? yes, ham (0.001 < 0.1)
[12770] dbg: learn: initializing learner
[12770] dbg: learn: learning ham

Now it auto-learns, but clearly only because I have now changed the threshold. The rules hit by this message are: AWL,BAYES_00,HTML_MESSAGE. This is the most common result for our incoming ham mail. HTML_MESSAGE scores 0.001 and the other two are negative scores, but apparently they aren't counted. It looks like the main cause of trouble is the negative nonspam threshold. It may be that the problem gets worse when I install the update again, because there were a couple of new scores in there. But first I want to observe this setting for a day.
> The bayes algorithms take this into account, so it
> should be compensated for just fine

It should compensate for different absolute numbers of ham vs spam in the collection, but it can't compensate for a collection process that biases against some class of ham. For example, consider that all Outlook Express mail that contains embedded graphics in HTML as cid MIME objects triggers the EXTRA_MPART_TYPE rule for 1.0 point. No ham that has that will be autolearned, and all high-scoring spam that has it will. If there are any tokens that are characteristic of that kind of mail, the effect will be to amplify the EXTRA_MPART_TYPE FP from producing just 1 extra point to producing 1 point plus a high score from Bayes. That's how I interpreted what is going on here. The summary describes that kind of amplification of FPs on tokens found in MS Word generated HTML.
(In reply to comment #13)
> You can look over a large sample of ham and spam to see what makes sense as
> thresholds at your site. Without having to carefully verify every mail you
> probably can come up with a pretty good number for a score below which you can
> be pretty certain that you hardly ever see spam and another number for a score
> above which you never see ham. Then use those numbers.

Is there a flag or setting that allows you to do this? I.e. something that I can invoke on a directory full of ham and spam messages and that prints a list of the auto_learn values of those messages? I know how to list the final spamassassin scores of a set of messages, but that is of little or no value in this case because the main negative scorers are not counted for auto_learn.
> that is of little or no value in this case because the main
> negative scorers are not counted for auto_learn

That's a good point. Perhaps we should add another word to the header template for showing the autolearn score. That would be very easy to do.
I created an RFE, bug 5502, and uploaded a patch there that adds a _AUTOLEARNSCORE_ tag you can put in a header to see what score autolearn would see in messages. The patch should apply cleanly to either trunk or the 3.2 branch.
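With that patch applied, the tag could presumably be exposed through the normal add_header mechanism; something like this in local.cf (the header name here is an arbitrary example):

  add_header all AutoLearn-Score _AUTOLEARNSCORE_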
(In reply to comment #22)
> the small number of ham hit with a -1.0 threshold (about 1.21% of ham
> mail according to mass-checks, see bug 5257) should be no problem to it.

If the change of the threshold to -1.0 was intended to reduce the number of ham messages fed into bayes, in my view it is the wrong solution to the problem. Moving a threshold into the Gaussian periphery results in a skewed/distorted view being presented to autolearning. If the only intention is to reduce the amount of ham fed to the autolearner, some decimation can be used, like comparing a random generator value to some threshold; say, 0.2 would keep every fifth ham on average for autolearning and ignore the rest.
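As a sketch of that decimation idea (hypothetical Perl, not existing SpamAssassin code; $would_learn_as_ham and $ham_sample_rate are made-up names):

  # hypothetical: keep roughly every fifth ham for learning, skip the rest
  my $ham_sample_rate = 0.2;
  if ($would_learn_as_ham) {
      return 0 unless rand() < $ham_sample_rate;
  }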
> The summary describes that kind of amplification of FPs on tokens found in
> MS Word generated HTML.

This was simply an observation that ham messages contained similar tokens to spam messages (I verified this by hand), and it should not be assumed that this is the main cause of the FP amplification. There were certainly hams scoring BAYES_99 that were encoded entirely in plain text, too.
Just a newbie comment, but: shouldn't the goal be to autolearn ~80% of spam and also to autolearn ~80% of ham, too? Why is it a good thing(tm) to target autolearning ~1% of ham? Just in general?
My interpretation of the comments in bug 5257 is that people were reporting problems with a threshold of 0.1 because too many low-scoring spams were being incorrectly learned, so the ham threshold was lowered to -1.0. The comment there about it getting 1.21% of the ham was not to say that such a small percentage is desirable; it meant that in our mass-check corpus a threshold of -1.0 did find 1.21%, and that is enough for Bayes to function. It sounds like at your site a -1.0 threshold gets much less than 1.21% of the ham, and it isn't enough for Bayes to function.
Don't these autolearn thresholds need to be _proportional_? Using your corpus as a metric, if you're only autolearning 1.21% of ham, don't you want to adjust the bayes_auto_learn_threshold_spam value to only autolearn 1.21% of spam, too? Otherwise, what I'm seeing at my site (regardless of size, I believe) is that the bayes data is much less accurate and is more easily influenced by bayes poisoning (image spam + normal dictionary words, sentences, and paragraphs hidden behind or below the image)... without a proportional ham differential.
perhaps the best fix, then, is to make autolearning as *spam* only happen if, by doing so, the database counts aren't rendered unbalanced? e.g. if autolearning would cause one type to include 2x as many mails as the other type, then skip it.
> perhaps the best fix, then, is to make autolearning as *spam* only happen
> if by doing so, the database counts aren't rendered unbalanced?

So, before doing an autolearn, it would check the counts in the bayes database beforehand to make sure the additional learn wouldn't unbalance the database? Sounds like a sound solution, although it sounds like it'd be quite a chunk of work... Currently, my ratio is approximately 16:1 spam learned to ham learned.
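A rough sketch of what such a check might look like (entirely hypothetical; $store, $learning_as_spam and the 2x limit are made up for illustration, and this is not actual SpamAssassin code):

  # hypothetical pre-learn balance check
  my ($nspam, $nham) = $store->nspam_nham_get();   # current learned-message counts
  if ($learning_as_spam && $nham > 0 && $nspam >= 2 * $nham) {
      dbg("learn: skipping spam autolearn, db already at $nspam:$nham");
      return 0;
  }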
No, I think that the principle that the Bayes learner is relatively insensitive to the absolute numbers of ham and spam is correct. The problem is that it is sensitive to getting non-representative samples. To the degree that tokens in the very lowest scorers are not representative of the tokens in all ham, the learner will not be accurate. Autolearning is a substitute for a mechanism in which all ham and all spam are correctly learnt. The more you get away from that ideal by being conservative with the threshold, the weaker Bayes will become.
> perhaps the best fix, then, is to make autolearning as *spam* only happen
> if by doing so, the database counts aren't rendered unbalanced?

I don't think this is a good idea; I hope there is a better solution, like dropping old spam. In my experience it is important that new spam samples are incorporated into a bayes db soon, so that the whole system is able to react to new spam profiles and techniques. If a slow trickle of ham were able to block learning of new spam, it would block such a swift response.
> No, I think that the principle that the Bayes learner is relatively
> insensitive to the absolute numbers of ham and spam is correct. The problem
> is that it is sensitive to getting non-representative samples.

I think my experience is to the contrary. I believe my bayes database has taken a serious accuracy hit as a result of the lower number of autolearned hams caused by the threshold change introduced in 3.2.0. I'd say it's very sensitive, given that the characteristics of what I'm learning and what I'm not have otherwise remained unchanged.
(In reply to comment #32)
> My interpretation of the comments in bug 5257 is that people were reporting
> problems with a threshold of 0.1 because of too many low scoring spam being
> incorrectly learned so the ham threshold was lowered to -1.0.

It is not clear whether the problem is that spam is incorrectly learned as ham, or that there are just too many learning operations going on on a heavily loaded system and it would be desirable to cut them down a bit. When the latter is the actual problem, I suggest the method from comment #29 be used, not a change of the threshold.

The problem is that in our case we apparently get no messages below an auto_learn score of -1.0 at all. Even at 0.1 there are many ham messages not handed to auto_learn, because it is so easy to get above 0.1 when AWL and BAYES_00 are not counted. Rules like RDNS_NONE fire half of the time here, HTML_MESSAGE almost always. This means that even a slight change in the scoring will easily disable autolearn=ham. This probably explains the change in behavior when installing the update to 3.2.0.

I will try the mentioned patch and see what a more reasonable value for the threshold is in our case. I can understand that you want to avoid feed-forward lockups by excluding the score of BAYES_xx from the calculation, and to a lesser extent I can understand the exclusion of AWL, but all together it makes auto_learn quite fragile.

Something that also affects our Bayes DB is that we are a locally operating company where 99+% of all mail is in Dutch. So the Bayes engine has learned over time that Dutch=HAM and English=SPAM. This normally works well, but when someone sends a message from a freemail provider that tags an English commercial under each mail, and sends only an attachment with little body text, it is scored at BAYES_80 or more and lifted over our spam threshold by simple things like omitting the subject. And those messages are never learned as ham, because those freemail providers invariably score points in the "ignorance" and "HTML" categories. So our Bayes DB never learns that "Choose the right car based on your needs. Check out Yahoo! Autos new Car Finder tool." does not really mean the message is SPAM.
> Problem is that in our case we apparently get no messages below
> auto_learn -1.0 at all. Even at 0.1 there are many ham messages
> not handed to auto_learn because it is so easy to get above 0.1...

You are not alone. See recent topics on the mailing list:

- bayes autolearn - nonspam threshold
- Bayes problem: very large spam/ham ratio

From Fletcher Mattox:

  After years of stability, my bayes db is doing poorly. When I first noticed
  it, it was classifying lots of ham BAYES_99, I cleared the db and started
  over. Now it finds *very* few ham. [...] The first three lines are the only
  autolearned ham. That's it. All day. There were thousands of hams which were
  not learned. Notice the quantum leap between -4.299 and 12 in the score used
  by auto-learn [...] I have restored the original 0.1 and bayes has started
  working again for me. I am now auto-learning much more ham than I was
  yesterday. [...]

From Duane Hill:

  Therefore, anyone using SA 3.2 who was on a prior version without the
  bayes_auto_learn_threshold_nonspam setting will ultimately have to set the
  value now to 0.1. I agree the -1 is unrealistically low. I had to force it
  back to 0.1 too.
(In reply to comment #28)
> I created an RFE, bug 5502, and uploaded a patch there that adds a
> _AUTOLEARNSCORE_ tag you can put in a header to see what score autolearn would
> see in messages.

I applied this patch, then tried to look up how the log line in spamd can be redefined. Looking in the code I found that it is hardwired (line 1623 in spamd). Is there any reason why the log line is not configured using those tags? For now I patched spamd to add the autolearn score in parentheses after the action:

  push(@extra, "autolearn=".$status->get_autolearn_status().'('.$status->get_autolearn_points().')');
Rob, adding template tags to the spamd log would be a good enhancement. I'll open an RFE to that effect, but I don't think it will make it into 3.2.1. Your patch certainly makes it more convenient to track the autolearn scores than what I was thinking of -- adding it to an X-header and then having to collect and scan the headers of the actual emails that have been processed.
Re comment #40:
> I agree the -1 is unrealistically low. I had to force it back to 0.1 too.

The subject of people having problems with Bayes because it learns too much spam as ham has been around for years, and the change of the threshold was an attempt to fix this. However, it is my (perhaps faulty) recollection that most people who had to change the threshold to avoid learning spam as ham changed it to -0.1 rather than -1. It might be interesting to see if threshold values of 0 or -0.1 would still work for some of the people who are having problems with the -1 value.

(Alternately stated: I think the concept of lowering the threshold was probably reasonable. It is just the value picked for the lowered threshold that seems unreasonable. A value closer to 0, or slightly negative, might solve the problems for the people with spam-as-ham problems while not hurting people who were fine with the old threshold.)
Changing from -1.0 to -0.1 might be useful; however, the stock SpamAssassin ruleset contains very few rules that are negative-scoring to begin with, and even fewer that fall between these two levels. Ignoring bayes (as it's ignored), the only extra rules you're adding that can cause autolearning are:

score RCVD_IN_IADB_OPTIN_GT50 0 -0.499 0 -0.245
score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
score HABEAS_CHECKED 0 -0.2 0 -0.2

And if you've got hashcash on:

score HASHCASH_20 -0.500
score HASHCASH_21 -0.700

It's a start, but I think a better long-term solution would be to do what I've been doing on my server for quite a while: use a very small negative score as a threshold (-0.001) and introduce several "nice" rules with these small negative scores. Since the rule scores are small, you can't get the historical problem that caused most of the negative-scoring compensation rules to be wiped out, where spammers crafted their emails to rack up large numbers of these rules and effectively whitelist the message. Here the scores are too small; you could rack up 20 of them and only get -0.02 for your efforts. Spammers could still do the same thing to make their messages "qualify" for autolearning, but they'd also have to avoid all the spam rules.

The old system of having a small positive score had the problem that nonspam autolearning was more or less "by default", as long as you didn't hit any spam rules. This meant that some new variant spams wound up being autolearned as nonspam. My suggestion here still has the same basic problem, but it at least adds some hoops to jump through in order to qualify.

The biggest problem here would be crafting rules that would at least be somewhat difficult for spammers to arbitrarily add to their messages. My rules don't meet this, as they rely on being "secret" to avoid detection, and are largely based on "industry keywords" for my company.

Actually, that inspired me: why not make it a user-configured "goodwords" file? Then users could add words related to *their* company and/or personal interests. A plugin could scan for any of these words and trigger a single -0.001 scoring rule. We'd just have to pick a good name to avoid people thinking it was a whitelist system :)
> and trigger a single -0.001 scoring rule.

To extend this slightly, along the lines of your "several small-scoring rules": this might be something that allowed multiple words or phrases and scored each one at -0.001, with a cumulative limit on the score of -0.05 or some such. This would allow users to specify multiple things that seem unique to them. Of course this could be done today with simple body rules for the most part.
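For what it's worth, a minimal hand-rolled version of such a rule in today's config syntax might look like this (the rule name and keyword are made up for the example):

  body     LOCAL_GOODWORD_EXAMPLE    /\bfrobnicator maintenance\b/i
  describe LOCAL_GOODWORD_EXAMPLE    Mentions one of our industry keywords
  score    LOCAL_GOODWORD_EXAMPLE    -0.001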
(In reply to comment #44)

I think it is questionable practice to put a default setting in the distribution and then expect all users to add custom rules to keep it from failing. When users want to do what you describe, they can easily put their custom bayes_auto_learn_threshold_nonspam value into local.cf and add their custom rules. The default setting for bayes_auto_learn_threshold_nonspam should, IMHO, be such that the system at least does some ham auto-learning. My suggestion is to set it back to the old default of 0.1 and leave the tinkering with negative values to those who have an actual problem and want to spend effort on it.

I don't see any problem at level 0.1 and have now even increased it to 0.2, because we have so many mails from Hotmail that all score AWL,BAYES_00,HTML_MESSAGE,RDNS_NONE,SPF_PASS for an autolearn score of 0.101 (and a final score of -3 or so). I want their taglines learned as ham, not spam, because otherwise they FP the bayes check when a file is sent with little or no message body.
> My suggestion is to set it back to the old default 0.1

We (the developers) agree. We rolled it back yesterday. See bug 5257. It will be back to 0.1 when 3.2.1 rolls out soon.
> It will be back to 0.1 when 3.2.1 rolls out soon.

And Daryl thought to check it into the sa-update channel for 3.2, so you'll be getting it through there right away.
Thanks. After monitoring the logs (with my spamd patch) for a day, I had changed the threshold to 0.11 on our system, and it looks like it is learning OK: 149 autolearn=ham, 45 autolearn=spam, 105 autolearn=no. No message that was finally classified as spam autolearned as ham. 0.1 should do fine as well, except for those Hotmail messages (which are probably a local problem). I have now installed the sa-update again, see that it indeed changed the default to 0.1, and I will monitor again for a while to see if there are no changes that suddenly push the auto_learn value well above 0.1 for most ham.
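For anyone wanting to do the same kind of tally without patching spamd, a quick count over the spamd log (the log path is site-specific) is something like:

  grep -o 'autolearn=[a-z]*' /var/log/maillog | sort | uniq -c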
Ok, I'm going to close this as FIXED even though the overall issue of how to best handle autolearning is still something to think about. I think I will reopen bug 5257 so we don't forget to come up with a better solution before the next mass check scoring.
After several days of burn-in, I'm happy to report that bayes appears to be autolearning and scoring correctly, or at least as accurately as we have become accustomed to in the past. My FP and FN rates are back on par. Thanks for your help!