SA Bugzilla – Bug 2975
bayes_seen database uncontrolled growth
Last modified: 2005-06-01 10:40:22 UTC
The bayes_seen database needs built-in expiration, like the bayes_toks database, to prevent uncontrolled growth. In small installations this will probably go unnoticed; in large installations with thousands of users, however, it can easily account for substantial resource consumption. It seems that message-ids could safely be forgotten after a few weeks at most.
yeah, you're right. :( FWIW, just nuking the db with "rm" once a month would probably do the trick acceptably as an interim measure...
Subject: Re: bayes_seen database uncontrolled growth
On Wed, Jan 28, 2004 at 12:12:47PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> FWIW, just nuking the db with "rm" once a month would probably do the trick
> acceptably as an interim measure...
sorta. killing seen makes tie (at least r/o) fail.
Yep, I train my filter at home and copy the tokens to the university where they are used to filter my mail. Since I have very limited disk space there and the bayes_seen file is only useful at home I'd like to be able to leave it out without producing error messages while filtering. Regards, Christian
Michael, I haven't checked but I assume that the SQL Bayes code takes care of this problem?
Subject: Re: bayes_seen database uncontrolled growth
On Tue, Apr 13, 2004 at 11:15:12AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Michael, I haven't checked but I assume that the SQL Bayes code takes care of
> this problem?
No. You can fake it with a lastupdate column in MySQL and just expire by hand, but we don't do anything explicit for expiry.
Michael
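For illustration only, the "lastupdate column, expire by hand" workaround might look roughly like the sketch below. It uses SQLite from the Python standard library as a stand-in for MySQL, and the table layout (msgid, flag, lastupdate) is a simplified, hypothetical one rather than the schema SpamAssassin ships. The helper names create_schema, record_seen and expire_seen are made up for this example.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400

def create_schema(conn):
    # Hypothetical, simplified table; lastupdate holds a Unix timestamp.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bayes_seen ("
        " msgid TEXT PRIMARY KEY,"
        " flag TEXT NOT NULL,"
        " lastupdate INTEGER NOT NULL)"
    )

def record_seen(conn, msgid, flag, now=None):
    # Store (or overwrite) the entry and stamp it with the current time.
    now = int(time.time() if now is None else now)
    conn.execute(
        "INSERT OR REPLACE INTO bayes_seen (msgid, flag, lastupdate)"
        " VALUES (?, ?, ?)",
        (msgid, flag, now),
    )
    conn.commit()

def expire_seen(conn, max_age_days, now=None):
    # The manual expiry step: drop every row older than the cutoff.
    now = int(time.time() if now is None else now)
    cutoff = now - max_age_days * SECONDS_PER_DAY
    cur = conn.execute("DELETE FROM bayes_seen WHERE lastupdate < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of msgids forgotten
```

Something like expire_seen(conn, 30) could then be run periodically from cron; nothing in the Bayes code itself would call it.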
Does the Bayes framework allow for an expiration to occur? Could it be rolled in with the token auto expiration code?
Subject: Re: bayes_seen database uncontrolled growth
On Tue, Apr 13, 2004 at 11:29:45AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Does the Bayes framework allow for an expiration to occur? Could it be rolled
> in with the token auto expiration code?
No it doesn't. It would take a slight re-design and probably more thought. I see the following possible attributes of things we might want to support in the future:
1) Straight date based expiry. Note Date/Time when msgid was first learned and after N days expire all msgids > N days old. We can then run this at the same time as expiration or some other process could handle this.
2) Similar to 1 but update the timestamp if we try to re-learn the msgid. The thinking here is that for some reason this msgid was re-examined so lets keep it around a little bit longer to avoid re-learning.
3) Keep no record of learned msgids and allow exhaustive learning or explicitly disable learning in this case. This would help folks who learn on one box and copy the bayes_toks file to a production box and have auto_learn turned off. It would also allow for multiple learns on the same message (ie exhaustive learning, see http://garyrob.blogs.com/garys_longer_rants/2004/02/instructions_fo.html via jmason).
4) Log how many times a message has been learned, again see exhaustive learning stuffs.
5) Tie which tokens were learned from a particular msgid and then expire by msgid instead of by token atime.
6) All the various combinations of all of the above.
Anything else?
Michael
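To make options 1) and 2) above concrete, here is a minimal sketch in Python. It is not SpamAssassin code: SeenStore is a hypothetical in-memory stand-in for bayes_seen, and the parameter names (max_age_days, refresh_on_relearn, clock) are assumptions chosen for the example. With refresh_on_relearn=False it behaves like option 1; with True it behaves like option 2.

```python
import time

SECONDS_PER_DAY = 86400

class SeenStore:
    """Hypothetical stand-in for bayes_seen: msgid -> (flag, timestamp)."""

    def __init__(self, max_age_days=30, refresh_on_relearn=False, clock=time.time):
        self.max_age = max_age_days * SECONDS_PER_DAY
        self.refresh_on_relearn = refresh_on_relearn
        self.clock = clock  # injectable so expiry can be exercised without waiting
        self.entries = {}

    def learn(self, msgid, flag):
        """Return True if the message is new and should be learned."""
        now = self.clock()
        if msgid in self.entries:
            if self.refresh_on_relearn:
                # Option 2: a re-learn attempt keeps the msgid around longer.
                old_flag, _ = self.entries[msgid]
                self.entries[msgid] = (old_flag, now)
            return False
        # Options 1 and 2: note the time the msgid was first learned.
        self.entries[msgid] = (flag, now)
        return True

    def expire(self):
        """Forget msgids older than max_age; return how many were dropped."""
        cutoff = self.clock() - self.max_age
        stale = [m for m, (_, ts) in self.entries.items() if ts < cutoff]
        for m in stale:
            del self.entries[m]
        return len(stale)
```

expire() could be called alongside token expiry or from a separate process, as option 1 suggests.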
moving accuracy and some bugs to 3.1.0 milestone
more accuracy and performance bugs going to 3.1.0 milestone
*** Bug 2771 has been marked as a duplicate of this bug. ***
we probably should have some way of doing this in 3.1.0 -- even if it's just a support script that wipes out the db and replaces it with a new, empty one.
Justin, my sentiments exactly. A lock-safe equivalent of rm -f bayes_seen would be a pretty desirable tool anyway, and offers at least an interim solution.
As for options to solve this the "right way", I just sent this to Michael off-line regarding comment #7 and figured this should be echoed here (with more thoughts added to 3).
I'd say option 2) is the most consistent with how SA handles expiry of tokens, and seems the most sensible option. 1) could be workable as well, but 2) strikes me as an improvement.
3) could be implemented as an option that simply disables the bayes_seen portion entirely, and isn't very relevant to expiry as it works by eliminating the need. Unless SA is redesigned to exclusively do things this way, you'll need 1, 2, or 5. I don't think that in the general case you want to use this as your normal mode of operation. Protecting against accidental re-learning is good in most environments.
4) While an interesting idea, this doesn't address or solve the problem of expiry. If you went this way you'd still need 1, 2, or 5 to solve the boundless-growth problem. See also thoughts on 3.
5) sounds extensively complicated, bulky in terms of storage demand, and of limited gain. I think you'll find that this mechanism would allow the bayes_seen to grow more-or-less without bound anyway. I state this based on the theory that it only takes one unexpired token to retain a message ID, and the majority of messages you train are going to have at least one "frequently seen" token that keeps getting learned in new messages. SA's token expiry is going to favor keeping these "frequently seen" tokens when it expires tokens, because they're going to have a short delta-atime. (And it would be right to keep them, as statistically speaking they are the best candidates to keep.)
In summary, it sounds like the best thing to do would be 2.
let's at least try to think about a quick fix for this for 3.1.0
ok, trivial fix for file-based dbs in 3.1.0:
- we maintain two bayes_seen files: bayes_seen and bayes_seen_old.
- if bayes_seen_old doesn't exist (eg. post-upgrade) it's treated as empty
- if bayes_seen doesn't exist, both files are treated as empty
- once stat(bayes_seen) reports that the file creation time is greater than N days ago, bayes_seen_old is unlinked, bayes_seen is moved to bayes_seen_old, and a new, empty bayes_seen file is created. N would be 90 days by default, let's say.
that's very easy and pretty fast to implement, and deals with the problem without adding more fields or upgrading the db format. doesn't help for SQL dbs, or for auto-whitelist though...
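A rough sketch of the two-file rotation described above, in Python rather than SA's Perl, and purely illustrative. The function names are hypothetical. One assumption to flag loudly: many Unix filesystems do not report a true creation time through stat, so this sketch uses st_birthtime where the platform provides it and falls back to st_mtime, which is refreshed on every write and so would not be a reliable age for an active db. It also does no locking, and the "new, empty" file is a plain zero-byte file, not an initialised db.

```python
import os
import time

SECONDS_PER_DAY = 86400

def file_age_seconds(path, now):
    st = os.stat(path)
    # Assumption: birth time if available, else mtime as a crude substitute.
    started = getattr(st, "st_birthtime", st.st_mtime)
    return now - started

def seen_files(directory):
    """Files a lookup should consult, following the rules in the list above."""
    seen = os.path.join(directory, "bayes_seen")
    old = os.path.join(directory, "bayes_seen_old")
    if not os.path.exists(seen):
        return []  # no bayes_seen: both files are treated as empty
    # A missing bayes_seen_old (eg. post-upgrade) is simply treated as empty.
    return [seen, old] if os.path.exists(old) else [seen]

def rotate_seen(directory, max_age_days=90, now=None):
    """Rotate bayes_seen into bayes_seen_old once it is older than N days."""
    now = time.time() if now is None else now
    seen = os.path.join(directory, "bayes_seen")
    old = os.path.join(directory, "bayes_seen_old")
    if not os.path.exists(seen):
        return False
    if file_age_seconds(seen, now) <= max_age_days * SECONDS_PER_DAY:
        return False
    os.replace(seen, old)      # overwrites any previous bayes_seen_old
    open(seen, "wb").close()   # start over with an empty bayes_seen
    return True
```

With this layout a msgid survives for somewhere between N and 2N days, since the previous generation stays readable until the next rotation.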
Subject: Re: bayes_seen database uncontrolled growth
I'd be inclined to veto anything that doesn't include a solution all around. It seems far too hackish to just throw something for this, especially since we're talking about recommending the SQL solution over the Berkeley DB based bayes.
Subject: Re: bayes_seen database uncontrolled growth
On Tue, May 10, 2005 at 11:21:38PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I'd be inclined to veto anything that doesn't include a solution all
> around. It seems far too hackish to just throw something for this,
> especially since we're talking about recommending the SQL solution
> over the Berkeley DB based bayes.
We need to do something, but a full seen expiry system isn't going to happen for 3.1. I still like the idea of just letting bayes_seen be optional. If people want to trim it, let them delete the file and have it be recreated. IIRC, the only place that's an issue is when going r/o w/ the DB where it requires the file right now.
'I still like the idea of just letting bayes_seen be optional. If people want to trim it, let them delete the file and have it be recreated. IIRC, the only place that's an issue is when going r/o w/ the DB where it requires the file right now.' ok, I can go for that.
ok, fixed; r179482.