SA Bugzilla – Bug 5652
bayes_seen - auto expire
Last modified: 2018-08-31 22:45:20 UTC
bayes_seen db grows without any purge cycle, even if previously learned tokens have long since expired from the main bayes db. Users who are not SA-savvy often complain of oversized seen db files, at times 250 MB to 4 GB in size. Request for a new process and variable to control the seen db size, perhaps: Bayes_Unlearn_Threshold_days, where a user could enter a value for how many days to keep the seen DB tokens and expire those older than that threshold. Perhaps a DEFAULT value of 7 days would be in order, as most spam campaigns last a single day at most. A 30-day purge should be more than safe for almost anyone and beats a non-expiry system.
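[For illustration only: if the option proposed above were implemented, usage in local.cf might look like the following. The option name and default are only this report's suggestion; no such setting exists in SpamAssassin today.]

```
# HYPOTHETICAL option from this feature request -- not a real SA setting:
# expire bayes_seen entries whose learn time is older than N days
bayes_unlearn_threshold_days 30
```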
I agree this is a good idea. Uncontrolled file growth in bayes_seen and AWL both make SA look rather unfinished. Unfortunately, I'm not a perl jockey, or I'd write a patch myself.

In general I'd favor an option to essentially make bayes_seen a crude FIFO of sorts, with the depth measured either in days or entries (developer's choice on which to implement; both would work fine).

Unfortunately, this feature will involve changing the format of the bayes_seen database to include a ctime, atime, or entry counter (depending on the choice of expiry criteria...).

I know this feature offers the risk of relearning the same email, but with a reasonable depth (i.e. 30 days) this should be completely moot. It's even more moot if you use an atime instead of a ctime or count, as anyone "batch learning" the same message every day in a cronjob will keep updating the atime each time they try to learn it.
FWIW, I believe the previous discussion on this topic ranged from:

- since the seen file can just be deleted (or the SQL table can just be cleaned via "delete from"), this isn't a huge issue

to

- it'd be great if we could have a generic set of expiry code such that we could use it for bayes tokens, bayes seen, AWL, etc. There'd have to be code to handle upgrading or otherwise handling the non-expiry versioned data as well.
(In reply to comment #1)
> I agree this is a good idea. Uncontrolled file growth in bayes_seen and AWL
> both make SA look rather unfinished. Unfortunately, I'm not a perl jockey,
> or I'd write a patch myself.
>
> In general I'd favor an option to essentially make bayes_seen a crude FIFO
> of sorts, with the depth measured either in days or entries (developer's
> choice on which to implement; both would work fine)

+1

> Unfortunately, this feature will involve changing the format of the
> bayes_seen database to include a ctime, atime, or entry counter (depending
> on the choice of expiry criteria...)

+1

> I know this feature offers the risk of relearning the same email, but with
> a reasonable depth (i.e. 30 days) this should be completely moot. It's also
> even more moot if you use an atime instead of a ctime or count, as anyone
> "batch learning" the same message every day in a cronjob will keep updating
> the atime each time they try to learn it.

atime would be overkill, I think. IMO, the reason this hasn't been implemented yet is that we keep coming up with too-complex fixes; token ctime would be just fine, and there's no need to track msg->token mappings so that tokens can be unlearned.
FWIW: In many environments where it's not really needed, a use_bayes_seen 0 option could come in very handy. This would avoid writing bayes_seen altogether (space/I/O saver). Alex
I have just had to tidy up after finding "bayes_seen" files of nearly three gigabytes tucked away under ".spamassassin" on our campus inbound mail relays, filling up our disks. (Our other monitoring caught this before 100%.) I was about to enquire whether we might have an installation problem, but then found this bug report already open. So this is an agreement with #1 from Matt that this is a residual untidiness in an otherwise professional product that could do with some sort of addressing.

I can see that there are discussions about how best to code it, but meanwhile the problem persists. So in the interim could I suggest an FAQ that acknowledges the problem and gives some sort of workaround/fudge, even if that is as simple (if suboptimal) as removing "bayes_seen". (Can this be safely done without needing to restart SA, MailScanner, etc.?) Or perhaps an automatically installed cron job that removes it if it exceeds a certain size.

(Obviously a proper solution in the next release would be ideal. But if that is not possible, then some sort of FAQ/known-issue and/or fudge/workaround.) Hope that helps.
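[The size-triggered cron job suggested above could be sketched roughly as follows for the Berkeley DB backend. This is only a sketch: the directory path and the size cap are assumptions to adjust per installation, and note that deleting bayes_seen discards the record of already-learned messages, so recently trained mail could be relearned.]

```shell
#!/bin/sh
# Sketch: remove bayes_seen when it exceeds a size cap.
# Path and cap below are assumptions, not SA defaults.

purge_bayes_seen() {
    # $1 = directory containing bayes_seen, $2 = max size in bytes
    seen="$1/bayes_seen"
    [ -f "$seen" ] || return 0
    size=$(wc -c < "$seen")
    if [ "$((size + 0))" -gt "$2" ]; then
        # bayes_seen is safe to delete; SA recreates it on the next learn
        rm -f "$seen"
    fi
}

# Example cron usage (daily, hypothetical path, 500 MB cap):
# purge_bayes_seen /var/spool/spamassassin/.spamassassin $((500*1024*1024))
```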
So what would be the proper procedure at the moment? I also have several servers with the bayes_seen table growing without bound (a few GB already), and would like to add a cron-based cleanup job like I did for awl / tokens. The systems are set up to work automatically (no manual sa-learn operation, so no risk of learning the same messages twice). So if I understand correctly (I spent an hour browsing archives & FAQs), I could simply truncate the bayes_seen table every week or so, or add a timestamp field and remove entries older than 1 week|month|... and the system would still work 100% fine? Thanks for a short confirmation & thanks for your great work, regards from Switzerland, Olivier
> So what would be the proper procedure at the moment?

service spamassassin stop
rm bayes_seen
service spamassassin start

Not sure if the start/stop is required, but that's how I've been doing it since the patch was added to make it deletable.
> service spamassassin stop
> rm bayes_seen
> service spamassassin start

Well, that's in the case you're not using SQL :-) So I guess the proper way then would be:

stop spamd
TRUNCATE bayes_seen;
start spamd

? regards, O.
> TRUNCATE bayes_seen

Well, bayes_seen is not readable text; it seems to be binary data. How would you know where to slice it? It may even be a database of some sort. If it were possible to chop the oldest data and keep the newest, that would be the best, but I am not sure that it is possible.
TRUNCATE in (My)SQL = empties a table completely. ( http://dev.mysql.com/doc/refman/5.0/en/truncate.html ) Another solution would be to add a timestamp field and delete entries older than N days/weeks/months: maybe that would be cleaner?
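[The timestamp-field idea above could look roughly like this for a MySQL bayes backend. This is only a sketch: the `ctime` column is not part of the stock bayes_seen schema (the ALTER is a one-time manual change), and KEEP_DAYS and the database name are placeholder assumptions.]

```shell
#!/bin/sh
# Sketch of the suggested timestamp-based cleanup for a MySQL bayes
# backend. KEEP_DAYS and the database name below are assumptions.
KEEP_DAYS=${KEEP_DAYS:-30}

# One-time schema change: add a creation timestamp to bayes_seen.
schema_sql() {
    echo "ALTER TABLE bayes_seen ADD COLUMN ctime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;"
}

# Recurring purge: drop entries older than KEEP_DAYS days.
purge_sql() {
    echo "DELETE FROM bayes_seen WHERE ctime < NOW() - INTERVAL $KEEP_DAYS DAY;"
}

# Example cron usage (run daily; 'sa_bayes' is a placeholder db name):
# purge_sql | mysql sa_bayes
```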
Any official updates on this? Plans to work on it? Just sweep it under the rug and forget about it? Just curious after clearing up yet another obscenely large bayes_seen file.
Uh, this is an old bug. No comment at all in 3 years.

(In reply to comment #11)
> Any official updates on this? Plans to work on it? Just sweep it under the
> rug and forget about it? Just curious after clearing up yet another
> obscenely large bayes_seen file.

Obviously no updates, otherwise they would have been reflected here. Nor is it being swept under the rug; otherwise this bug wouldn't still be open. This still is a valid bug, and could use some dev time. On the other hand, however, the lack of comments or duplicates in almost 3 years indicates this isn't really an issue of strong concern.
> On the other hand, however, the lack of comments or duplicates in almost 3
> years indicates this isn't really an issue of strong concern.

The workaround is simple, and the problem takes time to appear, so no comments get added; but the bug is still there and still doing harm.
I've created a workaround in my system, which uses a PostgreSQL database for bayes. It does not need any modifications in SpamAssassin nor additional cron jobs. I add an additional column `ctime` to the `bayes_seen` table with default `now()`, and create a trigger on insert which runs with probability 1/10000 and deletes rows with `ctime` older than a year and a half.

alter table bayes_seen
    add column ctime timestamptz not null default now();

create or replace function bayes_seen_expire() returns trigger as $$
declare
    run_probability float := 0.0001;
    remove_before timestamptz;
    remove_count integer;
begin
    if random() < run_probability then
        remove_before := current_date - '1.5 year'::interval;
        -- We need to lock rows in some explicit order
        -- because there's a small probability that a
        -- concurrent bayes_seen_expire() is running,
        -- which can cause a deadlock
        select count(*) into remove_count from (
            select * from bayes_seen
            where ctime < remove_before
            order by id, msgid
            for update
        ) as _;
        if remove_count = 0 then
            return NULL;
        end if;
        delete from bayes_seen where ctime < remove_before;
    end if;
    return NULL;
end;
$$ language plpgsql;

create trigger bayes_seen_expire
    after insert on bayes_seen
    execute procedure bayes_seen_expire();
Still an issue, needs significant dev work to add functionality, but has workarounds. Pushing target.