Bug 5652 - bayes_seen - auto expire
Summary: bayes_seen - auto expire
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: unspecified
Hardware: All
OS: All
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
Reported: 2007-09-19 15:57 UTC by Dave Koontz
Modified: 2018-08-31 22:45 UTC
CC List: 7 users
Description Dave Koontz 2007-09-19 15:57:33 UTC
bayes_seen db grows without any purge cycle, even if previously learned tokens
have long since been expired from the main bayes db.  Users who are not SA-savvy
often complain of oversized seen db files, at times 250 MB to 4 GB in size.

Request for a new process and variable to control the seen db size... perhaps:

Bayes_Unlearn_Threshold_days

where a user could enter a value for how many days to keep the seen DB tokens,
expiring those older than that threshold.  Perhaps a default value of 7 days
would be in order, as most spam campaigns last a single day at most.  A 30-day
purge should be more than safe for almost anyone, and beats a non-expiry system.
Comment 1 Matt Kettler 2007-09-19 17:19:00 UTC
I agree this is a good idea. Uncontrolled file growth in bayes_seen and AWL both
make SA look rather unfinished. Unfortunately, I'm not a perl jockey, or I'd
write a patch myself.

In general I'd favor an option to essentially make bayes_seen a crude fifo of
sorts, with the depth measured either in days or entries (developer's choice on
which to implement, both would work fine.)

Unfortunately, this feature will involve changing the format of the bayes_seen
database to include a ctime, atime, or entry counter (depending on the choice
of expiry criteria...).

I know this feature carries the risk of relearning the same email, but with a
reasonable depth (i.e., 30 days) this should be completely moot.  It's also even
more moot if you use an atime instead of a ctime or count, as anyone "batch
learning" the same message every day in a cronjob will keep updating the atime
each time they try to learn it.
Comment 2 Theo Van Dinter 2007-09-19 17:23:12 UTC
fwiw, I believe the previous discussion on this topic ranged from:

- since the seen file can just be deleted (or the sql table can just be cleaned
via "delete from"), this isn't a huge issue

to

- it'd be great if we could have a generic set of expiry code such that we could
use it for bayes tokens, bayes seen, awl, etc.  there'd have to be code to
handle upgrading or otherwise handling the non-expiry versioned data as well.
Comment 3 Justin Mason 2007-09-20 02:47:37 UTC
(In reply to comment #1)
> I agree this is a good idea. Uncontrolled file growth in bayes_seen and AWL both
> make SA look rather unfinished. Unfortunately, I'm not a perl jockey, or I'd
> write a patch myself.
> 
> In general I'd favor an option to essentially make bayes_seen a crude fifo of
> sorts, with the depth measured either in days or entries (developer's choice on
> which to implement, both would work fine)

+1

> Unfortunately, this feature will involve changing the format of the bayes_seen
> database to include a ctime, atime, or entry counter (depending on the choice
> of expiry criteria...)

+1

> I know this feature offers the risk of relearning the same email, but with a
> reasonable depth (ie: 30 days) this should be completely moot. It's also even
> more moot if you use an atime instead of a ctime or count, as anyone "batch
> learning" the same message every day in a cronjob will keep updating the atime
> each time they try to learn it.

atime would be overkill I think.

IMO, the reason this hasn't been implemented yet is because we're coming up with
too-complex fixes; token ctime would be just fine, and there's no need to track
msg->token mappings so that tokens can be unlearned.
Comment 4 AXB 2007-09-20 03:43:52 UTC
FWIW: In many environments where it's not really needed, a

use_bayes_seen 0

option could come in very handy.
This would avoid writing bayes_seen altogether.
(space/I/O saver)

Alex
Comment 5 David Lee 2007-11-14 00:43:04 UTC
I have just had to try to tidy up after having found "bayes_seen" files of
nearly three gigabytes tucked away under ".spamassassin" on our campus inbound
mail relays filling up our disks.  (Our other monitoring caught this before 100%.)

I was about to enquire whether we might have an installation problem, but then
found this bug report already open.

So this is in agreement with comment #1 from Matt: this is a residual untidiness
in an otherwise professional product that could do with some sort of addressing.

I can see that there are discussions about how best to code it.

But meanwhile the problem persists.

So in the interim could I suggest an FAQ entry that acknowledges the problem and
gives some sort of workaround/fudge, even if that is as simple (if suboptimal)
as removing "bayes_seen"?  (Can this be safely done without needing to restart
SA, MailScanner, etc.?)  Or perhaps an automatically installed cron job that
removes it if it exceeds a certain size.

(Obviously a proper solution in the next release would be ideal.  But if that is
not possible, then some sort of FAQ/known-issue and/or fudge/workaround.)

Hope that helps.
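That cron-job idea could be sketched roughly as below.  This is only an
illustrative workaround, not an official SA tool; the file path and the
~250 MB cap are assumptions to adjust per installation.  SpamAssassin
recreates bayes_seen on the next learn, at the cost of possibly relearning
some recent messages.

```shell
#!/bin/sh
# Workaround sketch: delete a bayes_seen file once it grows past a
# byte threshold.  Intended to be run periodically from cron.

trim_bayes_seen() {
    # $1 = path to bayes_seen, $2 = maximum size in bytes
    file=$1
    max=$2
    [ -f "$file" ] || return 0          # nothing to do if absent
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max" ]; then
        rm -f "$file"                   # SA recreates it on next learn
    fi
}

# Example invocation; path and ~250 MB cap are assumptions:
trim_bayes_seen "$HOME/.spamassassin/bayes_seen" 262144000
```

Whether spamd needs to be stopped around the removal is discussed further
down in this bug.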
Comment 6 Olivier Mueller 2008-03-10 08:46:32 UTC
So what would be the proper procedure at the moment?  I also have
several servers with the bayes_seen table growing without bound (a few
GB), and would like to add a cron-based cleanup job like I did for awl /
tokens.

The systems are set up to work automatically (no manual sa-learn
operation, so no risk of learning the same messages twice).

So if I understand correctly (I spent an hour browsing archives & FAQs), I
could simply truncate the bayes_seen table every week or so, or add a
timestamp field and remove entries older than 1 week|month|..., and the
system would still work 100% fine?

Thanks for a short confirmation & thanks for your great work,
regards from Switzerland,
Olivier
Comment 7 Matt Kettler 2008-03-10 18:02:02 UTC
> So what would be the proper procedure at the moment?


service spamassassin stop
rm bayes_seen
service spamassassin start

Not sure if the start/stop is required, but that's how I've been doing it since the patch was added to make it deletable.
Comment 8 Olivier Mueller 2008-03-11 01:52:26 UTC
> service spamassassin stop
> rm  bayes_seen
> service spamassassin start

well, that's in the case you're not using SQL :-) 
so I guess the proper way then would be:

stop spamd
TRUNCATE bayes_seen;
start spamd

?  regards, O.
Comment 9 Tom Schulz 2008-03-11 06:13:40 UTC
> TRUNCATE bayes_seen

Well, bayes_seen is not readable text, it seems to be binary data.
How would you know where to slice it?  It may even be a database
of some sort.  If it is possible to chop the oldest data and keep
the newest that would be the best, but I am not sure that it is possible.
Comment 10 Olivier Mueller 2008-03-11 11:44:26 UTC
TRUNCATE in (my)sql = empties a table completely.
( http://dev.mysql.com/doc/refman/5.0/en/truncate.html )

Another solution would be to add a timestamp field and delete entries older than N days/weeks/months: maybe that would be cleaner?
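A minimal sketch of that timestamp approach for a MySQL backend might look
like this (the `ctime` column name and the 30-day window are assumptions,
not part of the official SA schema):

```sql
-- Assumed workaround, not an official schema change: let MySQL
-- stamp each new bayes_seen row automatically on insert.
ALTER TABLE bayes_seen
  ADD COLUMN ctime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;

-- Then, from a weekly cron job, drop entries older than 30 days:
DELETE FROM bayes_seen WHERE ctime < NOW() - INTERVAL 30 DAY;
```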
Comment 11 Paul Graydon 2011-02-25 21:04:38 UTC
Any official updates on this? Plans to work on it? Just sweep it under the rug and forget about it?  Just curious after clearing up yet another obscenely large bayes_seen file.
Comment 12 Karsten Bräckelmann 2011-02-25 21:22:40 UTC
Uh, this is an old bug. No comment at all in 3 years.

(In reply to comment #11)
> Any official updates on this? Plans to work on it? Just sweep it under the rug
> and forget about it?  Just curious after clearing up yet another obscenely
> large bayes_seen file.

Obviously, no updates, otherwise they would have been reflected here.

Also, neither sweeping it under the rug. Otherwise, this bug wouldn't still be open.

This is still a valid bug and could use some dev time.  On the other hand, however, the lack of comments or duplicates in almost 3 years indicates this isn't really an issue of strong concern.
Comment 13 Andrey 2012-05-02 14:55:39 UTC
> On the other hand,
> however, lack of comments or duplicates in almost 3 years indicates, this isn't
> really an issue of strong concern.

The workaround is simple, and the bug takes time to show up, so no comments were added; but the bug is still there and still doing harm.
Comment 14 Tomasz Ostrowski 2014-10-29 08:59:02 UTC
I've created a workaround on my system, which uses a PostgreSQL database for bayes.  It does not need any modifications to SpamAssassin nor additional cron jobs.

I added a column `ctime` to the `bayes_seen` table with default `now()`, and created an insert trigger which runs with probability 1/10000 and deletes rows whose `ctime` is older than a year and a half.

alter table bayes_seen add column ctime timestamptz not null default now();

create or replace function bayes_seen_expire() returns trigger as $$
declare
    run_probability float := 0.0001;
    remove_before timestamptz;
    remove_count integer;
begin
    if random() < run_probability then
        remove_before := current_date - '1.5 year'::interval;

        -- We need to lock the rows in some explicit order,
        -- because there's a small probability that a concurrent
        -- bayes_seen_expire() is running, which could otherwise
        -- cause a deadlock.
        select count(*) into remove_count from (
            select * from bayes_seen
            where ctime < remove_before
            order by id, msgid
            for update
        ) as _;

        if remove_count = 0 then
            return NULL;
        end if;
        delete from bayes_seen where ctime < remove_before;
    end if;
    return NULL;
end;
$$ language plpgsql;

create trigger bayes_seen_expire after insert on bayes_seen
    execute procedure bayes_seen_expire();
Comment 15 Bill Cole 2018-08-31 22:45:20 UTC
Still an issue; it needs significant dev work to add the functionality, but has workarounds.  Pushing the target milestone.