SA Bugzilla – Bug 3082
Implement History Plugin (which will include expiry) to replace AWL
Last modified: 2015-04-02 15:34:50 UTC
The auto-whitelist currently grows without bound. Some form of expiry is needed. The current format of the DB contains the following two keys for each email address: key => value user@example.com|ip=10.20 => 16 user@example.com|ip=10.20|totscore => 19.41 where user@example.com|ip=10.20 is a unique identifier for the user the first key/value pair is a message count the second key/value pair is the sum total score of those messages Options to change DB format: 1. add a new key/value pair with a LRU mtime value or counter (atime is easiest) user@example.com|ip=10.20|time => 1077678476 pros: easy, backward compatible cons: bloats DB, crufty, slow 2. change the DB format to be: user@example.com|ip=10.20 => [encoded field] where [encoded field] contains awl-version,atime,count,total (perhaps last N message scores instead of count,total) pros: clean, fast, extensible cons: requires conversion or switch Options for conversion or switch: 1. Converting old DB to new DB format per-key in same file pros: conversion is spread out, no slow conversion or long locks cons: DB never shrinks, requires conversion code 2. Converting old DB to new DB format converting entire file at once pros: requires conversion code cons: DB can shrink, but may carry over more key/value pairs than will ultimately be used with atime-based expiry, so DB could end up larger than necessary 3. Just delete old DB and create a new DB pros: easiest cons: user loses AWL data
Few corrections: atime vs. mtime for the most part, I meant to use one term the whole way through but somehow didn't. atime is really what I meant. I also switched around pro and con in "Converting old DB to new DB format converting entire file at once". Also, per-key conversion would be the most complicated as that would be integrated into the main codebase. File-level conversion *could* be put into a separate tool which makes a lot of sense to me.
Created attachment 1796 [details] AWL aging patch; AWL wildcard support Here is a set of patches against version 2.63 with some white list modifications that I sent out before. 1. Supports a new function, age_out_awl, which allows for aging of the white list. By default, when called, it will delete any unused entries older than 30 days. Time can be specified via epoch seconds, or any string that Date::Manip can parse. 2. Provide support for quick removal of AWL entries even in a gigantic white list. This is done by linking in tandem with the existing white list another file indicating which keys are in use. 3. Provide support for wild-card AWL entries when entered from the command line. I have been using this now for several days, and haven't run into any problems yet. I'm sure there are bugs lurking which I will beat out. Still outstanding: 1. I need to write a test for the age function. 2. When a large white list is aged out, the DB file doesn't get any smaller. I don't know any way around this except to delete the existing list and make a new one with the same name.
Can you attach your patch in unified format ("diff -u")? Thanks.
Created attachment 1798 [details] unified patch I should point out that I went through some gyrations in this patch to preserve some of the methods and function of the current DB stuff. There's clearly some ways that it could be speeded up that were outside of how the methods are currently implemented.
Thanks David. Some comments: First, of course, we really need patches against the 3.0 tree in SVN. We may not release a 2.64 and even if we do, it might not include this and we'd still need 3.0 code regardless. > 1. Supports a new function, age_out_awl, which allows for aging of the > white list. By default, when called, it will delete any unused entries > older than 30 days. Time can be specified via epoch seconds, or any > string that Date::Manip can parse. I think we don't want a special call for this. It should just happen. At most, we should have a configuration option for how long to keep these, but I think even that is probably excessive and unwanted. > 2. Provide support for quick removal of AWL entries even in a gigantic > white list. This is done by linking in tandem with the existing white > list another file indicating which keys are in use. Sounds good, I guess. > 3. Provide support for wild-card AWL entries when entered from the > command line. This really belongs in a different bug. > When a large white list is aged out, the DB file doesn't get any > smaller. I don't know any way around this except to delete the existing > list and make a new one with the same name. Existing DB file, you mean. Yes, I think shrinking is a must given how large AWL must be by now. It might be easiest just to nuke it and start from scratch. Read my initial two comments.
Here's what I'm thinking. Implement a generic shrink function for databases. Basically: - lock the DB - copy each key/value pair to a new DB from locked DB - copy new DB over old DB - unlock DB or - lock the DB - copy the DB to a temporary file and unlock original - copy each key/value pair to a new DB from temporary copy - re-lock the DB - copy new DB over old DB - unlock DB - delete temporary copy Perhaps, a standalone program. I doubt it needs to know much about how SA works. What do y'all think?
Subject: Re: auto-whitelist database needs expiry > Perhaps, a standalone program. I doubt it needs to know much about how SA > works. What do y'all think? +1 I suspect most don't need automatic database shrinking, and it's probably good to keep it separate.
just to note I agree about this: -rw------- 1 felicity fame 268972032 Feb 22 05:37 auto-whitelist.db yes, that's 269MB...
*** Bug 3263 has been marked as a duplicate of this bug. ***
Created attachment 1896 [details] awl expiry patch patches sa-learn.raw, lib/Mail/SpamAssassin/Conf.pm, sql/awl_mysql.sql, sql/awl_pg.sql, and sql/README.awl
did i need to add --expire-awl to spamassassin.raw instead of sa-learn.raw? i just figured sa-learn was a good place for it because AWL is really a "learning" tool in itself, and since bayes and awl will both be sql driven, having functions for them in the same binary make sense.. eh? i have fully tested the expiry code on mysql, but not postgresql... i'm not sure how postgresql handles a TIMESTAMP entry... I know in mysql, any modifications done to that row will cause the TIMESTAMP to be set to NOW(). I was reading somewhere that postgresql needs t TIMESTAMP default now, in its table definition, but I'm not 100% positive. d
Subject: Re: auto-whitelist database needs expiry On Tue, Apr 13, 2004 at 09:37:00AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > did i need to add --expire-awl to spamassassin.raw instead of sa-learn.raw? i > just figured sa-learn was a good place for it because AWL is really > a "learning" tool in itself, and since bayes and awl will both be sql driven, > having functions for them in the same binary make sense.. eh? Well....just because Bayes and AWL have SQL backends doesn't mean that they are the same and you can have one file driven and another SQL driven, or driven by some totally different storage backend. sa-learn has nothing to do with AWL so the command should not go there. I'm not 100% sure it should go in spamassassin either but can't off the top of my head think of a better place. It does however need to be extended to work with all backends, which is basically the purpose of this bug. Unfortunately, your solution is too SQL specific, and I believe too MySQL specific. We need to back out one level and set the last update time in AutoWhitelist.pm then just store that value. That would allow us to support this in all backends and multiple SQL implementations. > i have fully tested the expiry code on mysql, but not postgresql... i'm not > sure how postgresql handles a TIMESTAMP entry... I know in mysql, any > modifications done to that row will cause the TIMESTAMP to be set to NOW(). > > I was reading somewhere that postgresql needs > t TIMESTAMP default now, > in its table definition, but I'm not 100% positive. I'm no PostgreSQL expert, but I don't believe similar functionality is available out of the box. You might be able to do a trigger to handle the update. However, the point is moot since we need to handle this at a higher level than the storage layer. Michael
agreed with Michael, I'm afraid ;)
here's another idea -- use a statically-sized AWL with a Bloom Filter: http://www.perl.com/lpt/a/2004/04/08/bloom_filters.html . Only problem would be that, without expiry, the AWL would eventually "fill up" the Bloom Filter space, resulting in FPs and FNs.
thinking: do we need an atime value? In Bayes, we can use a token without changing it. In the AWL, however, it's impossible to "access" an entry without incrementing the counter. So we can assume that any value with count==1 can be expired, AFAICS. that may drop a few that were added in the short-term, but I'm not sure that's a severe problem as long as we don't expire too frequently.
Subject: Re: auto-whitelist database needs expiry On Fri, Apr 30, 2004 at 02:57:25PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > So we can assume that any value with count==1 can be expired, AFAICS. that may > drop a few that were added in the short-term, but I'm not sure that's a severe > problem as long as we don't expire too frequently. I think that would be fine. In theory, the count==1 will be the same as the LRU-type group anyway in the big picture scheme of things.
The DB won't actually shrink for a single hit message, though I think overhauling the AWL for 3.0 is a better option -- it is 3.0 after all! Let's get atime in there and do it right. Also, changing to a single key/value pair per user will increase performance.
Subject: Re: auto-whitelist database needs expiry Couple of things. 1) I don't think we should toss out count == 1 entries, unless its score is 0. It's data that we shouldn't lose, unless of course it's old and needs to be expired. 2) I agree on the key/value pair strategy, similar to how we do the bayes data. This allows for one record per mail/ip combo rather than our current two or three when we add atime. Is someone taking this and working on it? If not I might try to tackle it. Michael
Subject: Re: auto-whitelist database needs expiry > Is someone taking this and working on it? If not I might try to > tackle it. It's yours if you want it. I'm favoring option 2 from my original proposal: change the DB format to be: 10.20|user@example.com => [encoded field] where [encoded field] contains awl-version,atime,count,total (perhaps last N message scores instead of count,total) pros: clean, fast, extensible cons: requires conversion or switch (I dropped the 'ip=' from the key since it's not needed, especially if you use split(/,/, $str, 2) and put the IP first.) Allowing us to do versioning seems like a good idea. Maybe something like this for [encoded field]: awl-version = one byte atime = 4 bytes count = 2 bytes total = 3 bytes with a /= 2 on both count and total if you get too close to the limit.
actually, I think I withdraw my objection to the atime; I can see that when a user moves from one ISP to another, their old entry may have > 1 hit, but should still be expired eventually. Michael, you working on this? I might...
> Michael, you working on this? I might... Assume it's open until the bug is taken. I'm still in favor of the new format, but I think we should just *not* worry about bloated DBs that don't shrink because the DB code can't shrink the file. Worrying about renaming/moving/copying is going to be too much trouble -- just let people delete them or maybe create a separate offline tool. In SA itself, let's just delete the entries that have expired and leave it to that.
moving to 3.1 queue
moving performance and accuracy bugs to 3.1.0 milestone
The History plugin will replace AWL and will have expiration included in it's feature set.
*** Bug 3981 has been marked as a duplicate of this bug. ***
IMO we need to do something about this for 3.1.0. there's always at least 1 message per week regarding gigantic memory usage due to massive AWL files... this is the last major issue with 3.0.0.
Subject: Re: Implement History Plugin (which will include expiry) to replace AWL > ------- Additional Comments From jm@jmason.org 2005-05-10 23:11 ------- > IMO we need to do something about this for 3.1.0. there's always at least 1 > message per week regarding gigantic memory usage due to massive AWL files... > this is the last major issue with 3.0.0. The memory issue has been fixed. Michael
Any idea when 3.1 is due? Or where i can follow 3.1 development? Also, will this history plugin work for bayesian too? AWL and Bayes both have over 2 million records in the mysql db i'm using, and still growing...
*** Bug 4731 has been marked as a duplicate of this bug. ***
Seemed that this feature is still not included in version 3.1. Why?
With no movement in months, moving to the future milestone. Hopefully one day something will happen. :|
I like Dallas' patch. As 7 years+ have passed with no one stepping forward to patch this for a DB backend, the switch to 3.4.0 is the ideal time to change AWL's table format and recommend a cron job. If it only benefits mySQL users, it can be extended later. I myself can't expire my Bayesian tokens except via cron to avoid mid-day spikes. So I use a cron job to clear my old AWL entries. But I use an implementation based on Kris Deugau where I add a column to my table with an index: alter table awl add column `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP; ALTER TABLE `awl` ADD INDEX ( `lastupdate` ) Then I use a cron job nightly that runs these queries: /usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 15 day) and count < 5;' /usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 30 day) and count < 10;' /usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 60 day) and count < 20;' /usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 120 day);' This seems very in line with Dallas' patch. However, his solution and mine are only test with mysql and ignore DB and postgres. So at a minimum, these are good additions for the README.awl that could then close this ticket though a new ticket for the wildcard entries could be open. That's a good idea that has still never been done but left orphaned in this ticket.
i begin to thinking about another ticket now, but here i like to ask others of its needed to one more ticket to delay 3.4.x now basicly i like others to give feed back on the idear of make awl handle dkim/spf black/white scores that could if both dkim/spf fails could give 200 in awl scores alone based on that ?, and the otherway around on both pass could give -200 in awl score is this intended ?, should awl have a system to ignore auth scores ?
Today we have the most problems with SPAM send over real existing Yahoo, Hotmail etc Freemailer mailboxes. So if you will give them additional -200 because of a valid DKIM/SPF, the problem will raise...
random senders is resolved by blacklist sender domains that permit such domains to continue spamming, and then pr good-user@random-sender-domain.tld whitelist to compensate scores if sender is known in recipents addressbooks this way random senders got blacklist scores no matter what random sender thay use or try to abuse well i will see how it goes with history pluging when 3.4.x is stable in gentoo
Never have see Hotmail or Yahoo on a blacklist. and we don't have access to our recipients addressbooks. So our SpamAssasin can't check anythig against them.
Please stop polluting this bug with offtopic nonsense. You should know better Benny, please use the list for discussions.
Moving to 3.4.1
This is believed overcome by the addition of TXRep