Bug 3082 - Implement History Plugin (which will include expiry) to replace AWL
Summary: Implement History Plugin (which will include expiry) to replace AWL
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P2 blocker
Target Milestone: 3.4.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
: 3263 4731 (view as bug list)
Depends on:
Blocks:
 
Reported: 2004-02-24 19:33 UTC by Daniel Quinlan
Modified: 2015-04-02 15:34 UTC (History)
9 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
AWL aging patch; AWL wildcard support patch None David Birnbaum [NoCLA]
unified patch patch None David Birnbaum [NoCLA]
awl expiry patch patch None Dallas Engelken [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Quinlan 2004-02-24 19:33:07 UTC
The auto-whitelist currently grows without bound.  Some form of expiry is
needed.

The current format of the DB contains the following two keys for each
email address:

  key => value
    user@example.com|ip=10.20 => 16
    user@example.com|ip=10.20|totscore => 19.41

  where
    user@example.com|ip=10.20 is a unique identifier for the user
    the first key/value pair is a message count
    the second key/value pair is the sum total score of those messages

Options to change DB format:

1. add a new key/value pair with a LRU mtime value or counter (atime is easiest)

   user@example.com|ip=10.20|time => 1077678476

   pros: easy, backward compatible
   cons: bloats DB, crufty, slow

2. change the DB format to be:

   user@example.com|ip=10.20 => [encoded field]

   where [encoded field] contains awl-version,atime,count,total (perhaps last N
   message scores instead of count,total)

   pros: clean, fast, extensible
   cons: requires conversion or switch

Options for conversion or switch:

1. Converting old DB to new DB format per-key in same file

   pros: conversion is spread out, no slow conversion or long locks
   cons: DB never shrinks, requires conversion code

2. Converting old DB to new DB format converting entire file at once

   pros: requires conversion code
   cons: DB can shrink, but may carry over more key/value pairs than will
         ultimately be used with atime-based expiry, so DB could end up larger
         than necessary

3. Just delete old DB and create a new DB

   pros: easiest
   cons: user loses AWL data
Comment 1 Daniel Quinlan 2004-02-24 19:38:42 UTC
Few corrections:

atime vs. mtime for the most part, I meant to use one term the whole way through
but somehow didn't.  atime is really what I meant.

I also switched around pro and con in "Converting old DB to new DB format
converting entire file at once".

Also, per-key conversion would be the most complicated as that would be
integrated into the main codebase.  File-level conversion *could* be put
into a separate tool which makes a lot of sense to me.
Comment 2 David Birnbaum 2004-02-24 20:04:55 UTC
Created attachment 1796 [details]
AWL aging patch; AWL wildcard support

Here is a set of patches against version 2.63 with some white list
modifications that I sent out before.

1.  Supports a new function, age_out_awl, which allows for aging of the
white list.  By default, when called, it will delete any unused entries
older than 30 days.  Time can be specified via epoch seconds, or any
string that Date::Manip can parse.

2.  Provide support for quick removal of AWL entries even in a gigantic
white list.  This is done by linking in tandem with the existing white
list another file indicating which keys are in use.

3.  Provide support for wild-card AWL entries when entered from the
command line.

I have been using this now for several days, and haven't run into any
problems yet.  I'm sure there are bugs lurking which I will beat out.
Still outstanding:

1.  I need to write a test for the age function.

2.  When a large white list is aged out, the DB file doesn't get any
smaller.  I don't know any way around this except to delete the existing
list and make a new one with the same name.
Comment 3 Daniel Quinlan 2004-02-25 01:45:20 UTC
Can you attach your patch in unified format ("diff -u")?

Thanks.
Comment 4 David Birnbaum 2004-02-25 06:01:49 UTC
Created attachment 1798 [details]
unified patch

I should point out that I went through some gyrations in this patch to preserve
some of the methods and function of the current DB stuff.  There's clearly some
ways that it could be speeded up that were outside of how the methods are
currently implemented.
Comment 5 Daniel Quinlan 2004-02-27 01:05:16 UTC
Thanks David.  Some comments:

First, of course, we really need patches against the 3.0 tree in SVN.
We may not release a 2.64 and even if we do, it might not include this
and we'd still need 3.0 code regardless.

> 1.  Supports a new function, age_out_awl, which allows for aging of the
> white list.  By default, when called, it will delete any unused entries
> older than 30 days.  Time can be specified via epoch seconds, or any
> string that Date::Manip can parse.

I think we don't want a special call for this.  It should just happen.
At most, we should have a configuration option for how long to keep these,
but I think even that is probably excessive and unwanted.

> 2.  Provide support for quick removal of AWL entries even in a gigantic
> white list.  This is done by linking in tandem with the existing white
> list another file indicating which keys are in use.

Sounds good, I guess.

> 3.  Provide support for wild-card AWL entries when entered from the
> command line.

This really belongs in a different bug.

> When a large white list is aged out, the DB file doesn't get any
> smaller.  I don't know any way around this except to delete the existing
> list and make a new one with the same name.

Existing DB file, you mean.  Yes, I think shrinking is a must given how
large AWL must be by now.  It might be easiest just to nuke it and start
from scratch.  Read my initial two comments.
Comment 6 Daniel Quinlan 2004-02-27 03:32:09 UTC
Here's what I'm thinking.  Implement a generic shrink function for databases.

Basically:

 - lock the DB
 - copy each key/value pair to a new DB from locked DB
 - copy new DB over old DB
 - unlock DB

or

 - lock the DB
 - copy the DB to a temporary file and unlock original
 - copy each key/value pair to a new DB from temporary copy
 - re-lock the DB
 - copy new DB over old DB
 - unlock DB
 - delete temporary copy

Perhaps, a standalone program.  I doubt it needs to know much about how SA
works.  What do y'all think?
Comment 7 Duncan Findlay 2004-02-27 09:12:20 UTC
Subject: Re:  auto-whitelist database needs expiry

> Perhaps, a standalone program.  I doubt it needs to know much about how SA
> works.  What do y'all think?

+1

I suspect most don't need automatic database shrinking, and it's
probably good to keep it separate.
Comment 8 Theo Van Dinter 2004-03-08 08:13:35 UTC
just to note I agree about this:

-rw-------    1 felicity fame     268972032 Feb 22 05:37 auto-whitelist.db

yes, that's 269MB...
Comment 9 Michael Parker 2004-04-13 09:09:32 UTC
*** Bug 3263 has been marked as a duplicate of this bug. ***
Comment 10 Dallas Engelken 2004-04-13 09:30:43 UTC
Created attachment 1896 [details]
awl expiry patch

patches sa-learn.raw, lib/Mail/SpamAssassin/Conf.pm, sql/awl_mysql.sql,
sql/awl_pg.sql, and sql/README.awl
Comment 11 Dallas Engelken 2004-04-13 09:36:59 UTC
did i need to add --expire-awl to spamassassin.raw instead of sa-learn.raw?  i 
just figured sa-learn was a good place for it because AWL is really 
a "learning" tool in itself, and since bayes and awl will both be sql driven, 
having functions for them in the same binary make sense.. eh?

i have fully tested the expiry code on mysql, but not postgresql...  i'm not 
sure how postgresql handles a TIMESTAMP entry... I know in mysql, any 
modifications done to that row will cause the TIMESTAMP to be set to NOW().  

I was reading somewhere that postgresql needs 
t TIMESTAMP default now,
in its table definition, but I'm not 100% positive.

d
Comment 12 Michael Parker 2004-04-13 10:00:49 UTC
Subject: Re:  auto-whitelist database needs expiry

On Tue, Apr 13, 2004 at 09:37:00AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> did i need to add --expire-awl to spamassassin.raw instead of sa-learn.raw?  i 
> just figured sa-learn was a good place for it because AWL is really 
> a "learning" tool in itself, and since bayes and awl will both be sql driven, 
> having functions for them in the same binary make sense.. eh?

Well....just because Bayes and AWL have SQL backends doesn't mean that
they are the same and you can have one file driven and another SQL
driven, or driven by some totally different storage backend.  sa-learn
has nothing to do with AWL so the command should not go there.  I'm
not 100% sure it should go in spamassassin either but can't off the
top of my head think of a better place.

It does however need to be extended to work with all backends, which
is basically the purpose of this bug.  Unfortunately, your solution is
too SQL specific, and I believe too MySQL specific.  We need to back
out one level and set the last update time in AutoWhitelist.pm then
just store that value.  That would allow us to support this in all
backends and multiple SQL implementations.

> i have fully tested the expiry code on mysql, but not postgresql...  i'm not 
> sure how postgresql handles a TIMESTAMP entry... I know in mysql, any 
> modifications done to that row will cause the TIMESTAMP to be set to NOW().  
> 
> I was reading somewhere that postgresql needs 
> t TIMESTAMP default now,
> in its table definition, but I'm not 100% positive.

I'm no PostgreSQL expert, but I don't believe similar functionality is
available out of the box.  You might be able to do a trigger to handle
the update.  However, the point is moot since we need to handle this
at a higher level than the storage layer.

Michael

Comment 13 Justin Mason 2004-04-13 10:38:50 UTC
agreed with Michael, I'm afraid ;)
Comment 14 Justin Mason 2004-04-16 14:29:59 UTC
here's another idea -- use a statically-sized AWL with a Bloom Filter:
http://www.perl.com/lpt/a/2004/04/08/bloom_filters.html .

Only problem would be that, without expiry, the AWL would eventually "fill up"
the Bloom Filter space, resulting in FPs and FNs.
Comment 15 Justin Mason 2004-04-30 14:57:24 UTC
thinking: do we need an atime value?

In Bayes, we can use a token without changing it.
In the AWL, however, it's impossible to "access" an entry without incrementing
the counter.

So we can assume that any value with count==1 can be expired, AFAICS.  that may
drop a few that were added in the short-term, but I'm not sure that's a severe
problem as long as we don't expire too frequently.
Comment 16 Theo Van Dinter 2004-04-30 15:02:41 UTC
Subject: Re:  auto-whitelist database needs expiry

On Fri, Apr 30, 2004 at 02:57:25PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> So we can assume that any value with count==1 can be expired, AFAICS.  that may
> drop a few that were added in the short-term, but I'm not sure that's a severe
> problem as long as we don't expire too frequently.

I think that would be fine.  In theory, the count==1 will be the same
as the LRU-type group anyway in the big picture scheme of things.

Comment 17 Daniel Quinlan 2004-04-30 15:05:17 UTC
The DB won't actually shrink for a single hit message, though

I think overhauling the AWL for 3.0 is a better option -- it is 3.0 after
all!  Let's get atime in there and do it right.

Also, changing to a single key/value pair per user will increase performance.
Comment 18 Michael Parker 2004-04-30 20:40:40 UTC
Subject: Re:  auto-whitelist database needs expiry


Couple of things.

1) I don't think we should toss out count == 1 entries, unless its
   score is 0.  It's data that we shouldn't lose, unless of course
   it's old and needs to be expired.

2) I agree on the key/value pair strategy, similar to how we do the
   bayes data.  This allows for one record per mail/ip combo rather
   than our current two or three when we add atime.


Is someone taking this and working on it?  If not I might try to
tackle it.

Michael

Comment 19 Daniel Quinlan 2004-04-30 21:07:41 UTC
Subject: Re:  auto-whitelist database needs expiry

> Is someone taking this and working on it?  If not I might try to
> tackle it.

It's yours if you want it.

I'm favoring option 2 from my original proposal:

   change the DB format to be:

   10.20|user@example.com => [encoded field]

   where [encoded field] contains awl-version,atime,count,total (perhaps
   last N message scores instead of count,total)

   pros: clean, fast, extensible
   cons: requires conversion or switch

(I dropped the 'ip=' from the key since it's not needed, especially if
you use split(/,/, $str, 2) and put the IP first.)

Allowing us to do versioning seems like a good idea.

Maybe something like this for [encoded field]:

  awl-version = one byte
  atime       = 4 bytes
  count       = 2 bytes
  total       = 3 bytes

with a /= 2 on both count and total if you get too close to the limit.

Comment 20 Justin Mason 2004-05-05 21:31:24 UTC
actually, I think I withdraw my objection to the atime; I can see that when a
user moves from one ISP to another, their old entry may have > 1 hit, but should
still be expired eventually.

Michael, you working on this?  I might...
Comment 21 Daniel Quinlan 2004-05-16 03:05:44 UTC
> Michael, you working on this?  I might...

Assume it's open until the bug is taken.

I'm still in favor of the new format, but I think we should just *not*
worry about bloated DBs that don't shrink because the DB code can't
shrink the file.  Worrying about renaming/moving/copying is going to be
too much trouble -- just let people delete them or maybe create a
separate offline tool.  In SA itself, let's just delete the entries that
have expired and leave it to that.
Comment 22 Theo Van Dinter 2004-05-29 09:45:49 UTC
moving to 3.1 queue
Comment 23 Daniel Quinlan 2004-08-27 16:51:05 UTC
moving performance and accuracy bugs to 3.1.0 milestone
Comment 24 Daniel Quinlan 2004-08-27 17:25:54 UTC
moving performance and accuracy bugs to 3.1.0 milestone
Comment 25 Michael Parker 2005-04-06 14:34:23 UTC
The History plugin will replace AWL and will have expiration included in it's
feature set.
Comment 26 Bob Menschel 2005-04-28 22:35:16 UTC
*** Bug 3981 has been marked as a duplicate of this bug. ***
Comment 27 Justin Mason 2005-05-10 23:11:59 UTC
IMO we need to do something about this for 3.1.0.   there's always at least 1
message per week regarding gigantic memory usage due to massive AWL files...
this is the last major issue with 3.0.0.
Comment 28 Michael Parker 2005-05-10 23:14:08 UTC
Subject: Re:  Implement History Plugin (which will include expiry) to replace AWL

> ------- Additional Comments From jm@jmason.org  2005-05-10 23:11 -------
> IMO we need to do something about this for 3.1.0.   there's always at least 1
> message per week regarding gigantic memory usage due to massive AWL files...
> this is the last major issue with 3.0.0.

The memory issue has been fixed.

Michael
Comment 29 Bart Verwilst 2005-06-13 01:03:43 UTC
Any idea when 3.1 is due? Or where i can follow 3.1 development? Also, will 
this history plugin work for bayesian too? AWL and Bayes both have over 2 
million records in the mysql db i'm using, and still growing...  
Comment 30 Frank Urban 2005-12-13 08:37:30 UTC
*** Bug 4731 has been marked as a duplicate of this bug. ***
Comment 31 Frank Urban 2005-12-13 13:12:39 UTC
Seemed that this feature is still not included in version 3.1. Why?
Comment 32 Theo Van Dinter 2006-09-10 05:53:53 UTC
With no movement in months, moving to the future milestone.  Hopefully one day
something will happen. :|
Comment 33 Kevin A. McGrail 2011-10-28 16:08:37 UTC
I like Dallas' patch.  As 7 years+ have passed with no one stepping forward to patch this for a DB backend, the switch to 3.4.0 is the ideal time to change AWL's table format and recommend a cron job.

If it only benefits mySQL users, it can be extended later.

I myself can't expire my Bayesian tokens except via cron to avoid mid-day spikes. So I use a cron job to clear my old AWL entries.

But I use an implementation based on Kris Deugau where I add a column to my table with an index:

alter table awl add column `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP;

ALTER TABLE `awl` ADD INDEX ( `lastupdate` ) 

Then I use a cron job nightly that runs these queries:

/usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 15 day) and count < 5;'
/usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 30 day) and count < 10;'
/usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 60 day) and count < 20;'
/usr/local/mysql/bin/mysql spamassassin -u spamassassin -premoved -e 'DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 120 day);'


This seems very in line with Dallas' patch.  However, his solution and mine are only test with mysql and ignore DB and postgres.

So at a minimum, these are good additions for the README.awl that could then close this ticket though a new ticket for the wildcard entries could be open.  That's a good idea that has still never been done but left orphaned in this ticket.
Comment 34 Benny Pedersen 2013-01-22 20:32:26 UTC
i begin to thinking about another ticket now, but here i like to ask others of its needed to one more ticket to delay 3.4.x now

basicly i like others to give feed back on the idear of make awl handle dkim/spf black/white scores that could if both dkim/spf fails could give 200 in awl scores alone based on that ?, and the otherway around on both pass could give -200 in awl score

is this intended ?, should awl have a system to ignore auth scores ?
Comment 35 Frank Urban 2013-01-23 15:53:50 UTC
Today we have the most problems with SPAM send over real existing Yahoo, Hotmail etc Freemailer mailboxes. So if you will give them additional -200 because of a valid DKIM/SPF, the problem will raise...
Comment 36 Benny Pedersen 2013-01-23 18:11:34 UTC
random senders is resolved by blacklist sender domains that permit such domains to continue spamming, and then pr good-user@random-sender-domain.tld whitelist to compensate scores if sender is known in recipents addressbooks

this way random senders got blacklist scores no matter what random sender thay use or try to abuse

well i will see how it goes with history pluging when 3.4.x is stable in gentoo
Comment 37 Frank Urban 2013-01-24 06:43:27 UTC
Never have see Hotmail or Yahoo on a blacklist.
and we don't have access to our recipients addressbooks. So our SpamAssasin can't check anythig against them.
Comment 38 Henrik Krohns 2013-01-24 07:02:52 UTC
Please stop polluting this bug with offtopic nonsense. You should know better Benny, please use the list for discussions.
Comment 39 Kevin A. McGrail 2013-06-21 14:52:45 UTC
Moving to 3.4.1
Comment 40 Kevin A. McGrail 2015-04-02 15:34:50 UTC
This is believed overcome by the addition of TXRep