Bug 3331 - [review] Bayes option to keep original token as db data (not key).
Summary: [review] Bayes option to keep original token as db data (not key).
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other
OS: other
Importance: P5 normal
Target Milestone: 3.0.0
Assignee: Michael Parker
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-04-29 07:12 UTC by David Koppelman
Modified: 2004-08-15 14:30 UTC
CC List: 1 user



Attachment                   Type                      Status  Submitter
Proposed implementation.     patch                     None    David Koppelman [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Data Collection Spreadsheet  application/octet-stream  None    Michael Parker [HasCLA]
Data Collection Spreadsheet  application/octet-stream  None    Michael Parker [HasCLA]
Data Collection Spreadsheet  application/octet-stream  None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]
Patch File                   patch                     None    Michael Parker [HasCLA]

Description David Koppelman 2004-04-29 07:12:28 UTC
With r10394 (bug 3225) the Bayes database no longer retains the
original token; instead it uses a hashed version as a key.  As noted
elsewhere there are many good reasons to do this, such as compactness,
efficiency, and privacy.  However, one can no longer get the original
token from the database, and so those tuning Bayesian classification
can no longer get token statistics such as the hammiest or spammiest
tokens.

For that reason we might implement Sidney Markowitz's suggestion [bug
2266 comment 14] that as an option the original tokens be retained.
Michael Parker pointed out the maintenance difficulties of having to
support a database which could be keyed either on a hash or the full
token [bug 2266 comment 15].  Instead, the unhashed token might be
stored as data while still using the hashed token as a key.  If the
original tokens are not to be stored, then a NULL or zero-length string
would be stored; otherwise, the original token.  This way there is
just one database format.
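
A minimal sketch of that single-format record (Perl; the pack layout
and variable names here are hypothetical, not taken from DBM.pm):

    use Digest::SHA1 qw(sha1);

    my %db;                          # stand-in for the tied DBM hash
    my ($keep_original, $token) = (1, 'a.html');
    my ($num_spam, $num_ham, $atime) = (3, 0, time);

    # key: hashed token; value: fixed-width counters plus the optional
    # original token -- empty when the option is off, so one format.
    my $orig = $keep_original ? $token : '';
    $db{ sha1($token) } = pack("VVV", $num_spam, $num_ham, $atime) . $orig;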

I'm assuming that a string field holding only NULLs will have little
performance impact in dbfile and the SQL implementations, so that
users not keeping original tokens will be unaffected.

If there are no comments on this after a few days I'll cobble together
a patch for the DBM code.
Comment 1 Michael Parker 2004-04-29 07:30:31 UTC
Subject: Re:  New: Bayes option to keep original token as db data (not key).

On Thu, Apr 29, 2004 at 07:12:30AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> efficiency, and privacy.  However one can no longer get the original
> token from the database and so those tuning Bayesian classification
> can no longer get token statistics such as the hammiest or spammiest
> tokens.

I guess I fail to see how this is useful.  You can now get the
significant tokens in the headers (Bug 2266).  Are you hand-tweaking
your bayes database?

Comment 2 David Koppelman 2004-04-29 07:42:18 UTC
This is not for statistics on an individual message, but for the
entire database.

For example, the hammiest tokens in my database come from the address
and fax number where I work since they are in the sigs of a lot of my
email.  Amusingly, one of the spammiest tokens is "a.html".

This information might be useful in coming up with improved tokenizing
or token weighting schemes.  It's also a good way to spot some bugs,
such as atime problems.

And no, I'm not hand-tweaking the database, that would be way too
labor intensive to be sustainable!
Comment 3 Michael Parker 2004-04-29 09:02:11 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

I'm -0 on doing this.  It was debated for a long time prior to the
change going in.  It's a win-win all around as far as performance goes.

I struggled with making the plain text tokens go away but the
performance gain outweighed my desire to see them in their plain text
form.

Since I rarely find myself needing to look at the actual tokens and
their values to tweak I'll defer to others.

Michael

Comment 4 David Koppelman 2004-04-29 09:16:47 UTC
Just as a reminder, the change I'm suggesting would not require a
separate database format and might not have an impact on performance
for those not storing the original tokens. See the opening comment
(comment 0?).
Comment 5 Michael Parker 2004-04-29 09:25:23 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

On Thu, Apr 29, 2004 at 09:16:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Just as a reminder, the change I'm suggesting would not require a
> separate database format and might not have an impact on performance
> for those not storing the original tokens. See the opening comment
> (comment 0?).

It has all sorts of implications, not just database format.  And yes,
it would need to be a config option.

1) How do you handle backup/restore?

2) Where are you going to put this data?  In a separate db file/table
   or alongside the current data?  Be careful, your answer may require
   a ton more work.

3) Maintenance issues.

4) How do you handle the case where someone has this option turned
   off, but then turns it on?

Patches welcome of course.

Michael

Comment 6 David Koppelman 2004-04-29 09:58:18 UTC
Michael Parker wrote:

> It has all sorts of implications, not just database format.  And yes,
> it would need to be a config option.

Default off, of course.

> 1) How do you handle backup/restore?

I don't see the issue, other than having to add code to write or read
the original token.

> 
> 2) Where are you going to put this data?  In a seperate db file/table
>    or alongside the current data?  Be careful, your answer may require
>    a ton more work.

I'd put the original token in the same record used to hold the other
information (num ham, num spam, atime).  I'm assuming db_file and SQL
can handle a variable-length string field and that their
implementations do so in a reasonable manner. (That is, there would be
no big performance difference between a table without variable-length
string fields and one with a string field that only stores nulls.)

For a new token an original token (OT) field would be initialized to
the token or to a null, depending on the option.  When reading a token
the OT field would be ignored if the OT option were off.

The OT data would only be used for statistics, so its absence would
not affect normal operation.

Turning the option on and off should present no problems because there
is no change in database format.  There are several ways of dealing
with the stored tokens when the option is turned off.  One is to
immediately delete all the tokens. (Time consuming, especially for
someone who turned OT off by accident.)  Instead of immediately
deleting the tokens a separate "wipe tokens" command could be
provided.  I'm assuming the OT option would only be used by those who
know what they're doing, and so the slight complication of a separate
"wipe tokens" command is acceptable.

> 
> 3) Maintenance issues.

That's why the original token is a data field, and not a key.

> 4) How do you handle the case where someone has this option turned
>    off, but then turns it on?

See above.

> 
> Patched welcome of course.

I don't have SQL installed so there is no way I could test any changes
for that back end.  If no one else starts working on the changes in a
few days I'll make them for dbfile (DBM).
Comment 7 Michael Parker 2004-04-29 10:09:54 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

On Thu, Apr 29, 2004 at 09:58:19AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> > 1) How do you handle backup/restore?
> 
> I don't see the issue, other than having to add code to write or read
> the original token.
> 

So when you run backup do you get the hashed token or the original
token?  What if not all original tokens are available?

> > 
> > 2) Where are you going to put this data?  In a separate db file/table
> >    or alongside the current data?  Be careful, your answer may require
> >    a ton more work.
> 
> I'd put the original token in the same record used to hold the other
> information (num ham, num spam, atime).  I'm assuming db_file and SQL
> can handle a variable-length string field and that their
> implementations do so in a reasonable manner. (That is, there would be
> no big performance difference between a table without variable-length
> string fields and one with a string field that only stores nulls.)

You just changed the database format.  It's easy enough to just add
onto a value in DBM, but since we pack several items into a single
value it does require a database format change.  SQL does not use
variable-length columns for data storage anymore, so you would have
to add a column to the bayes_token table to place this value in,
which would change the database format.

> > 
> > Patches welcome of course.
> 
> I don't have SQL installed so there is no way I could test any changes
> for that back end.  If no one else starts working on the changes in a
> few days I'll make them for dbfile (DBM).
> 

Like I said, I welcome any patches, I think you'll find the storage
code slightly more complicated once you start digging into it.

After thinking on this a little I have a possible proposal that
involves a separate db_file/table but I haven't worked everything out
yet.  I'll think on it some more and reply back to see if it interests
folks.

Comment 8 David Koppelman 2004-04-29 10:26:01 UTC
Michael Parker wrote:

> So when you run backup do you get the hashed token or the original
> token?  What if not all original tokens are available?

Always the hashed token, plus the original token if it's available.
The absence of the OT is no problem, just store a NULL.

> > > 2) Where are you going to put this data?  [snip]
> > 
> > I'd put the original token in the same record used to hold the other
> > information (num ham, num spam, atime).  [snip]
> 
> You just changed the database format.  [snip]

Yes, in that sense I did.  What I meant by no new database format was
that it would be the same format with the OT option on or off.

> After thinking on this a little I have a possible proposal that
> involves a separate db_file/table but I haven't worked everything out
> yet.  I'll think on it some more and reply back to see if it interests
> folks

I'd appreciate it if Michael or anyone else working on this problem
would post here before starting.
Comment 9 Theo Van Dinter 2004-04-29 10:27:06 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

On Thu, Apr 29, 2004 at 10:09:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> After thinking on this a little I have a possible proposal that
> involves a seperate db_file/table but I haven't worked everything out
> yet.  I'll think on it some more and reply back to see if it interests
> folks.

Just to put in my two cents...

I too will miss being able to tell which actual tokens are which, but if
it gives a performance gain, fine.  That information is largely, well,
informational.  What matters for scoring is that we see "block of data",
not how we choose to represent said block.

In general, we do _NOT_ want to add unnecessary complexity to this
system.  It's bad for performance, and it's horrid for maintenance.

Adding the original token to the same DB is pointless; you'll eliminate
all the performance benefits, at least in DBM.

Adding another DB is doable, but fairly complex.  Expiry is going to be
even more complex than it is now, and take more time.  Backup/Restore
would just need to dump/restore the extra DB.  The code would have to
always access the new DB so that the original tokens can be accessed for
dump and header display.  Oh, and internally, we'd need to track both the
hashed and the original.  Oh, and the journal will have to be modified
to have a "hash-to-text" line type.  Not to mention issues with doing
multi-word tokens.  There's very likely more, but this scares me enough.

In short, it adds much more complexity than I'm comfortable with.
So unless there's a compelling reason, I'm -1 on the idea of keeping
hash<->text data around.

Comment 10 Sidney Markowitz 2004-04-29 10:34:24 UTC
I like the general idea of having the actual tokens available somehow for
analysis. I'm concerned about possible effects on performance if they are in the
main database. Michael's thoughts about getting them in a separate optional
table may be the way to go.

An important part of the optimizations was getting the records to be all fixed
length. A quote from the MySQL manual section on optimization:

"For MyISAM tables that change a lot, you should try to avoid all
variable-length columns (VARCHAR, BLOB, and TEXT). The table will use dynamic
record format if it includes a single variable-length column."

Note that such a table of tokens would contain information that is more
privacy-sensitive than the Bayes database that has only hashes. That might
affect sharing of database information.

That said, I would like to see a suggestion that dealt with all the issues in a
sufficiently compelling way to get Theo to change his mind about his -1. I just
don't know what all would be in such a suggestion.

There may be something coming out of the idea that all we are interested in
having is the ability to usually look up the original text given a hash when
doing some analysis, but the mapping does not have to be maintained, expired,
etc., to the same degree of precision and reliability as the rest of the database.
Comment 11 Justin Mason 2004-04-29 11:38:22 UTC
'all we are interested in
having is the ability to usually look up the original text given a hash when
doing some analysis, but the mapping does not have to be maintained, expired,
etc., to the same degree of precision and reliability as the rest of the
database.'

agreed.

BTW, I know Dan was expressing some concerns last night about exposing too much
of the bayes data, saying that even with that info, it's not much use for people
to do stuff with -- sure, you know that token "foo.html" is a good spam sign,
but in what way does that help, given that it's not really possible to actually
edit the bayes db token-by-token?

This is a good point and needs some discussion.
Comment 12 Justin Mason 2004-04-29 11:40:45 UTC
Oh, BTW -- I do think, however, that a degree of visibility into the bayes
calculation, and the tokens it uses, and even the tokens in the db, is very
valuable.

It's part of the SpamAssassin "philosophy" to be open about how the
classification was decided -- that's why we list the names of the rules that hit
in the X-Spam-Status header and in the report, and that's why we have rule
descriptions, readable rules files, and so on.
Comment 13 David Koppelman 2004-04-29 14:02:08 UTC
Justin Mason wrote:

> BTW, I know Dan was expressing some concerns last night about
> exposing too much of the bayes data, saying that even with that
> info, it's not much use for people to do stuff with -- sure, you
> know that token "foo.html" is a good spam sign, but in what way does
> that help, given that it's not really possible to actually edit the
> bayes db token-by-token?

For the end user, providing db statistics showing the original tokens
would serve the same function as showing individual token scores for
a message: it would help the user understand what SpamAssassin is
doing, give the user reasonable expectations for what it can and
cannot do, and let the user know why he or she should take more care
over which messages are learned.  BTW, I myself consider this a
weak reason for keeping the original tokens given that the tokens can
already be shown in the message.

The much more important reason for keeping the original tokens is for
a source of ideas on how to make future improvements.  For example,
one improvement that has been suggested (and used elsewhere) is to
tokenize word pairs.  Examining the database might give hints about
whether too many rare word pairs are being retained.  Another question
that can be examined is how much random character strings (used in
some spam) inflate the database.  One can have a look.  As for
"a.html", perhaps tokenizing markup would help.
Comment 14 David Koppelman 2004-05-02 09:35:50 UTC
I've worked up a patch that allows the original text (which is called
token text or ttext in the code) to be optionally kept in the
database; this only works for DBM.  The default value is off.

The new keep_token_text option can be turned on or off at any time.
While off, token text will be deleted as tokens are encountered (rather
than wiping the whole database); while on, token text will be added as
tokens are encountered, including for touches.

I've modified the SQL backend to return empty strings for token
text, BUT I HAVE NOT TESTED THIS CODE.

This does not require a new format.  I was considering using the
format field (see FORMAT_FLAG near the bottom of DBM.pm) to indicate
the presence of ttext in a packed token, but decided against it since
that would preclude an additional compact token format (the format
field is two bits).  The code relies on DB_File to provide a packed
string of the correct length.  If it's longer than it should be, the
extra bytes will be interpreted as token text (but this will only
impact parts that display token text retrieved from the database, such
as dump and backup).
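
To illustrate the length trick described above (a sketch only; the
pack layout and field widths are assumptions, not the patch itself):

    my $value = pack("VVV", 3, 0, time) . 'a.html';  # sample packed record

    # unpack the fixed-width counters first; any bytes beyond that
    # width are taken to be the optional token text
    my $fixed_len = 12;   # e.g. three packed 32-bit fields
    my ($ts, $th, $atime) = unpack("VVV", $value);
    my $ttext = length($value) > $fixed_len ? substr($value, $fixed_len) : '';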
Comment 15 David Koppelman 2004-05-02 09:36:30 UTC
Created attachment 1933 [details]
Proposed implementation.
Comment 16 Daniel Quinlan 2004-05-02 13:18:30 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

I'd prefer a "bayes_hash_tokens" option (default = turned on) over this,
one that affected all storage methods:

 * on: hash tokens
 * off: no hashing or truncation of tokens

Simple and easy to understand.  Probably would require some hooks in
some storage methods to change the format of the tokens, maybe a per-DB
format flag.  Code would not need to (and probably should not) support
either/both formats.

Comment 17 David Koppelman 2004-05-02 14:17:14 UTC
It sounds a bit simpler, but there would be no way to go from hashed
to non-hashed tokens without starting with an empty database.

BTW, what do you mean by truncation of tokens?
Comment 18 Sidney Markowitz 2004-05-02 14:53:25 UTC
What about having a tool that a corpus could be run through which uses the
existing code in SpamAssassin to split up the input into tokens, hash them, and
then write to a database of tokens keyed on the hash?

As a standalone program with a separate database it would have no effect on
performance, but it would be available to be run when someone wants to be able
to analyze a Bayes db in terms of the original tokens. If there are a few extra
entries, that doesn't hurt, and if a few uninteresting tokens are missing that
probably doesn't hurt either.

This way there are no problems with configuration, backup/restore, etc., and the
tool would not have to be run often. If it is set up to add entries to a
dictionary, someone could run it periodically on saved ham and spam to keep a
dictionary built up, or they could maintain their research corpus and when they
want to analyze Bayes results for it run the tool over the corpus to make a
fresh database.
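
A sketch of what such a standalone tool could look like (hypothetical;
it assumes the 40-bit SHA1 truncation used for keys, and a naive
whitespace split stands in for the real tokenizer):

    use Digest::SHA1 qw(sha1);
    use DB_File;

    # build a hash -> original-token dictionary from a corpus on stdin
    tie my %dict, 'DB_File', 'token_dict.db' or die "tie: $!";
    while (my $line = <STDIN>) {
      for my $tok (split ' ', $line) {
        $dict{ substr(sha1($tok), -5) } = $tok;
      }
    }
    untie %dict;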
Comment 19 Daniel Quinlan 2004-05-02 14:57:38 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

> It sounds a bit simpler, but there would be no way to go from hashed
> to non-hashed tokens without starting with an empty database.

Yes, pick one or the other.  We have to maintain this code and it's
already too complicated.

Comment 20 Daniel Quinlan 2004-05-02 14:59:23 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

> What about having a tool that a corpus could be run through which uses
> the existing code in SpamAssassin to split up the input into tokens,
> hash them, and then writes to a database of tokens keyed on the hash.

Way too complicated.

Comment 21 Michael Parker 2004-05-02 15:16:23 UTC
While I'm not convinced that we need to do this, I'm willing to concede that
it's something we'll get asked about over and over again.

I've got a possible design that will make everyone happy here, so I'm taking it.

Comment 22 Justin Mason 2004-05-03 11:50:11 UTC
BTW, not to be a pain -- but are we planning to do the approach listed at bug
2266 comment 17? 

That seems like a pretty good way to be able to list the tokens in one common
situation, ie. when scoring a message, without having to keep the token text in
the db. 

Comment 23 Michael Parker 2004-05-03 11:54:46 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

On Mon, May 03, 2004 at 11:50:13AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From jm@jmason.org  2004-05-03 11:50 -------
> BTW, not to be a pain -- but are we planning to do the approach listed at bug
> 2266 comment 17? 
> 
> That seems like a pretty good way to be able to list the tokens in one common
> situation, ie. when scoring a message, without having to keep the token text in
> the db. 

We're already doing that.

Comment 24 Justin Mason 2004-05-03 12:33:20 UTC
Subject: Re:  Bayes option to keep original token as db data (not key). 

>We're already doing that.

OK, missed that.  sorry ;)

--j.

Comment 25 Michael Parker 2004-08-04 13:57:01 UTC
An update on this bug: I've got about 90% of my proposed implementation done.

Here are the basics:

1) Add a new config param bayes_ignore_raw_token, default is 0

2) New db format version 4:
  DBM - this adds an optional A* to the tokens packed value
  SQL - this adds a raw_token column, you can omit the column if you set
        bayes_ignore_raw_token to 1

3) In theory you can switch back and forth between ignoring and not ignoring the
raw token value and everything will work.  If you ignore for a period of time,
when you stop ignoring the code will fill in any blank values as it updates a
token (NOTE: it will only update when the spam/ham count changes, not when it is
touched).  Starting to ignore a previously unignored raw token won't remove it
from the database; you would have to --backup/--restore to remove it totally.

4) --backup/--restore will behave correctly.  The backup format will include the
token value if you are not ignoring.  Restore will parse the raw token value if
available and insert into database if you are not ignoring.

For SQL, advanced users can opt to not include the raw_token column in their
database schema.  With bayes_ignore_raw_token set to 1 the code will never refer
to that column so there should never be any sort of SQL error.  However, if the
user decides to stop ignoring then they will need to add that column back to the
schema.

Anyone game for trying to get this into 3.0.0?  It would mean that we could
possibly get away with not bumping the db version.  It would also reduce the
number of questions from folks about not seeing the actual token value via dump
and whatnot.
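
Roughly, the two storage changes described above (a sketch; the pack
layout here is an assumption, and the 128-character column size is
only settled later in this bug):

    my ($ts, $th, $atime, $raw_token) = (3, 0, time, 'a.html');

    # DBM: append the raw token to the packed value via an A* template
    my $value = pack("VVVA*", $ts, $th, $atime, $raw_token);

    # SQL: upgrading sites would add the new column, e.g. something like
    #   ALTER TABLE bayes_token ADD raw_token VARCHAR(128);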
Comment 26 David Koppelman 2004-08-04 14:10:51 UTC
Perhaps "bayes_keep_raw_tokens" would better indicate what it's doing (of course
using 0 for don't keep).
Comment 27 Michael Parker 2004-08-04 14:15:15 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

On Wed, Aug 04, 2004 at 02:10:52PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> ------- Additional Comments From koppel@ece.lsu.edu  2004-08-04 14:10 -------
> Perhaps "bayes_keep_raw_tokens" would better indicate what it's doing (of course
> using 0 for don't keep).
> 

My plan was actually to store the raw tokens by default.  Then let
advanced users, who want the speed/space benefits, disable it.

I don't mind making the config param bayes_keep_raw_tokens but I think
it should default to 1.

I think it will cut down a good bit on support type questions, "Hey
why can't I see my token data?"

Michael

Comment 28 David Koppelman 2004-08-04 14:20:04 UTC
My concern wasn't the default setting; it was that it would be unlikely that
anyone would guess the meaning of "ignore" without reading the documentation,
and for those who don't look at configuration files every day the meaning would
be easy to forget.
Comment 29 Michael Parker 2004-08-04 14:22:58 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

Point taken, I'm ok with changing the name.

Michael

Comment 30 Daniel Quinlan 2004-08-04 15:05:40 UTC
Subject: Re:  Bayes option to keep original token as db data (not key).

bayes_(store|keep|save)_(original|raw)_tokens

I think store > save >> keep and original >> raw.

Comment 31 Sidney Markowitz 2004-08-04 16:09:30 UTC
I really like how all the effects of this addition disappear, leaving no
overhead, if the option is set not to use it and the schema is set accordingly.

I agree that the default should be set for low volume users who may want to see
the tokens in order to better understand how Bayes works with their mail. If we
do that, we should document somewhere that is *very* hard to miss that we
have a set of suggestions for configuring SpamAssassin for high volume sites,
and write something up about all the tweaks that might be necessary for high
volumes. A HighPerformance entry in the Wiki would be fine if there is a pointer
to it.

I'm +1 on getting this in 3.0. If we don't then we have to deal with people
wanting to get their Bayes tokens information back, and then deal again with
letting them know that they have them again in 3.1 but that they have to change
their schemas yet again. If we do it now we may get away with no schema change
for 3.1.
Comment 32 Michael Parker 2004-08-08 20:25:46 UTC
Created attachment 2226 [details]
Patch File

This implements the feature pretty much as discussed.  The config option has
been changed to bayes_store_original_token and it defaults to on.

Opted to keep the db version at 3; this does create some support problems for
those who have 3.0.0 up and running with SQL, but they are easy to overcome.  We
will need a LARGE NOTICE in the release notes for the next pre/rc release.
Comment 33 Michael Parker 2004-08-08 20:32:34 UTC
Please review for inclusion in 3.0.0.

I've still got some testing to do: 1) upgrades from version 2 databases, 2)
making sure DBM v3 dbs pick up the changes OK; I know SQL v3 dbs are gonna need
an ALTER TABLE command.  I don't anticipate any problems.

Speed/size numbers are interesting: for DBM you get a 9% slowdown and a 96%
increase in db size.  For SQL the speed difference is statistically
insignificant; there is a size increase, however I haven't measured it yet.

I'll report more numbers once I run some more benchmarks.
Comment 34 Daniel Quinlan 2004-08-08 22:45:01 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

Is there a limit on the size of original tokens?

How much does this slow down DBM when the option is turned off vs. when
the patch has not been applied?

If you turn the option off, are original tokens removed over time?

I assume when upgrading from version 2 that the original tokens will be
stored in this field if the option is turned on.

Did you try ignoring hapaxes (single count tokens) to reduce the size
impact?

Comment 35 Sidney Markowitz 2004-08-08 23:18:30 UTC
> Did you try ignoring hapaxes (single count tokens)
> to reduce the size impact?

That's an interesting possibility. Since the hash can be considered to uniquely
identify the token, you could skip storing the token when creating a new entry,
then put it in the second time the token appears.

I think that should be an option if we do it, so people can choose the degree of
tradeoff between completeness of information about the tokens and disk space:
minimum space with no token information; medium space with information on
non-hapaxes; maximum space with information on all tokens.
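
In code, the deferred-storage idea might look something like this
(a sketch with made-up record fields, not from the codebase):

    my %db;   # hash-key => record hashref (stand-in for the real store)
    my ($hash, $raw_token, $atime) = ('K', 'a.html', time);

    if (my $entry = $db{$hash}) {
      # second sighting: the entry already exists, so store the text now
      $entry->{raw_token} ||= $raw_token;
    } else {
      # first sighting: create the entry without the raw token
      $db{$hash} = { ts => 0, th => 0, atime => $atime, raw_token => '' };
    }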
Comment 36 Justin Mason 2004-08-08 23:34:11 UTC
BTW one thing about hapaxes is that they have quite a large effect on the
resulting score.  (weird but true.)   This means that a dump without hapaxes may
still not be so useful; in other words, they're not "insignificant".
Comment 37 Daniel Quinlan 2004-08-09 01:18:06 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

> BTW one thing about hapaxes is that they have quite a large effect on
> the resulting score.  (weird but true.)  This means that a dump
> without hapaxes may still not be so useful; in other words, they're
> not "insignificant".

Urgh.  Let's avoid options if at all possible.  The sheer count of
options is very oppressive to new users.

Comment 38 Sidney Markowitz 2004-08-09 05:42:47 UTC
I agree about not having unnecessary options. _If_ we drop hapaxes it should be
an option. Therefore I wouldn't want to see dropping hapaxes.
Comment 39 Michael Parker 2004-08-09 12:24:25 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Sun, Aug 08, 2004 at 10:45:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> Is there a limit on the size of original tokens?
> 

Not currently, I'm contemplating a limit, at least on the SQL side, of
128 or 200.  Opinions?

> How much does this slow down DBM when the option is turned off vs. when
> the patch has not been applied?
> 

Hmmmm... anywhere from 0-7%; it was actually faster for the spamd
phase.  I'm inclined to run the test again; something seems fishy, but
it's possible I've got some inner-loop if statement that is causing a
problem.

> If you turn the option off, are original tokens removed over time?
> 

Yes, as tokens are updated they will be updated with blank raw_token
values.

> I assume when upgrading from version 2 that the original tokens will be
> stored in this field if the option is turned on.
> 

That's the idea, haven't tested this yet.

> Did you try ignoring hapaxes (single count tokens) to reduce the size
> impact?
> 

I forgot about this, very easy to do, just a matter of running some
benchmarks.

I'm still running benchmarks; each one takes 50 min to 1 hr to run.  Three
runs of each type of test times 3-4 tests per storage engine takes a
little while.

Michael

Comment 40 Sidney Markowitz 2004-08-09 12:29:00 UTC
I'm in favor of a limit to protect against a possible DoS and also to provide a
rational basis for choosing the size of the VARCHAR field in the db schema.
There is no reason for using tokens larger than that field. If we did, it would
break the mapping between the tokens in the db and the hash. I don't care if it
is 128 or 200.
Comment 41 Michael Parker 2004-08-09 12:32:54 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

Just to make it clear what I'm thinking: I won't limit the token size
until after the SHA1 hash has been computed.  So there might be a
largish token whose SHA1 hash matches the exact original, full-size
token, but the data stored in the database will be shortened to 128 or
200 characters.

Michael

Comment 42 Sidney Markowitz 2004-08-09 13:04:21 UTC
> I won't limit the token size until
> after the SHA1 hash has been computed

No, I don't like that at all. It doesn't protect against a DoS in the easiest
place to protect against it, and it leaves the database with entries that do not
have the useful two-way mapping between token and hash.

It would be much simpler to truncate tokens before hashing. There is not going
to be any statistical difference between two tokens that have the same 128 or
200 byte prefix as far as their being spam or ham signs.
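
That is, the truncate-first scheme in two lines (a sketch; the 40-bit
key truncation is an assumption about the existing hashing):

    use Digest::SHA1 qw(sha1);

    my $raw_token = 'x' x 500;             # oversized token
    my $tok = substr($raw_token, 0, 128);  # truncate *before* hashing
    my $key = substr(sha1($tok), -5);      # stored text always matches key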
Comment 43 Theo Van Dinter 2004-08-09 13:07:16 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Mon, Aug 09, 2004 at 12:32:55PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Just to make it clear what I'm thinking, I won't limit the token size
> until after the SHA1 hash has been computed.  So there might be a
> largish token whose SHA1 hash will match exactly the original size but
> the data stored in the database will be shorted to 128 or 200
> characters.

Yeah, that sounds like a good idea.  The plaintext token is meant to
be informational at this point, so I think even a small limit (32-96)
is fine.  Perhaps make truncation put "..." at the end or something to
note that the token was truncated.

Comment 44 Sidney Markowitz 2004-08-09 13:15:55 UTC
> The plaintext token is meant to be informational at this point,
> so I think even limiting small (32-96) is fine.

I'm ok with that as long as there is some truncation at the front end too so
that the token size is not open-ended. Umm, maybe that already is the case
because of the way tokens are generated? Do any other limits on header size or
line size in bodies or something else set a natural limit on token size? If the
way we have things coded now means that super large tokens will not DoS the
parsing of tokens or the computation of the hash, then I am fine on truncating
at the backend: the "informational only" argument is convincing to me.
Comment 45 Daniel Quinlan 2004-08-09 13:26:02 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

> Perhaps make truncation put "..." at the end or something to note that
> the token was truncated.

That could be done at display time, but I think it will be quite clear
that there's truncation by just looking at the values.  No reason to put
three bytes in the database.  :-)

Daniel

Comment 46 Justin Mason 2004-08-09 13:32:06 UTC
tokens are already stringently limited in size -- even lower than that.  (16
bytes plus a prefix is the typical truncation point.)

there's a little code to preserve really important data that we want to be able
to dump out -- such as IP addresses of relays found, From addrs, To addrs, etc.

so truncate to 128 bytes if you like, that should be loads of room.
Comment 47 Sidney Markowitz 2004-08-09 13:42:45 UTC
In the relatively rare case of the field being exactly the maximum length, the
code can tell if there was truncation by computing the SHA-1 hash and seeing if
it matches the stored hash. In that case, displaying "..." after the maximum
length would be unambiguously different from any real string that ends in "..."

Given what Justin said about the token size being limited, I have no objections
to this design, including truncating the stored tokens to 128 bytes and defining
the field in the schema as VARCHAR(128).
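
The display-time check described above, as a sketch (again assuming
the 40-bit SHA1 key truncation):

    use Digest::SHA1 qw(sha1);

    # if the stored text no longer hashes to the stored key, the text
    # must have been truncated, so flag it at display time
    my ($stored_text, $key) = ('a.html', substr(sha1('a.html'), -5));
    my $suffix = substr(sha1($stored_text), -5) eq $key ? '' : '...';
    print "$stored_text$suffix\n";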
Comment 48 Michael Parker 2004-08-09 13:54:42 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

I'm going to KISS: I will truncate at 128 and won't put any sort of
'...' indication unless someone starts screaming about it.

Michael

Comment 49 Michael Parker 2004-08-10 12:00:42 UTC
Created attachment 2233 [details]
Patch File

Not the final patch, just some updates to speed things up.
Comment 50 Michael Parker 2004-08-10 12:01:25 UTC
Removing [review] for now since I'm making some slight changes to the patch.
Comment 51 Michael Parker 2004-08-11 13:02:42 UTC
Created attachment 2236 [details]
Patch File
Comment 52 Michael Parker 2004-08-11 13:06:49 UTC
Ok, I'm happy with the patch now.  I will say that the Bayes storage code is
pretty well tuned and highly sensitive to even simple statements (i.e. if (foo)).
This code causes an approximately 2-3% slowdown of the storage code for DBM.

Still have a couple of tests to run, but it's ready to review.
Comment 53 Justin Mason 2004-08-11 13:51:59 UTC
+0.9: 

why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change?

+	(my $unpacked_token, $raw_token) = split(/\s+/, $token, 2);

split (/ /, ...) would probably be faster, if you can make that assumption.
ditto in the @@ -1594,14 change.


...BTW, you were asking what term to use for configuration settings; looks like
you went for "option".  worth noting that Apache uses "directive".

Comment 54 Daniel Quinlan 2004-08-11 14:11:21 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

> split (/ /, ...) would probably be faster, if you can make that assumption.
> ditto in the @@ -1594,14 change.

Maybe split (' ', ...) instead of that?

Comment 55 Michael Parker 2004-08-11 14:17:49 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Wed, Aug 11, 2004 at 01:52:00PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From jm@jmason.org  2004-08-11 13:51 -------
> +0.9: 
> 
> why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change?
> 

The tok_touch code used to update the raw_token, so I had made bindings
dynamic; then I removed that since a) tok_touch is never actually
called, and it wasn't easily done in tok_touch_all, and b) it wasn't
really necessary.  However, I left the dynamic nature of the bindings
in, just in case there was some future use.

I could back it out if it bothers folks.


> +	(my $unpacked_token, $raw_token) = split(/\s+/, $token, 2);
> 
> split (/ /, ...) would probably be faster, if you can make that assumption.
> ditto in the @@ -1594,14 change.
> 

The backup format has \t between the elements; however, I've seen some
unpack(H*) tokens with spaces at the end, so we need to check for one
or more whitespace characters.

Michael

Comment 56 Justin Mason 2004-08-11 14:29:39 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key). 

> > split (/ /, ...) would probably be faster, if you can make that assumption.
> > ditto in the @@ -1594,14 change.
> 
> Maybe split (' ', ...) instead of that?

no -- the speed profiles would be

    split (' ', ...) is slower than
    split (/\s+/, ...) is slower than
    split (/\s/, ...) is slower than
    split (/ /, ...)

' ' has various additional semantics (like skipping whitespace at
start of line and EOL iirc).

--j.
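
(As an aside, the claimed ordering is easy to check with a Benchmark
sketch like the following -- illustrative only, not code from the
thread:)

    use Benchmark qw(cmpthese);

    my $s = "deadbeef  token";
    cmpthese(-1, {
      'split-space-str' => sub { my @f = split(' ',   $s, 2) },
      'split-ws-plus'   => sub { my @f = split(/\s+/, $s, 2) },
      'split-ws'        => sub { my @f = split(/\s/,  $s, 2) },
      'split-space-re'  => sub { my @f = split(/ /,   $s, 2) },
    });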

Comment 57 Justin Mason 2004-08-11 14:32:55 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key). 

> > why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change?
> 
> The tok_touch code used to update the raw_token so I had made bindings
> dynamic, then I removed that since a) tok_touch is never actually
> called and it wasn't easily done in tok_touch_all and b) wasn't really
> necessary.  However, I left the dynamic nature of the bindings in,
> just in case there was some future use.
> 
> I could back it out if it bothers folks.

No -- just wondering why a temporary copy was used in that one function.
I can see that in the other function that uses it, it does change the contents
of @bindings, but in that one function, it just uses it as-is, like so:

@@ -1025,6 +1026,8 @@
 
   return 0 unless (defined($self->{_dbh}));
 
+  my @bindings;
+
   # shortcut, will only update atime for the token if the atime is less than
   # what we are updating to
   my $sql = "UPDATE bayes_token
@@ -1033,9 +1036,10 @@
                 AND token = ?
                 AND atime < ?";
 
-  my $rows = $self->{_dbh}->do($sql, undef, $atime, $self->{_userid},
-			       $token, $atime);
+  @bindings = ($atime, $self->{_userid}, $token, $atime);
 
+  my $rows = $self->{_dbh}->do($sql, undef, @bindings);
+
   unless (defined($rows)) {
     dbg("bayes: tok_touch: SQL Error: ".$self->{_dbh}->errstr());
     return 0;


note that the assignment to @bindings comes immediately before its
only use in that fn, so it seems superfluous to use it there
instead of just 

   my $rows = $self->{_dbh}->do($sql, undef,
                $atime, $self->{_userid}, $token, $atime);

> > +	(my $unpacked_token, $raw_token) = split(/\s+/, $token, 2);
> > 
> > split (/ /, ...) would probably be faster, if you can make that assumption.
> > ditto in the @@ -1594,14 change.
> 
> The backup format has \t between the elements, however I've seen some
> unpack(H*) tokens with spaces at the end, so we need to check for 1 or
> more whitespaces.

ok.  shame ;)

--j.

Comment 58 Michael Parker 2004-08-11 14:58:38 UTC
Created attachment 2238 [details]
Patch File

Fixes a bug when upgrading from a 2.64 database, was ignoring the
bayes_store_original_token setting.

Also removes the extra @bindings change in tok_touch since it isn't necessary.
Comment 59 Justin Mason 2004-08-11 15:33:14 UTC
+1
Comment 60 Daniel Quinlan 2004-08-11 23:15:53 UTC
Okay, I need to get this out.  This is excellent code.  I love the patch.

I have major concerns about adding this option, though:

1. In the best case (turned off) it slows down DBM Bayes by adding 3.3% more
   time. I don't really care about the extra time and space when it is turned
   on.  At many sites, it's not even possible for users to dump the Bayes
   database, so this is pure overhead.  HAMMYTOKEN and SPAMMYTOKEN also work
   without it (although I should note I think we should drop those too).

2. It's another option to support.

3. I don't think it's needed.  We had a few questions when we released pre1 and
   before that, but I've gotten used to not looking and nobody really needs to
   look.

4. It complicates the code.

If we were a commercial project, there would be no bloody way this would go
into the code.  Seriously, only nosy technical people care, and apparently not
for too long, judging by the 3.0 experience so far.
Comment 61 Justin Mason 2004-08-11 23:21:20 UTC
I think leaving it out removes functionality that a lot of people *do* expect.
We'd be pretty much the only bayes impl, apart from dspam and CRM-114 (the
n-gram systems), that doesn't support dumping, in that case.

However I could compromise by setting it off by default, and noting clearly in
the doco how to turn it on...
Comment 62 Theo Van Dinter 2004-08-11 23:49:33 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Wed, Aug 11, 2004 at 11:21:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I think leaving it out, removes functionality that a lot of people *do* expect.
> we'd be pretty much the only bayes impl apart from dspam and CRM-114 (the n-gram
> systems) that don't support dumping, in that case.

Well, that's not really true.  There are tons of Bayes-style filters out
there (how many MUAs have built-in Bayes-style filters?) which don't do
dumping.  If you compare us to other OSS projects, you could be right.
However, I do agree with Quinlan in that normal users won't care.
The more technical folks, or the folks who are more involved with things
like running sa-learn, may care.

For example, I was upset when the hashing first went in, but then I
realized I have better things to do with my time than going through
the tokens in my Bayes DB.  The rest of the users on my box, however,
haven't said anything, leading me to think they don't ever look at the
dump output.

> However I could compromise by setting it off by default, and noting clearly in
> the doco how to turn it on...

The issue, though, is the performance penalty even with the setting off.

Comment 63 Michael Parker 2004-08-12 00:06:55 UTC
Created attachment 2240 [details]
Data Collection Spreadsheet

Here is a spreadsheet from my benchmark runs.  All numbers are in seconds.  Let
me know if you have any questions.

A brief description of the benchmark:
Phase I: Learn 2000 ham and 2000 spam
Phase II: Scan 2000 ham and 2000 spam using spamd and parallel spamc
processes
Phase III: Force Expire
Phase IV: Forget first 1000 ham and 1000 spam from Phase I
Phase V: Use spamassassin to scan 2000 ham and 2000 spam

BTW, disk space requirements basically double between the store and non-store
cases; that was part of the spreadsheet but I got tired of tracking it.
Comment 64 Michael Parker 2004-08-12 00:11:14 UTC
Created attachment 2241 [details]
Data Collection Spreadsheet

Fixed a couple of stupid cut and paste typos.  Everything I said in the
previous note still applies.
Comment 65 Michael Parker 2004-08-12 08:39:03 UTC
Created attachment 2245 [details]
Data Collection Spreadsheet

I must have been half asleep when I labeled some of those columns last night;
please disregard the other spreadsheets, they are totally mislabeled.
Comment 66 Daniel Quinlan 2004-08-12 13:25:03 UTC
Maybe it's just a bad idea to integrate this into SpamAssassin directly
due to the 3-4% performance drop; perhaps a separate script/application
which calls the tokenizer and can generate a mapping database is the right
solution for the curious people out there.
Comment 67 Sidney Markowitz 2004-08-12 15:22:25 UTC
A slowdown of 3-4% when all that has to be done is test a flag to skip over code
makes no sense to me. Where is the overhead happening? Could that extra overhead
be a bug or a design error?
Comment 68 Michael Parker 2004-08-12 15:25:59 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Thu, Aug 12, 2004 at 03:22:26PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From sidney@sidney.com  2004-08-12 15:22 -------
> A slowdown of 3-4% when all that has to be done is test a flag to skip over code
> makes no sense to me. Where is the overhead happening? Could that extra overhead
> be a bug or a design error?

No clue, if you can find it I'd love to know where.

Michael

Comment 69 Justin Mason 2004-08-12 15:38:32 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key). 

> > A slowdown of 3-4% when all that has to be done is test a flag to skip
> > over code makes no sense to me. Where is the overhead happening? Could
> > that extra overhead be a bug or a design error?
> 
> No clue, if you can find it I'd love to know where.

I think it may be at a higher level -- passing a parameter down through 3
levels of method calls (?) does have an effect on runtime, esp. when the
lowest level is called very frequently, as tok_pack()/tok_unpack() are.

--j.

Comment 70 Sidney Markowitz 2004-08-12 16:11:42 UTC
> passing a parameter down through 3 levels of method calls (?) 
> does have an effect on runtime, esp. when the lowest level
> is called very frequently, as tok_pack()/tok_unpack() are.

That's what I mean... If there is some test being repeated inside some very
small inner loop that makes _that_ much difference to the runtime, I would think
that it could be lifted out with minimal duplication of code.

Even more important, this may point to some place that can be optimized to
improve performance. What if it turns out that a tweak to the inner loop could
give us a 3-4% _improvement_ over the existing code?

Unfortunately I don't have the time right now to put into this. Can one of you
more perl-wizardly folks who is handy with a profiler see if there is anything
obvious to tweak?
Comment 71 Theo Van Dinter 2004-08-12 17:22:46 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Thu, Aug 12, 2004 at 04:11:43PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> > passing a parameter down through 3 levels of method calls (?) 
> > does have an effect on runtime, esp. when the lowest level
> > is called very frequently, as tok_pack()/tok_unpack() are.

so I haven't looked recently, but what am I missing here?   tok_pack gets
called with the raw data; it does the SHA1 and the raw data if necessary.
The rest of the time, the raw token ought to be passed around internally
anyway.

this shouldn't be a big time suck.

Comment 72 Michael Parker 2004-08-12 17:31:09 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Thu, Aug 12, 2004 at 05:22:47PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> so I haven't looked recently, but what am I missing here?   tok_pack gets
> called with the raw data, it does the SHA1 and the raw data if necessary.
> the rest of the time, the raw token ought to be passed around internally
> anyway.
> 
> this shouldn't be a big time suck.
> 

The SHA1 happens in tokenize; as the code stands today, the
tok_pack/tok_unpack methods have no idea what the raw token value is.

The new code passes the raw_token down the various levels, along with
the SHA1 token and when it gets to tok_pack packs it into the value
portion with the SHA1 token as the key.

Michael

Comment 73 Loren Wilton 2004-08-12 18:08:50 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

> I think it may be at a higher level -- passing a parameter down through 3
> levels of method calls (?) does have an effect on runtime, esp. when the
> lowest level is called very frequently, as tok_pack()/tok_unpack() are.

If this is performance-critical and the bottom routine or two are fairly
small, might it make more sense to duplicate the bottom routine or two
without the extra parameter and decide which routine to call at a higher
level?  This would eliminate passing the parameter to the most-called
routines, and possibly would move the test decision to some place executed
less often than once per token.

        Loren

Comment 74 Malte S. Stretz 2004-08-13 06:22:00 UTC
I just had a quick look at this thread, but if passing variables through
functions is really that expensive in Perl, maybe we should work with a bunch
of instance-global state variables then?  I'm no fan of global variables, but
if it really gives a performance improvement, they should be ok.  And they
shouldn't have any drawbacks, as one instance of the objects is used for a
single message at a time only (I think).
Comment 75 Michael Parker 2004-08-13 06:53:43 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

It's all of the sanity checking, for instance:

In tok_pack:

+  # raw_token should be limited to 128 characters and should not be undefined
+  $raw_token = (defined($raw_token) ? substr($raw_token, 0, 128) : '');

In tok_sync_counters:

+  # if a $raw_token didn't get passed in, but there was one set, keep the original
+  # but only if we want to store the original token.  This allows raw token values
+  # to filter out of the database when the option gets turned off.
+  if (!$raw_token && $self->{bayes}->{conf}->{bayes_store_original_token}) {
+    $raw_token = $orig_raw_token;
+  }
+

In tok_count_change:

+  $raw_token = '' unless (defined($raw_token));
+


Sometimes $raw_token is checked 3-4 times before it finally gets
packed.  It needs to be that way because of the spaghetti code in
DBM.pm.  If we backed up and cleaned up the internal methods a little
we might be able to make things better.

Michael

Comment 76 Michael Parker 2004-08-13 11:50:37 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

-1 on the patch and feature

Believe me, I don't really like doing it; I spent a large amount of
time coding it up and testing/benchmarking it all, but I feel it needs
to be done.

Even after a call for comments on the -users list there was little
comment, and no one came forward with a compelling reason for keeping
the token values around.

The patch in its best case causes a 3-4% slowdown; for some things
it's a 45-48% slowdown.  Even stripping out what I would call critical
sanity-checking code there is a 1-2% slowdown, and then I'm no longer
confident in the data being stored in the database.

Perhaps it's enough to have done the patch and placed it in this bug
report; then, when someone comes up with a reason, we can refer them to
the patch and they can apply it locally.  Obviously it wouldn't be
officially maintained, but that doesn't mean someone else couldn't
maintain the patch and re-attach it as needed.

Michael

Comment 77 Justin Mason 2004-08-13 12:10:54 UTC
my commiserations. ;)

I think we should at least release an RC build with the "opaque tokens" code as
we currently have it in SVN, and see if any people find the lack of this feature
to be really annoying... just to give the change a little more visibility. 
Perhaps point it out in the change notice.

It really does appear that nobody misses it very much, and the benefits (speed)
outweigh the loss (the feature), in which case we should leave this code out as
you suggest.

(BTW Dan's suggestion -- an external tool that can "fill in" an optional "side"
database of hash->raw_token mappings for sa-learn --dump to display -- may still
be an option in future.  but I think that can be implemented in a non-invasive way.)
Comment 78 Sidney Markowitz 2004-08-13 12:42:30 UTC
I agree with not putting this into 3.0.

An add-on to dump tokens and generate a dictionary is always an option as long
as the original source of the email exists. That should take care of anyone who
wants to research the Bayes stuff. The information is not lost just because it
is not stored in the database, as long as there is the corpus that it came from,
and anyone researching Bayes would need to keep the corpus anyway.

I still think that the results of this attempt may point to some optimization
that we can get when someone has the time to pursue it. Michael referred to the
"spaghetti code" in DBM.pm as being in the way of removing parameters being
passed down through many levels and having to check them in many places. Perhaps
this points to benefits from cleaning that up. Of course this would be for some
version past 3.0.
Comment 79 Michael Parker 2004-08-13 13:53:52 UTC
Created attachment 2249 [details]
Patch File

Different concept, thanks to Daniel and Justin.

Added a plugin hook to Bayes.pm after each call to tokenize.  It passes in the
tokenize hashref, what function we're currently performing, and whether the
message is spam or ham for the learn and forget functions.
Comment 80 Michael Parker 2004-08-13 13:54:57 UTC
Please review.

I'm removing my veto.

With this patch it is very easy to setup a plugin to save off the data you are
interested in.
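
For example, a plugin against this hook might look like the sketch
below (the hook name comes from comment 82's "bayes_tokens" mention;
the argument names are assumptions from comment 79, not the final
committed API):

    package RawTokenLogger;
    use strict;
    use warnings;
    use Mail::SpamAssassin::Plugin;
    our @ISA = qw(Mail::SpamAssassin::Plugin);

    sub bayes_tokens {
      my ($self, $opts) = @_;
      # $opts->{toksref}:  hashref returned by tokenize
      # $opts->{function}: 'scan', 'learn' or 'forget'
      # $opts->{isspam}:   set for the learn/forget functions
      for my $raw (keys %{ $opts->{toksref} }) {
        warn "$opts->{function}: $raw\n";   # or save to a side database
      }
      return 1;
    }

    1;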
Comment 81 Justin Mason 2004-08-13 13:57:40 UTC
+1 -- like it ;)
Comment 82 Theo Van Dinter 2004-08-13 19:45:02 UTC
if we're going to do a plugin (+1 on that idea btw), we should make it a little more generic imo.

a few things come to mind immediately:

- bayes_tokens could just be bayes
- scan should wait until after the probability calculations to call the plugin.  that way the plugin can get 
the ham/spam count, probability, msgatime, etc.
- we can pass in the msgid (it's generated)
- learn has msgatime as well (forget doesn't need it)
Comment 83 Michael Parker 2004-08-13 19:55:44 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Fri, Aug 13, 2004 at 07:45:03PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From felicity@kluge.net  2004-08-13 19:45 -------
> if we're going to do a plugin (+1 on that idea btw), we should make it a little more generic imo.
> 
> a few things come to mind immediately:
> 
> - bayes_tokens could just be bayes
> - scan should wait until after the probability calculations to call the plugin.  that way the plugin can get 
> the ham/spam count, probability, msgatime, etc.
> - we can pass in the msgid (it's generated)
> - learn has msgatime as well (forget doesn't need it)
> 

I can go along with that.  Should there be three plugin calls?
bayes_scan
bayes_learn
bayes_forget

Since each one will pass in slightly different data -- or should we keep the
"function" key that folks can use to do the right thing?

Michael

Comment 84 Theo Van Dinter 2004-08-13 20:29:51 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

On Fri, Aug 13, 2004 at 07:55:46PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I can go along with that.  Should there be three plugin calls?
> bayes_scan
> bayes_learn
> bayes_forget
> 
> Since each one will pass in slightly different data, or keep the
> "function" key that folks can use to do the right thing?

The three functions sound good.  Potentially that function could be
called 3 times in the same run (scanning an already learned message,
autolearn tries to learn it the other way, needs to call forget ...)
So it would be cleaner to separate them out imo. :)

Comment 85 Daniel Quinlan 2004-08-13 22:52:21 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

> The three functions sound good.  Potentially that function could be
> called 3 times in the same run (scanning an already learned message,
> autolearn tries to learn it the other way, needs to call forget ...)
> So it would be cleaner to separate them out imo. :)

I'd also like there to be a way for a single "get the tokens" method to
be easily called just once per message without tracking too much state
in the plugin (whether we're learning, forgetting, or scanning).

Comment 86 Theo Van Dinter 2004-08-14 12:31:52 UTC
I was just thinking, calling the plugin on dump() would be good too.  perhaps something like:

dump_start()
 - happens before data is given out
dump_data()
 - called per token output line.  can perhaps return something to be appended to the dump output?
dump_end()
 - happens before dump returns
Comment 87 Justin Mason 2004-08-14 14:46:11 UTC
I agree that the 3 functions approach sounds good; having 1 API that receives
different combinations of parameters depending on when it's called is a bad API
design, in my opinion.  

(explanation: it's effectively 3 different APIs with different results and args
depending on what method it's called from.  By mixing them into one plugin API,
you move the work of ensuring the right code is run, from the API-design level
to the user-implementation-code level; in terms of Rusty Russell's "interface
simplicity spectrum" that's from level 5 to level 7.

  http://sourcefrog.net/weblog/software/aesthetics/interface-levels.html
  http://www.ozlabs.com/~rusty/ols-2003-keynote/ols-keynote-2003.html

Plus, there could be cases where users don't *want* to know about scan operation
token use, just learn-op token use.  This allows them to override just one of
the APIs and ignore the other.) 

Also: regarding 'a single "get the tokens" method' -- either the plugin
implementor would have to track the state, or Bayes.pm would have to track the
same state.  I think better let the plugin implementor do it. However, here's a
good way to avoid making it too hard -- add these APIs:

  - bayes_start_message($isspam, $generatedmsgid)
    - called when a single message is about to be learned
  - bayes_scan_tokens($toksref, $msgatime)
    - pass the list of tokens from that message, for a scan op
      (do we have $msgatime here?)
  - bayes_learn_tokens($toksref, $msgatime)
    - pass the list of tokens from that message, for a learn op
  - bayes_forget_tokens($toksref)
    - pass the list of tokens from that message, for a forget op
      ($msgatime is irrelevant)
  - bayes_finish_message()
    - called when that single message learning op has been completed

The bayes_start_message() and bayes_finish_message() APIs allow the plugin to
know when a new message is being operated on.  So the call order would be for
example:

    $plugin->bayes_start_message(...);
    $plugin->bayes_learn_tokens(...);   # dump the tokens
    $plugin->bayes_finish_message(...);
    $plugin->bayes_start_message(...);
    $plugin->bayes_forget_tokens(...);  # msg was previously learned
                                        # dump the tokens
    $plugin->bayes_learn_tokens(...);   # we've already dumped 'em, not again
    $plugin->bayes_finish_message(...);

So an implementor would just have to keep a single boolean that's set to 1
during a foo_tokens() call, set to 0 during bayes_start_message() -- then if
foo_tokens() finds it already set to 1 it doesn't dump the tokens because
that's already been done once on that msg.
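
As a sketch, that bookkeeping is tiny (method names from the proposal
above, body hypothetical):

    # inside the plugin class
    sub bayes_start_message { my ($self) = @_; $self->{dumped} = 0; }

    sub bayes_learn_tokens {
      my ($self, $toksref, $msgatime) = @_;
      return if $self->{dumped}++;   # tokens already dumped for this msg
      # ... dump the tokens here ...
    }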

Theo: dump plugin APIs would be useful, good point -- maybe let dump_data()
rewrite the line arbitrarily.  Suggest something like this API:

$line = dump_data({
        origline => ...,
        prob => ...,
        ts => ...,
        th => ...,
        atime => ...,
        encoded_tok => ...
    })

(in other words, give it the *unstringified* data as well so that it doesn't
have to parse them out of the line.)

And then that's probably enough APIs for now ;)

Comment 88 Michael Parker 2004-08-14 23:29:10 UTC
Created attachment 2252 [details]
Patch File

Split the bayes_token call out into separate plugin calls for bayes_scan,
bayes_forget and bayes_learn.

Needs slightly better docs for Plugin.pm (define atime and the format of
toksref specifically).

I'm +1 with updated docs; since I may be out of touch for the next couple of
days, someone can feel free to pick up the patch and fix it.
Comment 89 Justin Mason 2004-08-14 23:32:09 UTC
ok, since this adds the basic APIs we need, and this is the last bug we need to
fix before we can do an rc1, I suggest:

1. we vote and apply this patch
2. we fix up the documentation a little in C-T-R doco-fix mode afterwards.  (in
particular, the format of toksref needs to be documented in Plugin.pm.)

my vote: +1
Comment 90 Theo Van Dinter 2004-08-14 23:37:41 UTC
we may want to also note that if people want to get the raw/final token mappings, doing it in learn() is 
the way to go.

anyway, per our discussion on irc, I'm +1 on 2252 for now, and we'll have to work more on bayes 
plugin api for 3.1 (ideas were floated about allowing a complete tokenizer plugin, etc.)
Comment 91 Michael Parker 2004-08-15 06:56:20 UTC
Created attachment 2254 [details]
Patch File

New patch with doco update.  I'm still +1, feel free to commit if I'm not
around.
Comment 92 Theo Van Dinter 2004-08-15 10:48:55 UTC
+1 on 2254
Comment 93 Malte S. Stretz 2004-08-15 12:40:06 UTC
+1 on the API, I don't have the time to review the patch itself. 
Comment 94 Justin Mason 2004-08-15 22:25:03 UTC
+1 on the new patch anyway
Comment 95 Michael Parker 2004-08-15 22:30:07 UTC
Committed revision 36447.
Comment 96 Daniel Quinlan 2004-08-15 22:32:40 UTC
Subject: Re:  [review] Bayes option to keep original token as db data (not key).

+1 looks good to me (waited for off-line benchmark results from Michael)