SA Bugzilla – Bug 3331
[review] Bayes option to keep original token as db data (not key).
Last modified: 2004-08-15 14:30:07 UTC
With r10394 (bug 3225) the Bayes database no longer retains the original token, instead it uses a hashed version as a key. As noted elsewhere there are many good reasons to do this, such as compactness, efficiency, and privacy. However one can no longer get the original token from the database and so those tuning Bayesian classification can no longer get token statistics such as the hammiest or spammiest tokens. For that reason we might implement Sidney Markowitz's suggestion [bug 2266 comment 14] that as an option the original tokens be retained. Michael Parker pointed out the maintenance difficulties of having to support a database which could be keyed either on a hash or the full token [bug 2266 comment 15]. Instead the unhashed token might be stored as data while still using the hashed token as a key. If the original tokens are not to be stored then a NULL or zero-length string would be stored, otherwise the original token. This way there is just one database format. I'm assuming that a string field holding only NULLs will have little performance impact in dbfile and the SQL implementations, so that users not keeping original tokens will be unaffected. If there are no comments on this after a few days I'll cobble together a patch for the DBM code.
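The proposed layout can be sketched like this (Python used purely for illustration; the field order, the 5-byte digest truncation, and the function name are assumptions, not SpamAssassin's actual code):

```python
import hashlib
import struct

def store_token(db, token, n_spam, n_ham, atime, keep_original=False):
    """Store a Bayes token keyed on its hash, with the text as optional data."""
    # The key is always a (truncated) digest of the token, never the token itself.
    key = hashlib.sha1(token.encode("utf-8")).digest()[:5]
    # The original text is data, not key: a zero-length string when the
    # option is off, so the record format is identical either way.
    text = token.encode("utf-8") if keep_original else b""
    db[key] = struct.pack("!IIi", n_spam, n_ham, atime) + text

db = {}
store_token(db, "a.html", 7, 0, 1083110000, keep_original=True)
```

Scoring only ever needs the fixed-width counts at the front of the value, so in this scheme the presence or absence of the trailing text would not affect normal operation.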
Subject: Re: New: Bayes option to keep original token as db data (not key). On Thu, Apr 29, 2004 at 07:12:30AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > efficiency, and privacy. However one can no longer get the original > token from the database and so those tuning Bayesian classification > can no longer get token statistics such as the hammiest or spammiest > tokens. I guess I fail to see how this is useful. You can now get the significant tokens in the headers (Bug 2266). Are you hand-tweaking your bayes database?
This is not for statistics on an individual message, but for the entire database. For example, the hammiest tokens in my database come from the address and fax number where I work, since they are in the sigs of a lot of my email. Amusingly, one of the spammiest tokens is "a.html". This information might be useful in coming up with improved tokenizing or token-weighting schemes. It's also a good way to spot some bugs, such as atime problems. And no, I'm not hand-tweaking the database; that would be way too labor-intensive to be sustainable!
Subject: Re: Bayes option to keep original token as db data (not key). I'm -0 on doing this. It was debated for a long time prior to the change going in. It's a win-win all around as far as performance goes. I struggled with making the plain text tokens go away, but the performance gain outweighed my desire to see them in their plain text form. Since I rarely find myself needing to look at the actual tokens and their values to tweak, I'll defer to others. Michael
Just as a reminder, the change I'm suggesting would not require a separate database format and might not have an impact on performance for those not storing the original tokens. See the opening comment (comment 0?).
Subject: Re: Bayes option to keep original token as db data (not key). On Thu, Apr 29, 2004 at 09:16:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Just as a reminder, the change I'm suggesting would not require a > separate database format and might not have an impact on performance > for those not storing the original tokens. See the opening comment > (comment 0?). It has all sorts of implications, not just database format. And yes, it would need to be a config option. 1) How do you handle backup/restore? 2) Where are you going to put this data? In a separate db file/table or alongside the current data? Be careful, your answer may require a ton more work. 3) Maintenance issues. 4) How do you handle the case where someone has this option turned off, but then turns it on? Patches welcome of course. Michael
Michael Parker wrote: > It has all sorts of implications, not just database format. And yes, > it would need to be a config option. Default off, of course. > 1) How do you handle backup/restore? I don't see the issue, other than having to add code to write or read the original token. > > 2) Where are you going to put this data? In a separate db file/table > or alongside the current data? Be careful, your answer may require > a ton more work. I'd put the original token in the same record used to hold the other information (num ham, num spam, atime). I'm assuming db_file and SQL can handle a variable-length string field and that their implementations do so in a reasonable manner. (That is, there would be no big performance difference between a table without variable-length string fields and one with a string field that only stores nulls.) For a new token an original token (OT) field would be initialized to the token or to a null, depending on the option. When reading a token the OT field would be ignored if the OT option were off. The OT data would only be used for statistics, so its absence would not affect normal operation. Turning the option on and off should present no problems because there is no change in database format. There are several ways of dealing with the stored tokens when the option is turned off. One is to immediately delete all the tokens. (Time-consuming, especially for someone who turned OT off by accident.) Instead of immediately deleting the tokens, a separate "wipe tokens" command could be provided. I'm assuming the OT option would only be kept by those who know what they're doing, and so the slight complication of a separate "wipe tokens" command is acceptable. > > 3) Maintenance issues. That's why the original token is a data field, and not a key. > 4) How do you handle the case where someone has this option turned > off, but then turns it on? See above. > > Patches welcome of course.
I don't have SQL installed so there is no way I could test any changes for that back end. If no one else starts working on the changes in a few days I'll make them for dbfile (DBM).
Subject: Re: Bayes option to keep original token as db data (not key). On Thu, Apr 29, 2004 at 09:58:19AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > > 1) How do you handle backup/restore? > > I don't see the issue, other than having to add code to write or read > the original token. > So when you run backup do you get the hashed token or the original token? What if not all original tokens are available? > > > > 2) Where are you going to put this data? In a separate db file/table > > or alongside the current data? Be careful, your answer may require > > a ton more work. > > I'd put the original token in the same record used to hold the other > information (num ham, num spam, atime). I'm assuming db_file and SQL > can handle a variable-length string field and that their > implementations do so in a reasonable manner. (That is, there would be > no big performance difference between a table without variable-length > string fields and one with a string field that only stores nulls.) You just changed the database format. It's easy enough to just add onto a value in DBM, but since we pack several items into a single value it does require a database format change. SQL does not use variable-length columns for data storage anymore. So you would have to add a column to the bayes_token table to place this value in, so it would change the database format. > > > > Patches welcome of course. > > I don't have SQL installed so there is no way I could test any changes > for that back end. If no one else starts working on the changes in a > few days I'll make them for dbfile (DBM). > Like I said, I welcome any patches; I think you'll find the storage code slightly more complicated once you start digging into it. After thinking on this a little I have a possible proposal that involves a separate db_file/table, but I haven't worked everything out yet. I'll think on it some more and reply back to see if it interests folks.
Michael Parker wrote: > So when you run backup do you get the hashed token or the original > token? What if not all original tokens are available? Always the hashed token, and the original token if it's available. The absence of the OT is no problem, just store a NULL. > > > 2) Where are you going to put this data? [snip] > > > > I'd put the original token in the same record used to hold the other > > information (num ham, num spam, atime). [snip] > > You just changed the database format. [snip] Yes, in that sense I did. What I meant by no new database format was that it would be the same format with the OT option on or off. > After thinking on this a little I have a possible proposal that > involves a separate db_file/table but I haven't worked everything out > yet. I'll think on it some more and reply back to see if it interests > folks I'd appreciate it if Michael or anyone else working on this problem would post here before starting.
Subject: Re: Bayes option to keep original token as db data (not key). On Thu, Apr 29, 2004 at 10:09:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > After thinking on this a little I have a possible proposal that > involves a separate db_file/table but I haven't worked everything out > yet. I'll think on it some more and reply back to see if it interests > folks. Just to put in my two cents... I too will miss being able to tell what actual tokens are what, but if it gives a performance gain, fine. That information is largely, well, informational. What matters for scoring is that we see "block of data", not how we choose to represent said block. In general, we do _NOT_ want to add unnecessary complexity to this system. It's bad for performance, and it's horrid for maintenance. Adding the original token to the same DB is pointless; you'll eliminate all the performance benefits, at least in DBM. Adding another DB is doable, but fairly complex. Expiry is going to be even more complex than it is now, and take more time. Backup/restore would just need to dump/restore the extra DB. The code would have to always access the new DB so that the original tokens can be accessed for dump and header display. Oh, and internally, we'd need to track both the hashed and the original. Oh, and the journal will have to be modified to have a "hash-to-text" line type. Not to mention issues with doing multi-word tokens. There's very likely more, but this scares me enough. In short, it adds much more complexity than I'm comfortable with. So unless there's a compelling reason, I'm -1 on the idea of keeping hash<->text data around.
I like the general idea of having the actual tokens available somehow for analysis. I'm concerned about possible effects on performance if they are in the main database. Michael's thoughts about getting them in a separate optional table may be the way to go. An important part of the optimizations was getting the records to be all fixed length. A quote from the MySQL manual section on optimization: "For MyISAM tables that change a lot, you should try to avoid all variable-length columns (VARCHAR, BLOB, and TEXT). The table will use dynamic record format if it includes a single variable-length column." Note that such a table of tokens would contain information that is more privacy sensitive than the Bayes database that has only hashes. That might affect sharing of database information. That said, I would like to see a suggestion that dealt with all the issues in a sufficiently compelling way to get Theo to change his mind about his -1. I just don't know what all would be in such a suggestion. There may be something coming out of the idea that all we are interested in having is the ability to usually look up the original text given a hash when doing some analysis, but the mapping does not have to be maintained, expired, etc., to the same degree of precision and reliability as the rest of the database.
'all we are interested in having is the ability to usually look up the original text given a hash when doing some analysis, but the mapping does not have to be maintained, expired, etc., to the same degree of precision and reliability as the rest of the database.' agreed. BTW, I know Dan was expressing some concerns last night about exposing too much of the bayes data, saying that even with that info, it's not much use for people to do stuff with -- sure, you know that token "foo.html" is a good spam sign, but in what way does that help, given that it's not really possible to actually edit the bayes db token-by-token? This is a good point and needs some discussion.
Oh, BTW -- I do think, however, that a degree of visibility into the bayes calculation, and the tokens it uses, and even the tokens in the db, is very valuable. It's part of the SpamAssassin "philosophy" to be open about how the classification was decided -- that's why we list the names of the rules that hit in the X-Spam-Status header, in the report, that's why we have rule descriptions, readable rules files, and so on.
Justin Mason wrote: > BTW, I know Dan was expressing some concerns last night about > exposing too much of the bayes data, saying that even with that > info, it's not much use for people to do stuff with -- sure, you > know that token "foo.html" is a good spam sign, but in what way does > that help, given that it's not really possible to actually edit the > bayes db token-by-token? For the end user, providing db statistics showing the original tokens would serve the same function as showing individual token scores for a message: it would help the user understand what SpamAssassin is doing, so the user would have reasonable expectations for what it can and cannot do, and it would let the user know why he or she should take more care in which messages are learned. BTW, I myself consider this a weak reason for keeping the original tokens, given that the tokens can already be shown in the message. The much more important reason for keeping the original tokens is as a source of ideas on how to make future improvements. For example, one improvement that has been suggested (and used elsewhere) is to tokenize word pairs. Examining the database might give hints about whether too many rare word pairs are being retained. Another question that can be examined is how much random character strings (used in some spam) inflate the database. One can have a look. As for "a.html", perhaps tokenizing markup would help.
I've worked up a patch that allows the original text (which is called token text or ttext in the code) to be optionally kept in the database; this only works for DBM. The default value is off. The new keep_token_text option can be turned on or off at any time. While off, token text will be deleted as tokens are encountered (rather than wiping the whole database); while on, token text will be added as tokens are encountered, including for touches. I've modified the SQL backend to return empty strings for token text, BUT I HAVE NOT TESTED THIS CODE. This does not require a new format. I was considering using the format field (see FORMAT_FLAG near the bottom of DBM.pm) to indicate the presence of ttext in a packed token, but decided against it since that would preclude an additional compact token format (the format field is two bits). The code relies on DB_File to provide a packed string of the correct length. If it's longer than it should be, the excess will be interpreted as token text (but this will only impact parts that display token text retrieved from the database, such as dump and backup).
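The length-based detection the patch describes can be sketched as follows (Python for illustration; the 8-byte header layout is an assumption standing in for the real packed format in DBM.pm):

```python
import struct

HEADER = "!HHI"                       # spam count, ham count, atime (assumed layout)
HEADER_LEN = struct.calcsize(HEADER)  # 8 bytes

def unpack_record(value):
    """Split a packed value into its fixed fields and optional token text."""
    n_spam, n_ham, atime = struct.unpack_from(HEADER, value)
    # DB_File hands back the value at its stored length; anything beyond
    # the fixed-size header is taken to be token text.
    ttext = value[HEADER_LEN:].decode("utf-8") if len(value) > HEADER_LEN else None
    return n_spam, n_ham, atime, ttext
```

Because only the display paths (dump, backup) look past the header, readers that only need the counts are unaffected by whether text is present.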
Created attachment 1933 [details] Proposed implementation.
Subject: Re: Bayes option to keep original token as db data (not key). I'd prefer a "bayes_hash_tokens" option (default = turned on) over this that affected all storage methods: * on: hash tokens * off: no hashing or truncation of tokens Simple and easy to understand. Probably would require some hooks in some storage methods to change the format of the tokens, maybe a per-DB format flag. Code would not need to (and probably should not) support either/both formats.
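The either/or design suggested here might look like this (Python sketch; the option name comes from the comment above, and the 5-byte digest truncation is a guess):

```python
import hashlib

def db_key(token, bayes_hash_tokens=True):
    """Choose the Bayes db key: a fixed-size digest, or the plain token."""
    data = token.encode("utf-8")
    if bayes_hash_tokens:
        return hashlib.sha1(data).digest()[:5]   # compact, private, fixed width
    return data                                  # readable, variable width
```

The two key spaces are incompatible, so a database written with one setting cannot be read with the other, which is the trade-off discussed in the following comments.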
It sounds a bit simpler, but there would be no way to go from hashed to non-hashed tokens without starting with an empty database. BTW, what do you mean by truncation of tokens?
What about having a tool that a corpus could be run through which uses the existing code in SpamAssassin to split up the input into tokens, hash them, and then writes to a database of tokens keyed on the hash. As a standalone program with a separate database it would have no effect on performance, but it would be available to be run when someone wants to be able to analyze a Bayes db in terms of the original tokens. If there are a few extra entries, that doesn't hurt, and if a few uninteresting tokens are missing that probably doesn't hurt either. This way there are no problems with configuration, backup/restore, etc., and the tool would not have to be run often. If it is set up to add entries to a dictionary, someone could run it periodically on saved ham and spam to keep a dictionary built up, or they could maintain their research corpus and when they want to analyze Bayes results for it run the tool over the corpus to make a fresh database.
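Such a standalone dictionary tool could be as simple as this sketch (Python; the whitespace tokenizer and the 5-byte digest truncation are placeholders for whatever SpamAssassin actually uses, which the real tool would have to reuse so the hashes line up with the Bayes db):

```python
import hashlib
import re

def build_token_dictionary(messages, dictionary=None):
    """Build (or extend) a hash -> original-text map from a corpus."""
    if dictionary is None:
        dictionary = {}
    for msg in messages:
        for token in re.findall(r"\S+", msg):
            key = hashlib.sha1(token.encode("utf-8")).digest()[:5]
            # A few extra entries don't hurt, and a few missing ones
            # probably don't either -- this map is analysis-only.
            dictionary[key] = token
    return dictionary

d = build_token_dictionary(["free offer click now", "meeting at noon"])
```

Running this periodically over a saved corpus keeps the dictionary current without touching the live database or its performance.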
Subject: Re: Bayes option to keep original token as db data (not key). > It sounds a bit simpler, but there would be no way to go from hashed > to non-hashed tokens without starting with an empty database. Yes, pick one or the other. We have to maintain this code and it's already too complicated.
Subject: Re: Bayes option to keep original token as db data (not key). > What about having a tool that a corpus could be run through which uses > the existing code in SpamAssassin to split up the input into tokens, > hash them, and then writes to a database of tokens keyed on the hash. Way too complicated.
While I'm not convinced that we need to do this, I'm willing to concede that it's something we'll get asked about over and over again. I've got a possible design that will make everyone happy here, so I'm taking it.
BTW, not to be a pain -- but are we planning to do the approach listed at bug 2266 comment 17? That seems like a pretty good way to be able to list the tokens in one common situation, ie. when scoring a message, without having to keep the token text in the db.
Subject: Re: Bayes option to keep original token as db data (not key). On Mon, May 03, 2004 at 11:50:13AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > ------- Additional Comments From jm@jmason.org 2004-05-03 11:50 ------- > BTW, not to be a pain -- but are we planning to do the approach listed at bug > 2266 comment 17? > > That seems like a pretty good way to be able to list the tokens in one common > situation, ie. when scoring a message, without having to keep the token text in > the db. We're already doing that.
Subject: Re: Bayes option to keep original token as db data (not key). > We're already doing that. OK, missed that. sorry ;) --j.
An update on this bug: I've got about 90% of my proposed implementation done. Here are the basics: 1) Add a new config param bayes_ignore_raw_token, default is 0. 2) New db format version 4: DBM - this adds an optional A* to the token's packed value. SQL - this adds a raw_token column; you can omit the column if you set bayes_ignore_raw_token to 1. 3) In theory you can switch back and forth between ignoring and not ignoring the raw token value and everything will work. If you ignore for a period of time, when you stop ignoring the code will fill in any blank values as it updates a token (NOTE: it will only update when the spam/ham count changes, not when it is touched). Starting to ignore a previously unignored raw token won't remove it from the database; you would have to --backup/--restore to remove it totally. 4) --backup/--restore will behave correctly. The backup format will include the token value if you are not ignoring. Restore will parse the raw token value if available and insert it into the database if you are not ignoring. For SQL, advanced users can opt to not include the raw_token column in their database schema. With bayes_ignore_raw_token set to 1 the code will never refer to that column, so there should never be any sort of SQL error. However, if the user decides to stop ignoring then they will need to add that column back to the schema. Anyone game for trying to get this into 3.0.0? It would mean that we could possibly get away with not bumping the db version. It would also reduce the number of questions from folks about not seeing the actual token value via dump and what not.
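Point 3's toggle semantics might look roughly like this (Python sketch with a dict standing in for one row or packed value; the field names are illustrative, not the patch's actual layout):

```python
def update_token(record, d_spam, d_ham, now, raw_token, ignore_raw):
    """Update counts and opportunistically backfill the raw token text."""
    record["n_spam"] += d_spam
    record["n_ham"] += d_ham
    record["atime"] = now
    # After ignoring is switched off, blanks are filled in lazily as
    # counts change; switching ignoring back on leaves stored text alone
    # (a --backup/--restore cycle is needed to strip it completely).
    if not ignore_raw and not record.get("raw_token"):
        record["raw_token"] = raw_token
    return record
```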
Perhaps "bayes_keep_raw_tokens" would better indicate what it's doing (of course using 0 for don't keep).
Subject: Re: Bayes option to keep original token as db data (not key). On Wed, Aug 04, 2004 at 02:10:52PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > ------- Additional Comments From koppel@ece.lsu.edu 2004-08-04 14:10 ------- > Perhaps "bayes_keep_raw_tokens" would better indicate what it's doing (of course > using 0 for don't keep). > My plan was actually to store the raw tokens by default, then let advanced users, who want the speed/space benefits, disable it. I don't mind making the config param bayes_keep_raw_tokens, but I think it should default to 1. I think it will cut down a good bit on support-type questions: "Hey, why can't I see my token data?" Michael
My concern wasn't the default setting; it was that it would be unlikely that anyone would guess the meaning of "ignore" without reading the documentation, and for those who don't look at configuration files every day the meaning would be easy to forget.
Subject: Re: Bayes option to keep original token as db data (not key). Point taken, I'm ok with changing the name. Michael
Subject: Re: Bayes option to keep original token as db data (not key). bayes_(store|keep|save)_(original|raw)_tokens I think store > save >> keep and original >> raw.
I really like how all the effects of this addition disappear into no overhead if the option is set not to use it and the schema is set accordingly. I agree that the default should be set for low-volume users, who may want to see the tokens in order to better understand how Bayes works with their mail. If we do that, we should document, somewhere that is *very* hard to miss, that we have a set of suggestions for configuring SpamAssassin for high-volume sites, and write something up about all the tweaks that might be necessary for high volumes. A HighPerformance entry in the Wiki would be fine if there is a pointer to it. I'm +1 on getting this in 3.0. If we don't, then we have to deal with people wanting to get their Bayes token information back, and then deal again with letting them know that they have it again in 3.1 but that they have to change their schemas yet again. If we do it now we may get away with no schema change for 3.1.
Created attachment 2226 [details] Patch File This implements the feature pretty much as discussed. The config option has been changed to bayes_store_original_token and it defaults to on. Opted to keep the db version at 3, this does create some support problems for those who have 3.0.0 up and running with SQL, but easy to overcome. We will need a LARGE NOTICE in the release notes for the next pre/rc release.
Please review for inclusion in 3.0.0. I've still got some testing to do: 1) upgrade db from version 2 databases, 2) make sure DBM v3 dbs pick up the changes ok; I know SQL v3 dbs are gonna need an ALTER TABLE command. I don't anticipate any problems. Speed/size numbers are interesting: for DBM you get a 9% slowdown and a 96% increase in db size. For SQL the speedup is statistically insignificant; there is a size increase, however I haven't measured it yet. I'll report more numbers once I run some more benchmarks.
Subject: Re: [review] Bayes option to keep original token as db data (not key). Is there a limit on the size of original tokens? How much does this slow down DBM when the option is turned off vs. when the patch has not been applied? If you turn the option off, are original tokens removed over time? I assume when upgrading from version 2 that the original tokens will be stored in this field if the option is turned on. Did you try ignoring hapaxes (single count tokens) to reduce the size impact?
> Did you try ignoring hapaxes (single count tokens) > to reduce the size impact? That's an interesting possibility. Since the hash can be considered to uniquely identify the token, you could not store the token when creating a new entry, then put it in the second time the token appears. I think that should be an option if we do it, so people can choose the degree of tradeoff between completeness of information about the tokens and disk space: minimum space with no token information; medium space with information on non-hapaxes; maximum space with information on all tokens.
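The deferred-storage idea can be sketched like so (Python; again with a dict standing in for one record, and field names that are assumptions):

```python
def maybe_store_text(record, raw_token):
    """Store raw text only once a token stops being a hapax.

    Because the hash already identifies the token uniquely, nothing is
    lost by waiting: on the second sighting the same text comes around
    again and can be filled in then.
    """
    if record["n_spam"] + record["n_ham"] >= 2 and not record["raw_token"]:
        record["raw_token"] = raw_token
    return record
```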
BTW, one thing about hapaxes is that they have quite a large effect on the resulting score. (Weird but true.) This means that a dump without hapaxes may still not be so useful; they're not "insignificant", in other words.
Subject: Re: [review] Bayes option to keep original token as db data (not key). > BTW one thing about hapaxes is that they have quite a large effect on > the resulting score. (weird but true.) This means that a dump > without hapaxes may still not be so useful, since they're not > "insignificant" in other words. Urgh. Let's avoid options if at all possible. The sheer count of options is very oppressive to new users.
I agree about not having unnecessary options. _If_ we drop hapaxes it should be an option. Therefore I wouldn't want to see hapaxes dropped.
Subject: Re: [review] Bayes option to keep original token as db data (not key). On Sun, Aug 08, 2004 at 10:45:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > Is there a limit on the size of original tokens? > Not currently. I'm contemplating a limit, at least on the SQL side, of 128 or 200. Opinions? > How much does this slow down DBM when the option is turned off vs. when > the patch has not been applied? > Hmmm... anywhere from 0-7%; it was actually faster for the spamd phase. I'm inclined to run the test again; something seems fishy, but it's possible I've got some inner-loop if statement that is causing a problem. > If you turn the option off, are original tokens removed over time? > Yes, as tokens are updated they will be updated with blank raw_token values. > I assume when upgrading from version 2 that the original tokens will be > stored in this field if the option is turned on. > That's the idea; haven't tested this yet. > Did you try ignoring hapaxes (single count tokens) to reduce the size > impact? > I forgot about this. Very easy to do, just a matter of running some benchmarks. I'm still running benchmarks; each one takes 50 min to 1 hr to run. Three runs of each type of test times 3-4 tests per storage engine takes a little while. Michael
I'm in favor of a limit to protect against a possible DoS and also to provide a rational basis for choosing the size of the VARCHAR field in the db schema. There is no reason for using tokens larger than that field. If we did it would break the mapping between the tokens in the db and the hash. I don't care if it is 128 or 200.
Subject: Re: [review] Bayes option to keep original token as db data (not key). Just to make it clear what I'm thinking: I won't limit the token size until after the SHA1 hash has been computed. So there might be a largish token whose SHA1 hash matches the full original token exactly, but the data stored in the database will be shortened to 128 or 200 characters. Michael
> I won't limit the token size until > after the SHA1 hash has been computed No, I don't like that at all. It doesn't protect against a DoS in the easiest place to protect against it, and it leaves the database with entries that do not have the useful two-way mapping between token and hash. It would be much simpler to truncate tokens before hashing. There is not going to be any statistical difference between two tokens that have the same 128 or 200 byte prefix as far as their being spam or ham signs.
Subject: Re: [review] Bayes option to keep original token as db data (not key). On Mon, Aug 09, 2004 at 12:32:55PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Just to make it clear what I'm thinking, I won't limit the token size > until after the SHA1 hash has been computed. So there might be a > largish token whose SHA1 hash matches the full original token exactly, but > the data stored in the database will be shortened to 128 or 200 > characters. Yeah, that sounds like a good idea. The plaintext token is meant to be informational at this point, so I think even a small limit (32-96) is fine. Perhaps make truncation put "..." at the end or something to note that the token was truncated.
> The plaintext token is meant to be informational at this point, > so I think even limiting small (32-96) is fine. I'm ok with that as long as there is some truncation at the front end too, so that the token size is not open-ended. Umm, maybe that already is the case because of the way tokens are generated? Do any other limits on header size or line size in bodies or something else set a natural limit on token size? If the way we have things coded now means that super large tokens will not DoS the parsing of tokens or the computation of the hash, then I am fine with truncating at the backend: the "informational only" argument is convincing to me.
Subject: Re: [review] Bayes option to keep original token as db data (not key). > Perhaps make truncation put "..." at the end or something to note that > the token was truncated. That could be done at display time, but I think it will be quite clear that there's truncation by just looking at the values. No reason to put three bytes in the database. :-) Daniel
tokens are already stringently limited in size -- even lower than that. (16 bytes plus a prefix is the typical truncation point.) there's a little code to preserve really important data that we want to be able to dump out -- such as IP addresses of relays found, From addrs, To addrs, etc. so truncate to 128 bytes if you like, that should be loads of room.
In the relatively rare case of the field being exactly the maximum length, the code can tell if there was truncation by computing the SHA-1 hash and seeing if it matches the stored hash. In that case, displaying "..." after the maximum length would be unambiguously different from any real string that ends in "..." Given what Justin said about the token size being limited, I have no objections to this design, including truncating the stored tokens to 128 bytes and defining the field in the schema as VARCHAR(128).
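The display-time disambiguation described here is cheap to implement; a sketch (Python; the 5-byte digest truncation is an assumption):

```python
import hashlib

MAX_TEXT = 128  # stored tokens are truncated to this many bytes

def display_token(stored_text, stored_hash):
    """Add '...' only when a maximum-length field really was truncated.

    Only fields of exactly MAX_TEXT bytes are ambiguous; re-hashing the
    stored text settles it, since the stored hash was computed over the
    full, untruncated token.
    """
    if len(stored_text) == MAX_TEXT:
        if hashlib.sha1(stored_text.encode("utf-8")).digest()[:5] != stored_hash:
            return stored_text + "..."
    return stored_text
```

A real token that happens to end in "..." is still unambiguous, because the appended marker only ever appears on fields whose text fails the hash check.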
Subject: Re: [review] Bayes option to keep original token as db data (not key). I'm going to KISS, will truncate at 128 and won't put any sort of '...' indication unless someone starts screaming about it. Michael
Created attachment 2233 [details] Patch File Not the final patch, just some updates to speed things up.
Removing [review] for now since I'm making some slight changes to the patch.
Created attachment 2236 [details] Patch File
Ok, I'm happy with the patch now. I will say that the Bayes storage code is pretty well tuned and highly sensitive to even simple statements (ie if (foo)). This code causes an approximately 2-3% slowdown of the storage code for DBM. Still have a couple of tests to run, but it's ready to review.
+0.9: why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change? + (my $unpacked_token, $raw_token) = split(/\s+/, $token, 2); split (/ /, ...) would probably be faster, if you can make that assumption. ditto in the @@ -1594,14 change. ...BTW, you were asking what term to use for configuration settings; looks like you went for "option". worth noting that Apache uses "directive".
Subject: Re: [review] Bayes option to keep original token as db data (not key). > split (/ /, ...) would probably be faster, if you can make that assumption. > ditto in the @@ -1594,14 change. Maybe split (' ', ...) instead of that?
Subject: Re: [review] Bayes option to keep original token as db data (not key).

On Wed, Aug 11, 2004 at 01:52:00PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> +0.9: why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change?

The tok_touch code used to update the raw_token, so I had made the bindings dynamic; then I removed that since a) tok_touch is never actually called and it wasn't easily done in tok_touch_all, and b) it wasn't really necessary. However, I left the dynamic nature of the bindings in, just in case there was some future use. I could back it out if it bothers folks.

> + (my $unpacked_token, $raw_token) = split(/\s+/, $token, 2);
>
> split (/ /, ...) would probably be faster, if you can make that assumption.
> ditto in the @@ -1594,14 change.

The backup format has \t between the elements; however, I've seen some unpack(H*) tokens with spaces at the end, so we need to check for one or more whitespace characters.

Michael
Subject: Re: [review] Bayes option to keep original token as db data (not key).

> > split (/ /, ...) would probably be faster, if you can make that assumption.
> > ditto in the @@ -1594,14 change.
>
> Maybe split (' ', ...) instead of that?

no -- the speed profile is:

  split (' ', ...) is slower than
  split (/\s+/, ...) is slower than
  split (/\s/, ...) is slower than
  split (/ /, ...)

' ' has various additional semantics (like skipping whitespace at the start of the line and at EOL, iirc).

--j.
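The semantic difference Justin mentions is Perl-specific, but Python has a close analogue; a quick illustration (not SpamAssassin code, and the backup line content is made up) of splitting a backup line on runs of whitespace, limited to two fields as in the patch's split(/\s+/, $token, 2):

```python
import re

# A backup line: hex-encoded token, then the raw token (hypothetical format).
line = "4a6f65\tsome raw token"

# Perl's split(/\s+/, $token, 2): split on runs of whitespace, at most 2 fields.
unpacked, raw = re.split(r"\s+", line, maxsplit=1)
print(unpacked)  # 4a6f65
print(raw)       # some raw token

# Perl's split(' ', ...) additionally skips *leading* whitespace,
# like Python's no-argument str.split() -- the extra semantics that
# make it slower than a plain pattern:
print("  a b".split())               # ['a', 'b']       leading blanks dropped
print(re.split(r"\s+", "  a b"))     # ['', 'a', 'b']   leading blanks yield ''
```

The limit-of-2 matters because the raw token itself may contain whitespace; only the first whitespace run should separate fields.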
Subject: Re: [review] Bayes option to keep original token as db data (not key).

> > why the @bindings temp copy in the @@ -1033,9 +1036,10 @@ change?
>
> The tok_touch code used to update the raw_token so I had made bindings
> dynamic, then I removed that since a) tok_touch is never actually
> called and it wasn't easily done in tok_touch_all and b) wasn't really
> necessary. However, I left the dynamic nature of the bindings in,
> just in case there was some future use.
>
> I could back it out if it bothers folks.

No -- just wondering why a temporary copy was used in that one function. I can see that the other function that uses it does change the contents of @bindings, but this one just uses it as-is, like so:

  @@ -1025,6 +1026,8 @@
     return 0 unless (defined($self->{_dbh}));

  +  my @bindings;
  +
     # shortcut, will only update atime for the token if the atime is less than
     # what we are updating to
     my $sql = "UPDATE bayes_token
  @@ -1033,9 +1036,10 @@
              AND token = ?
              AND atime < ?";

  -  my $rows = $self->{_dbh}->do($sql, undef, $atime, $self->{_userid},
  -                               $token, $atime);
  +  @bindings = ($atime, $self->{_userid}, $token, $atime);
  +  my $rows = $self->{_dbh}->do($sql, undef, @bindings);
  +
     unless (defined($rows)) {
       dbg("bayes: tok_touch: SQL Error: ".$self->{_dbh}->errstr());
       return 0;

Note that the assignment to @bindings comes immediately before its only use in that fn, so it seems superfluous to use it there instead of just

  my $rows = $self->{_dbh}->do($sql, undef, $atime, $self->{_userid},
                               $token, $atime);

> > + (my $unpacked_token, $raw_token) = split(/\s+/, $token, 2);
> >
> > split (/ /, ...) would probably be faster, if you can make that assumption.
> > ditto in the @@ -1594,14 change.
>
> The backup format has \t between the elements, however I've seen some
> unpack(H*) tokens with spaces at the end, so we need to check for 1 or
> more whitespaces.

ok. shame ;)

--j.
Created attachment 2238 [details] Patch File Fixes a bug when upgrading from a 2.64 database, was ignoring the bayes_store_original_token setting. Also removes the extra @bindings change in tok_touch since it isn't necessary.
+1
Okay, I need to get this out. This is excellent code. I love the patch. I have major concerns about adding this option, though:

1. In the best case (turned off) it slows down DBM Bayes by adding 3.3% more time. I don't really care about the extra time and space when it is turned on. In many sites, it's not even possible for users to dump the Bayes database, so this is pure overhead. HAMMYTOKEN and SPAMMYTOKEN also work without it (although I should note I think we should drop those too).

2. It's another option to support.

3. I don't think it's needed. We had a few questions when we released pre1 and before that, but I've gotten used to not looking and nobody really needs to look.

4. It complicates the code.

If we were a commercial project, there would be no bloody way this would go into the code. Seriously, only nosy technical people care, and apparently not for too long, judging by the 3.0 experience so far.
I think leaving it out removes functionality that a lot of people *do* expect. We'd be pretty much the only bayes impl, apart from dspam and CRM-114 (the n-gram systems), that doesn't support dumping, in that case. However, I could compromise by setting it off by default, and noting clearly in the doco how to turn it on...
Subject: Re: [review] Bayes option to keep original token as db data (not key).

On Wed, Aug 11, 2004 at 11:21:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> I think leaving it out, removes functionality that a lot of people *do* expect.
> we'd be pretty much the only bayes impl apart from dspam and CRM-114 (the n-gram
> systems) that don't support dumping, in that case.

Well, that's not really true. There are tons of Bayes-style filters out there (how many MUAs have built-in Bayes-style filters?) which don't do dumping. If you compare us to other OSS projects, you could be right. However, I do agree with Quinlan in that normal users won't care. The more technical folks, or the folks who are more involved with things like running sa-learn, may care. For example, I was upset when the hashing first went in, but then I realized I have better things to do with my time than going through the tokens in my Bayes DB. The rest of the users on my box, however, haven't said anything, leading me to think they don't ever look at the dump output.

> However I could compromise by setting it off by default, and noting clearly in
> the doco how to turn it on...

The issue, though, is the performance penalty even with the setting off.
Created attachment 2240 [details] Data Collection Spreadsheet

Here is a spreadsheet from my benchmark runs. All numbers are in seconds. Let me know if you have any questions. A brief description of the benchmark:

Phase I: Learn 2000 ham and 2000 spam
Phase II: Scan 2000 ham and 2000 spam using spamd and parallel spamc processes
Phase III: Force expire
Phase IV: Forget the first 1000 ham and 1000 spam from Phase I
Phase V: Use spamassassin to scan 2000 ham and 2000 spam

BTW, disk space requirements basically double between the store and non-store cases; that was part of the spreadsheet, but I got tired of tracking it.
Created attachment 2241 [details] Data Collection Spreadsheet Fixed a couple of stupid cut and paste typos. Everything I said in the previous note still applies.
Created attachment 2245 [details] Data Collection Spreadsheet

I must have been half asleep when I labeled some of those columns last night. Please disregard the other spreadsheets; they are totally mislabeled.
Maybe it's just a bad idea to integrate this into SpamAssassin directly due to the 3-4% performance drop, perhaps a separate script/application which calls the tokenizer and can generate a mapping database is the right solution for the curious people out there.
A slowdown of 3-4% when all that has to be done is test a flag to skip over code makes no sense to me. Where is the overhead happening? Could that extra overhead be a bug or a design error?
Subject: Re: [review] Bayes option to keep original token as db data (not key). On Thu, Aug 12, 2004 at 03:22:26PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > ------- Additional Comments From sidney@sidney.com 2004-08-12 15:22 ------- > A slowdown of 3-4% when all that has to be done is test a flag to skip over code > makes no sense to me. Where is the overhead happening? Could that extra overhead > be a bug or a design error? No clue, if you can find it I'd love to know where. Michael
Subject: Re: [review] Bayes option to keep original token as db data (not key).

> > A slowdown of 3-4% when all that has to be done is test a flag to skip
> > over code makes no sense to me. Where is the overhead happening? Could
> > that extra overhead be a bug or a design error?
>
> No clue, if you can find it I'd love to know where.

I think it may be at a higher level -- passing a parameter down through 3 levels of method calls (?) does have an effect on runtime, esp. when the lowest level is called very frequently, as tok_pack()/tok_unpack() are.

--j.
> passing a parameter down through 3 levels of method calls (?)
> does have an effect on runtime, esp. when the lowest level
> is called very frequently, as tok_pack()/tok_unpack() are.

That's what I mean... If there is some test being repeated inside some very small inner loop that makes _that_ much difference to the runtime, I would think that it could be lifted out with minimal duplication of code. Even more important, this may point to some place that can be optimized to improve performance. What if it turns out that a tweak to the inner loop could give us a 3-4% _improvement_ over the existing code? Unfortunately I don't have the time right now to put into this. Can one of you more perl-wizardly folks who is handy with a profiler see if there is anything obvious to tweak?
Subject: Re: [review] Bayes option to keep original token as db data (not key).

On Thu, Aug 12, 2004 at 05:22:47PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> > passing a parameter down through 3 levels of method calls (?)
> > does have an effect on runtime, esp. when the lowest level
> > is called very frequently, as tok_pack()/tok_unpack() are.

So I haven't looked recently, but what am I missing here? tok_pack gets called with the raw data; it does the SHA1 and packs the raw data if necessary. The rest of the time, the raw token ought to be passed around internally anyway. This shouldn't be a big time suck.
Subject: Re: [review] Bayes option to keep original token as db data (not key). On Thu, Aug 12, 2004 at 05:22:47PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > so I haven't looked recently, but what am I missing here? tok_pack gets > called with the raw data, it does the SHA1 and the raw data if necessary. > the rest of the time, the raw token ought to be passed around internally > anyway. > > this shouldn't be a big time suck. > The SHA1 happens in tokenize, as the code stands today the tok_pack/tok_unpack methods have no idea what the raw token value is. The new code passes the raw_token down the various levels, along with the SHA1 token and when it gets to tok_pack packs it into the value portion with the SHA1 token as the key. Michael
Subject: Re: [review] Bayes option to keep original token as db data (not key). > I think it may be at a higher level -- passing a parameter down through 3 > levels of method calls (?) does have an effect on runtime, esp. when the > lowest level is called very frequently, as tok_pack()/tok_unpack() are. If this is performance critical and the bottom routine or two are fairly small, might it make more sense to duplicate the bottom routine or two without the extra parameter and decide which routine to call at a higher level? This would eliminate passing the parameter to the most-called routines, and possibly would move the test decision to some place less often than once a token. Loren
I just had a quick look at this thread, but if passing variables through functions is really that expensive in Perl, maybe we should work with a bunch of instance-global state-variables then? I'm no fan of global variables but if it really gives a performance improvement, they should be ok. And shouldn't have any drawbacks as one instance of the objects is used for a single message at a time only (I think).
Subject: Re: [review] Bayes option to keep original token as db data (not key).

It's all of the sanity checking. For instance, in tok_pack:

  +  # raw_token should be limited to 128 characters and should not be undefined
  +  $raw_token = (defined($raw_token) ? substr($raw_token, 0, 128) : '');

In tok_sync_counters:

  +  # if a $raw_token didn't get passed in, but there was one set keep the original
  +  # but only if we want to store the original token.  This allows raw token values
  +  # to filter out of the database when the option gets turned off.
  +  if (!$raw_token && $self->{bayes}->{conf}->{bayes_store_original_token}) {
  +    $raw_token = $orig_raw_token;
  +  }
  +

In tok_count_change:

  +  $raw_token = '' unless (defined($raw_token));
  +

Sometimes $raw_token is checked 3-4 times before it finally gets packed. It needs to be that way because of the spaghetti code in DBM.pm. If we backed up and cleaned up the internal methods a little, we might be able to make things better.

Michael
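The per-call guarding Michael quotes can be mimicked in a few lines to see why it adds up on a hot path (a Python sketch, not the real Perl; tok_pack_guarded is a hypothetical stand-in for tok_pack with the extra argument):

```python
import timeit

def tok_pack_plain(count):
    # hot-path pack with no extra argument handling
    return count & 0xFF

def tok_pack_guarded(count, raw_token=None):
    # the kind of per-call sanity checking the patch added:
    # default the optional argument, then truncate it defensively
    raw_token = "" if raw_token is None else raw_token[:128]
    return count & 0xFF, raw_token

# Called once per token, the guard is cheap; called hundreds of
# thousands of times per message corpus, it becomes measurable.
n = 200_000
plain = timeit.timeit(lambda: tok_pack_plain(3), number=n)
guarded = timeit.timeit(lambda: tok_pack_guarded(3), number=n)
print(f"plain   {plain:.3f}s")
print(f"guarded {guarded:.3f}s")
```

The exact ratio varies by machine, but the point stands: a check that is trivial in isolation still costs when it sits inside the most frequently called routine, which is why Loren's suggestion of duplicating the innermost routine and deciding once at a higher level would avoid it.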
Subject: Re: [review] Bayes option to keep original token as db data (not key).

-1 on the patch and feature.

Believe me, I don't really like doing it -- I spent a large amount of time coding it up and testing/benchmarking it all -- but I feel it needs to be done. Even after a call for comments on the -users list there was little comment and no one came forward with a compelling reason for keeping the token values around. The patch in its best case causes a 3-4% slowdown; for some things it's a 45-48% slowdown. Even after stripping out what I would call critical sanity-checking code there is a 1-2% slowdown, and I'm no longer confident in the data being stored in the database. Perhaps it's enough to have done the patch and placed it in this bug report; then when someone comes up with a reason we can refer them to the patch and they can apply it locally. Obviously it wouldn't be officially maintained, but that doesn't mean someone else couldn't maintain the patch and re-attach it as needed.

Michael
my commiserations. ;) I think we should at least release an RC build with the "opaque tokens" code as we currently have it in SVN, and see if any people find the lack of this feature to be really annoying... just to give the change a little more visibility. Perhaps point it out in the change notice. It really does appear that nobody misses it very much, and the benefits (speed) outweigh the loss (the feature), in which case we should leave this code out as you suggest. (BTW Dan's suggestion -- an external tool that can "fill in" an optional "side" database of hash->raw_token mappings for sa-learn --dump to display -- may still be an option in future. but I think that can be implemented in a non-invasive way.)
I agree with not putting this into 3.0. An addon to dump tokens and generate a dictionary is always an option as long as the original source of the email exists. That should take care of anyone who wants to research the Bayes stuff. The information is not lost just because it is not stored in the database, as long as the corpus it came from still exists, and anyone researching Bayes would need to keep the corpus anyway. I still think that the results of this attempt may point to some optimization that we can get when someone has the time to pursue it. Michael referred to the "spaghetti code" in DBM.pm as being in the way of removing parameters passed down through many levels and having to check them in many places. Perhaps this points to benefits from cleaning that up. Of course this would be for some version past 3.0.
Created attachment 2249 [details] Patch File

Different concept, thanks to Daniel and Justin. Added a plugin hook to Bayes.pm after each call to tokenize. It passes in the tokenize hashref, what function we're currently performing, and whether the message is spam or ham for the learn and forget functions.
Please review. I'm removing my veto. With this patch it is very easy to setup a plugin to save off the data you are interested in.
+1 -- like it ;)
if we're going to do a plugin (+1 on that idea btw), we should make it a little more generic imo. a few things come to mind immediately:

- bayes_tokens could just be bayes
- scan should wait until after the probability calculations to call the plugin. that way the plugin can get the ham/spam count, probability, msgatime, etc.
- we can pass in the msgid (it's generated)
- learn has msgatime as well (forget doesn't need it)
Subject: Re: [review] Bayes option to keep original token as db data (not key).

On Fri, Aug 13, 2004 at 07:45:03PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> if we're going to do a plugin (+1 on that idea btw), we should make it a little more generic imo.
>
> a few things come to mind immediately:
>
> - bayes_tokens could just be bayes
> - scan should wait until after the probability calculations to call the plugin. that way the plugin
>   can get the ham/spam count, probability, msgatime, etc.
> - we can pass in the msgid (it's generated)
> - learn has msgatime as well (forget doesn't need it)

I can go along with that. Should there be three plugin calls?

  bayes_scan
  bayes_learn
  bayes_forget

Since each one will pass in slightly different data. Or keep the "function" key that folks can use to do the right thing?

Michael
Subject: Re: [review] Bayes option to keep original token as db data (not key). On Fri, Aug 13, 2004 at 07:55:46PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > I can go along with that. Should there be three plugin calls? > bayes_scan > bayes_learn > bayes_forget > > Since each one will pass in slightly different data, or keep the > "function" key that folks can use to do the right thing? The three functions sound good. Potentially that function could be called 3 times in the same run (scanning an already learned message, autolearn tries to learn it the other way, needs to call forget ...) So it would be cleaner to separate them out imo. :)
Subject: Re: [review] Bayes option to keep original token as db data (not key). > The three functions sound good. Potentially that function could be > called 3 times in the same run (scanning an already learned message, > autolearn tries to learn it the other way, needs to call forget ...) > So it would be cleaner to separate them out imo. :) I'd also like there to be a way for a single "get the tokens" method to be easily called just once per message without tracking too much state in the plugin (whether we're learning, forgetting, or scanning).
I was just thinking, calling the plugin on dump() would be good too. perhaps something like:

  dump_start() - happens before data is given out
  dump_data()  - called per token output line. can perhaps return something to be appended to the dump output?
  dump_end()   - happens before dump returns
I agree that the 3 functions approach sounds good; having 1 API that receives different combinations of parameters depending on when it's called is a bad API design, in my opinion.

(Explanation: it's effectively 3 different APIs with different results and args depending on what method it's called from. By mixing them into one plugin API, you move the work of ensuring the right code is run from the API-design level to the user-implementation-code level; in terms of Rusty Russell's "interface simplicity spectrum" that's from level 5 to level 7. http://sourcefrog.net/weblog/software/aesthetics/interface-levels.html http://www.ozlabs.com/~rusty/ols-2003-keynote/ols-keynote-2003.html Plus, there could be cases where users don't *want* to know about scan-op token use, just learn-op token use. This allows them to override just one of the APIs and ignore the other.)

Also, regarding 'a single "get the tokens" method': either the plugin implementor would have to track the state, or Bayes.pm would have to track the same state. I think it's better to let the plugin implementor do it. However, here's a good way to avoid making it too hard -- add these APIs:

- bayes_start_message($isspam, $generatedmsgid) - called when a single message is about to be learned
- bayes_scan_tokens($toksref, $msgatime) - pass the list of tokens from that message, for a scan op (do we have $msgatime here?)
- bayes_learn_tokens($toksref, $msgatime) - pass the list of tokens from that message, for a learn op
- bayes_forget_tokens($toksref) - pass the list of tokens from that message, for a forget op ($msgatime is irrelevant)
- bayes_finish_message() - called when that single message learning op has been completed

The bayes_start_message() and bayes_finish_message() APIs allow the plugin to know when a new message is being operated on.
So the call order would be, for example:

  $plugin->bayes_start_message(...);
  $plugin->learn_tokens(...);        # dump the tokens
  $plugin->bayes_finish_message(...);

  $plugin->bayes_start_message(...);
  $plugin->forget_tokens(...);       # msg was previously learned; dump the tokens
  $plugin->learn_tokens(...);        # we've already dumped 'em, not again
  $plugin->bayes_finish_message(...);

So an implementor would just have to keep a single boolean that's set to 1 during a foo_tokens() call and set to 0 during bayes_start_message() -- then if foo_tokens() finds it already set to 1, it doesn't dump the tokens, because that's already been done once on that msg.

Theo: dump plugin APIs would be useful, good point -- maybe let dump_data() rewrite the line arbitrarily. Suggest something like this API:

  $line = dump_data({ origline => ..., prob => ..., ts => ..., th => ...,
                      atime => ..., encoded_tok => ... });

(in other words, give it the *unstringified* data as well, so that it doesn't have to parse it out of the line.) And then that's probably enough APIs for now ;)
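The implementor-side state Justin describes is small enough to sketch (a Python sketch rather than a real Mail::SpamAssassin::Plugin subclass; TokenDumpPlugin and its storage are hypothetical):

```python
class TokenDumpPlugin:
    """Dump a message's tokens at most once per message, whichever of the
    learn/forget callbacks fires first -- the single-boolean trick above."""

    def __init__(self):
        self.dumped = False
        self.dumps = []  # stand-in for whatever side database the plugin keeps

    def bayes_start_message(self, is_spam, msgid):
        self.dumped = False            # new message: allow one dump again

    def _dump_once(self, op, tokens):
        if not self.dumped:
            self.dumps.append((op, list(tokens)))
            self.dumped = True

    def bayes_learn_tokens(self, tokens, msgatime=None):
        self._dump_once("learn", tokens)

    def bayes_forget_tokens(self, tokens):
        self._dump_once("forget", tokens)

    def bayes_finish_message(self):
        pass

# Call order from the example: forget a previously learned msg, then relearn.
p = TokenDumpPlugin()
p.bayes_start_message(True, "msgid-1")
p.bayes_forget_tokens(["t1", "t2"])   # dumped here
p.bayes_learn_tokens(["t1", "t2"])    # already dumped for this msg; skipped
p.bayes_finish_message()
print(len(p.dumps))  # 1
```

This is the non-invasive "side database of raw tokens" idea: the plugin sees the tokens as they flow past and can record hash-to-raw mappings without touching the Bayes store itself.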
Created attachment 2252 [details] Patch File

Split the bayes_token call out into separate plugin calls for bayes_scan, bayes_forget, and bayes_learn. Needs slightly better docs for Plugin.pm (define atime and the format of toksref specifically). I'm +1 with updated docs; since I may be out of touch for the next couple of days, someone can feel free to pick up the patch and fix it.
ok, since this adds the basic APIs we need, and this is the last bug we need to fix before we can do an rc1, I suggest: 1. we vote and apply this patch 2. we fix up the documentation a little in C-T-R doco-fix mode afterwards. (in particular, the format of toksref needs to be documented in Plugin.pm.) my vote: +1
we may want to also note that if people want to get the raw/final token mappings, doing it in learn() is the way to go. anyway, per our discussion on irc, I'm +1 on 2252 for now, and we'll have to work more on bayes plugin api for 3.1 (ideas were floated about allowing a complete tokenizer plugin, etc.)
Created attachment 2254 [details] Patch File New patch with doco update. I'm still +1, feel free to commit if I'm not around.
+1 on 2254
+1 on the API, I don't have the time to review the patch itself.
+1 on the new patch anyway
Committed revision 36447.
Subject: Re: [review] Bayes option to keep original token as db data (not key). +1 looks good to me (waited for off-line benchmark results from Michael)