SA Bugzilla – Bug 3225
RFE: Bayes optimizations
Last modified: 2004-11-01 08:16:47 UTC
Discussions about optimizations for Bayes in the version 3.0 timeframe should be placed in comments on this ticket, which will also provide a central place to attach suggested patches, tests, and test results.
Created attachment 1874 [details] Patches for SHA1 hash as CHAR(5) and tok_get_all tests

The attached patches against SVN rev 9804 (1) replace the Bayes tokens with the low order 40 bits of their SHA-1 hash, allowing them to be stored as CHAR(5) BINARY; and (2) implement a tok_get_all method to be used instead of tok_get, so that all tokens in a message are selected with one query to the database. The new code is enabled by two constants in Bayes.pm:

  use constant HASH_TOKENS_FOR_DB => 1;
  use constant USE_TOK_GET_ALL_OPTIMIZATION => 1;

If HASH_TOKENS_FOR_DB is 1, the schema should specify that the token field in the bayes_token table is CHAR(5) BINARY. This patch defines a new function in SHA1.pm called sha1raw that returns Digest::SHA1::sha1(), i.e., the SHA-1 hash in binary. I did not bother to implement the slow pure-perl version, so testing with this requires Digest::SHA1 to be installed. If we are going to use this, someone will need to add the trivial code to call the slow perl version that we already have in SHA1.pm.

I did some testing using the truncated SHA-1 hash stored as a BIGINT instead of CHAR(5), but the small improvement in speed did not seem to justify the extra disk space, plus the extra code needed to get a 64-bit integer into and out of the database when perl doesn't have a native 64-bit int type.
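As a sketch of the hashing scheme described above (Python for illustration only; the actual patch is Perl, the function name here is hypothetical, and which five bytes count as "low order" is an implementation detail of the patch):

```python
import hashlib

def hash_token(token: bytes) -> bytes:
    """Map a Bayes token to the low-order 40 bits (5 bytes) of its
    SHA-1 digest, suitable for a CHAR(5) BINARY column."""
    digest = hashlib.sha1(token).digest()  # 20 raw bytes
    return digest[-5:]  # take the trailing 5 bytes as the "low order" end

# Every token, whatever its length, maps to a fixed-width 5-byte key,
# which keeps all records and keys fixed-length in the database.
assert len(hash_token(b"viagra")) == 5
```

The win is that long tokens no longer cost long variable-length keys; the cost is that the stored token is no longer human-readable.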
Note that the tok_get_all in the patch queries all of the tokens extracted from a message at once without checking if the resulting SELECT is too large for MySQL. I did not test with Michael Parker's suggestion of querying 25 at a time.
Subject: Re: RFE: Bayes optimizations

On Sun, Mar 28, 2004 at 05:37:14PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Note that the tok_get_all in the patch queries all of the tokens extracted from
> a message at once without checking if the resulting SELECT is too large for
> MySQL. I did not test with Michael Parker's suggestion of querying 25 at a time.

I'm currently finishing up a fairly large hunk of storage (sql and dbm) optimizations/changes. One is Sidney's tok_get_all with batching. Another is moving token_count and the newest/oldest token age values into the bayes_vars table (to avoid some table scans). It implements a cache to avoid having to go to the database for multiple items. I'm also removing some dead code and doing general cleanup. The changes span a fairly large range, and will require a schema change. I hope to have everything finished up this weekend, assuming I don't get pulled away for something else.

One thing of note: I've got what I consider a fairly decent benchmark script now that I've been using for my testing. Hopefully I can package it up for others to use as well. It makes comparing changes and different storage backends very easy.

Michael
Created attachment 1890 [details] Patch File

Here is a version that does several things:

1) Implements Sidney's tok_get_all method for SQL and DBM. Right now the SQL version will get the tokens from the DB in chunks (100, 50, 25, 5, 1), which needs to be benchmarked and tweaked based on what works best.
2) Removes several full table scans used to find the token_count and newest/oldest token atimes, by moving those values into the bayes_vars table.
3) Removes some code that is no longer called.
4) Adds some basic caching to avoid multiple lookups.

This patch changes the SQL database version, so you cannot use it without wiping your existing data and starting from scratch. (For the DB-savvy, it is possible to alter the bayes_vars table to add the new columns, populate them with the right values, and bump the db version, but I'll leave that as an exercise for the reader.) I'm hoping to get the backup/restore stuff done before checking this in, to help folks who are already using this do the upgrade without too much grief.

Performance-wise, my tests (via the benchmark) show a 2-3x speedup over the old code. Compared to DBM it is about twice as slow for sa-learn operations and statistically even for scanning. The IO requirements should be much smaller; my casual testing shows much lower IO for SQL than for DBM.

I'd love to hear some feedback from folks as to how well this works in your setup. Once I get some time I'd like to get folks using the benchmark I made and hopefully extending it (for instance, I'd love to start doing some concurrent sa-learns and scanning to see how we do there).
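The chunked fetch in item 1 can be sketched like this (hypothetical helper name; the real logic lives in the SQL BayesStore module, this just illustrates the largest-chunk-first strategy):

```python
# Chunk sizes tried largest-first, per the patch description.
CHUNK_SIZES = (100, 50, 25, 5, 1)

def chunk_tokens(tokens):
    """Split a message's token list into DB-fetch chunks, always using
    the largest chunk size that still fits the remaining tokens, so a
    small fixed set of prepared statements can cover any message."""
    chunks = []
    i = 0
    while i < len(tokens):
        size = next(s for s in CHUNK_SIZES if s <= len(tokens) - i)
        chunks.append(tokens[i:i + size])
        i += size
    return chunks

# An average 193-token message becomes 9 queries:
# chunks of 100, 50, 25, 5, 5, 5, 1, 1, 1.
```

Each chunk then maps onto one `SELECT ... WHERE token IN (...)` with a matching number of placeholders.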
Subject: Re: RFE: Bayes optimizations *grumble* I forgot to bump the version number in bayes_pg.sql and bayes_sqlite.sql, I'm not gonna bother with an updated patch at the moment. If you're using Postgres or SQLite you will need to update those by hand for the moment. Michael
Michael, I don't want you to forget the other optimizations:

1) Use an INT(11) for the username, using the UID instead of a variable-length character string.
2) Express atime as a two-byte int with a granularity of one day instead of a four-byte int with a granularity of one second. That not only cuts two bytes off the atime in every record, but it means that the record is updated for the time no more than once a day.
3) The more radical change: use a CHAR(5) from the SHA1 hash for the tokens. Cutting down the database size by that much has to make a big difference, not to mention the optimization of using all fixed-length records and keys.
Subject: Re: RFE: Bayes optimizations

On Thu, Apr 15, 2004 at 04:17:19PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Michael, I don't want you to forget the other optimizations:

I haven't forgotten; I was shooting for some low-hanging fruit and correcting some bad design decisions (hindsight is 20/20).

> 1) Use an INT(11) for username, using the UID instead of a variable character string

Pretty much done. The plan, since not everyone will have a system UID, is to create entries in bayes_vars per username and give them an id (via a sequence). This requires some additional work on DBs that don't easily support AUTO_INCREMENT, but shouldn't be too bad. I may also add a separate table and make the mapping query customizable, to help support folks who want to map IDs to an existing system or provide a way to limit who can use bayes (Bug 3215).

> 2) Express atime as a two byte int with granularity of one day instead of a 4
> byte int with granularity of a second. That not only cuts two bytes off of the
> atime in every record, but it means that the record is updated for the time no
> more than once a day.

I'm unsure on this one. Anyone else have any opinions?

> 3) The more radical change, use CHAR(5) from SHA1 hash for the tokens.

Close to done. Gonna have to encode this binary data on dump and backup and then decode it on restore, but I think I've figured out a nice way to handle that. It will be a bummer to lose the ability to see the raw token data, but I suppose if it makes things smaller/faster it will be worth it. Anyone know how well Berkeley DB handles binary keys?

I think we'll need to do a 10-fold cross-validation test after we put this change in, to make sure we didn't degrade our results.

Michael
Subject: Re: RFE: Bayes optimizations

>> 2) Express atime as a two byte int with granularity of one day instead
>> of a 4 byte int with granularity of a second. That not only cuts two
>> bytes off of the atime in every record, but it means that the record is
>> updated for the time no more than once a day.
>
> I'm unsure on this one. Anyone else have any opinions?

I'm really unsure it's a good idea -- we did find that the precision of the atime values was valuable for effective expiry. (The original rev of atime was a 2-byte int with a granularity of several hours.)

--j.
> we did find that the precision of the
> atime values was valuable for effective expiry.

I must be missing something. How is the time more than a rough guide for deciding which tokens to expire? Do tokens accumulate so quickly that expiration has to be performed more often than every two hours? What didn't work when you tried the two-hour granularity?
Subject: Re: RFE: Bayes optimizations

> > we did find that the precision of the atime values was valuable for
> > effective expiry.
>
> I must be missing something. How is the time more than a rough guide to
> deciding which tokens to expire? Do tokens accumulate so quickly that
> expiration has to be performed more often than every two hours? What
> didn't work when you tried the two-hour granularity?

We found that it adversely affected expiry: too many messages were combined into the same "bucket", hurting expiry's ability to find a good cut-off, to the point where it could fail to expire anything at all. Now that I think of it, though, we may have been using the *scan* time instead of the message's origination time (from Received hdrs); this may not be an issue with the latter... The best thing to do here is to test it out, using mass-check I think.

--j.
Created attachment 1917 [details] Patch File

I'm not 100% on this code yet. I think it's pretty much code complete, but I want to run some more tests, and it's been a while since I tested anything but MySQL with the changes. Also, I'm seeing some cases where the SQL backend is learning more tokens (like 25) than DBM, and it's causing a slight ripple effect in my benchmark. No idea why that is.

Here is my current work in progress. It makes the following changes to Bayes and the DBM/SQL storage modules; most of these apply to the SQL storage only, so unless I specify otherwise, assume SQL only:

1) Implements a tok_get_all that fetches all of a message's tokens from storage at one time. This was for the DBM and SQL modules. There is still some possible tweaking on the SQL side to get the optimum number of tokens with each fetch. Need to gather some stats on the average number of tokens per msg.
2) The SQL storage now keeps a running tally of the number of tokens, the newest token age, and the oldest token age, rather than doing a full table scan each time we need the information (practically every time we even look at the Bayes code). This added some overhead to the learn process (roughly 2 times slower), but scan time went down a whole bunch (5-6 times faster in my benchmark).
3) Slight short circuit in the expiry_due code: don't bother fetching variables from the database if bayes_auto_expire is false. This helps both storage modules.
4) On the SQL side we now clean up tokens whose spam_count and ham_count are both 0. Previously we just ignored them and cleaned them up at expire time. This slows down processing on forgets because we have to do the deletes. There is some room for improvement in this code; see the note in the source.
5) Cache some values from the database (ie db_version). There is probably a lot more caching that we could do, but when I went down that road performance went down, so I backed it out in favor of KISS.
I'm certain that future work in the caching area can improve the speed of the code.

6) Dead code removed. I think there is more dead code that can be ripped out (ie tok_get), but I wanted to make sure first.
7) We now store all user data by an integer userid. All columns whose key included username have been switched over to id. The bayes_vars table is the map from username to id. NOTE: this pretty much requires an AUTO_INCREMENT/serial column in the DB, which is doable via triggers and whatnot, but requires more work for the administrator. Everything should in theory work out of the box for MySQL and PostgreSQL.
8) Dropped SQLite table defs from the sql dir. Honestly, I do not believe folks are going to use SQLite for this task, and it just turns into a maintenance nightmare to have to update and test the table defs each time we make a change.
9) HERE IS THE BIG CHANGE: All tokens are now stored as the lower 40 bits (ie CHAR(5)) of a binary SHA1 hash. This is for the DBM and SQL storage modules. Disk space savings for DBM should be minimal; my average token size was ~9 chars, so it saves roughly 4 chars per token. My actual benchmark saw a 7% space savings for DBM. However, on the SQL side, along with the switch from username to userid, I saw a 40% disk space savings. This also allowed for a slight 10% or so speedup in msg scan times for SQL.
10) The database version for DBM and SQL databases has been changed to version 3. DBM databases should automatically upgrade the first time they are opened for write; a quick little sa-learn --sync should handle that. Because the on-disk table structure has to change for SQL, an automatic upgrade is not possible there. You'll have to wipe out and re-learn your data to upgrade to the new version, at least until the backup/restore functionality gets finished.

Feel free to test at will. No promises on whether it will wipe out your data, and all the usual disclaimers apply. I'd love to hear any feedback.
> we may have been using the *scan* time
> instead of the message's origination time

Is there a problem when expiring on a specific time gets rid of too many tokens and expiring on that atime + 1 gets rid of too few? Can we just limit the number of items to be expired and select the ones to purge at random if the atime can't come up with a good number?

Moving the atime discussion back here from sa-dev: Someone suggested dropping high bits. The problem with that is that it implies the origin of the atime units (i.e. what time is represented by atime==0) has to be changed to be relative to the times in the db.

How about this: make the time units and the origin configuration options. The size of atime would be defined in the schema. The current behavior would be a granularity of 1 second, an origin of Jan 1, 1970, and a schema that makes atime a 32-bit unsigned int, which will last for another 33 years or so. Setting the granularity to 2 hours and an origin of Jan 1, 2004 with a 16-bit unsigned int will allow for 15 years before anything has to be done.

Now here is an interesting one: a granularity of 8 minutes using 16-bit ints will overflow after about one year. If it is acceptable to go through the entire db once a year and adjust all atimes with a new origin that makes the oldest equal to 0, that would work. If we have a "super-expiry" operation that resets the origin of all the atimes, we could leave it up to configuration options to specify atime scale, origin, word size, and frequency of super-expiry. That would allow a lower-volume site to configure for 2-byte atimes that only have to be checked once a day, while a very high-volume site with a global database could check more often and decide to either configure for a larger db that uses 32-bit ints or run a "super-expiry" more often. I don't see much extra cost for the extra flexibility. Comments anyone?
I said:
> Make the time units and the origin configuration options

Correction: I meant that the origin would be dynamic, stored in the database, not just a configuration option. It has to be adjusted by the "super-expiry" operation.
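A minimal sketch of the scale/origin idea above, assuming a 2-hour granularity and a db-stored origin (all names and constant values here are hypothetical examples, not anything in the patches):

```python
SECONDS_PER_UNIT = 7200        # 2-hour granularity (a configuration option)
ORIGIN = 1072915200            # Jan 1 2004 UTC; per the correction above, this
                               # would live in the db, reset by "super-expiry"

def encode_atime(unix_time: int) -> int:
    """Compress a 32-bit unix time into a 16-bit scaled offset."""
    units = (unix_time - ORIGIN) // SECONDS_PER_UNIT
    if not 0 <= units <= 0xFFFF:
        raise OverflowError("atime out of range; super-expiry must reset ORIGIN")
    return units

def decode_atime(units: int) -> int:
    """Recover the unix time, rounded down to the granularity."""
    return ORIGIN + units * SECONDS_PER_UNIT
```

With these example values the 16-bit field covers 65536 * 2 hours, i.e. roughly 15 years, matching the arithmetic in the proposal.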
Subject: Re: RFE: Bayes optimizations How much does going from 4 bytes to 3 or 2 really buy us? 4 bytes is easy, hard to screw up, less code, and doesn't take that much space, does it? Daniel
I guess that staying with 32-bit ints would not make a whole lot of difference in the db size, but at least at sonic.net the bottleneck with Bayes was I/O resource usage. Much of that was from updating the atime fields of every token that was selected to be used in a Bayes calculation of a message. I will concede that the bigger factor there is update frequency, which has to do with the granularity of atime, not its size. So we could stick to the 32-bit standard unix time format and be safe until 2038. If we could have a configurable scaling factor for atime, then it would control the granularity very simply.

On the other hand, if we had an option for the origin of atime as well as one for the scale, then sites with suitable volume could configure something that would work with 16-bit ints, adjust the schema, and save a megabyte or four per user. For some places it would be worth it. This would not have to have a "super-expiry" if it is ok to require that the options be set so that the db will be able to handle atimes for the foreseeable future. Without super-expiry, I think it still is not much code and is hard to screw up.
What about leaving atime in 32-bit unix time format but having a configuration option to set the granularity used when deciding whether a token that has been used needs to have its atime updated? That would help keep down the I/O requirements without requiring the database to be rebuilt if someone wants to change the value of the option. The code change for this would be smaller. It wouldn't save any space, but it would help the I/O usage.
Subject: Re: RFE: Bayes optimizations

> What about leaving atime in 32 bit unix time format but having a
> configuration option to set the granularity used when deciding if a
> token that has been used needs to have its atime updated?

How about an only-do-it-once-in-N-times factor? Perhaps one of these (depending on what's easiest and works best):

- update every N times
- update if !(current_time % N)
- update if rand(N) < 1
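The three once-in-N variants suggested above, sketched in Python (N is an arbitrary example value; the function names are mine, not anything in SpamAssassin):

```python
import random

N = 8  # hypothetical throttle factor

def update_every_n(update_counter: int) -> bool:
    # Deterministic: every Nth update touches the atime.
    # Needs a counter kept somewhere, e.g. per-process.
    return update_counter % N == 0

def update_time_bucket(current_time: int) -> bool:
    # !(current_time % N): stateless, but updates cluster into
    # the moments when the clock happens to divide evenly by N.
    return current_time % N == 0

def update_random() -> bool:
    # rand(N) < 1: stateless, each update independently has a
    # 1-in-N chance of going through.
    return random.uniform(0, N) < 1
```

All three cut atime writes by roughly a factor of N; they differ only in whether state is needed and how the skipped updates are distributed.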
Subject: Re: RFE: Bayes optimizations

OK, I took a look to remind me ;)

http://svn.apache.org/viewcvs.cgi/incubator/spamassassin/trunk/lib/Mail/SpamAssassin/BayesStore.pm?root=Apache-SVN

Note that rev 5112, May 9 2003, was when atime was changed to refer to the message received time: 'bug 1821: we decided on a new bayes format again. this time it's token atime is based on message received time. this brought up a whole bunch of issues, but mostly required new pack/unpack, upgrade_db, expire, and a bunch of other changes. while I was milling about I fixed a few other random bugs. I don't quite remember what they are right now, so just assume everything's peachy.'

Rev 5038 -- 'bugs 1523, 1666, 1775, and 1776: set a new bayes db format for 2.6x, version 1 (2.5x had version 0). atimes are now unsigned ints (32-bit) instead of a short (16-bit) so we won't get overflow/expire issues. code is now included to automatically upgrade to v1 when the db is opened for writing. v0 is still supported for readonly mode. bayes tokens are now packed in little-endian (VAX) format. this means that the same DB will work on different endian platforms.'

rev 4722 -- 'fix for bug 1625. we only touch the token when syncing the journal, and the journal only has updates from scan() whereas tokens are automatically updated from learn(), so we will likely end up lowering the atime of the token when we sync. doh. :('

So I think this means we *can* use a 2-byte atime format safely, since the problems we had with that before were *definitely* caused by using message-scan and message-learn time as the atime, instead of message-received time.

--j.
> How about a only do it once in N times factor? Yes, that's what I meant when I said "granularity" while leaving the format the same
Subject: Re: RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 06:54:20PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> So I think this means we *can* use a 2-byte atime format safely, since the
> problems we had with that before were *definitely* caused by using
> message-scan and message-learn time as the atime, instead of
> message-received time.

IIRC, my opinion on this was: yeah, we could save a byte or two per token, but I'd rather have the full 32-bit time value 1) in case we want to change the expire algorithm again, 2) in case we want to allow people to change the granularity of expiry, and 3) because it requires less coding, less complexity, less testing, less run-time resources to execute said code, etc. Oh, and if we wanted to change any of those, we'd pretty much have to upgrade the database data. Keeping it as the full raw 32-bit value gives us the most flexibility while not requiring a lot of resources. Doing something like hashing the tokens would save more space on average than doing something to the atime value anyway.

<sort of a rant>
I don't think shaving off a byte or two would really buy any I/O time anyway. IIRC from testing, the majority of the I/O usage was reading the data from disk. Since our I/O usage is relatively random, we're not able to use caching/prefetching, so the time is from having to read whole disk blocks to get our data. Disk blocks are typically 512 bytes in size. Since our data is smaller than that, we'll always have to read the whole block anyway, so there's no savings for us in having (for example) 30 bytes of data instead of 32 bytes of data.
</rant>
Subject: Re: RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 04:59:41PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> resource usage. Much of that was from updating the atime fields of every token
> that was selected to be used in a Bayes calculation of a message.

I can buy that to some degree. It's the same reason you turn off atime updating on filesystems for things like proxy server and news server storage. However, as with news and proxy servers, they'll typically do expiry with data from the files themselves -- which for us _is_ the atime update. Do we have a real breakdown as to where our main I/O usage comes from?

> I will concede that the bigger factor there is update frequency, which has to do
> with the granularity of atime, not its size. So we could stick to 32 bit
> standard unix time format and be safe until 2038. If we could have a
> configurable scaling factor for atime then it would control the granularity very
> simply.

Well, we already have that with the current code. atime updates go into the journal, which is "light" I/O usage, then periodically the journal gets synced to the DB. The code that reads in the journal caches atime updates, and then does 1 update per token instead of multiple updates per token.
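The journal coalescing described above (one update per token on sync) amounts to keeping only the newest atime seen per token; a sketch with a hypothetical helper, not the actual BayesStore.pm code:

```python
def coalesce_journal(entries):
    """Collapse journal atime entries so each token gets exactly one
    db update on sync. entries is an iterable of (token, atime) pairs
    in journal order; only the newest atime per token survives."""
    latest = {}
    for token, atime in entries:
        # Keep the maximum, never move an atime backwards
        # (the problem fixed in rev 4722).
        if atime > latest.get(token, 0):
            latest[token] = atime
    return latest
```

The max-not-last rule matters because journal entries can arrive out of order relative to learn-time updates.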
Subject: Re: RFE: Bayes optimizations

> to either configure for a larger db that uses 32 bit ints or run a
> "super-expiry" more often.
>
> Comments anyone?

My only thought is that without careful or clever design, the super-expiry probably has to hit the entire db, or darn near, and this could result in a huge spike in disk/processor activity at some arbitrary time. If there were some reasonable/feasible way to spread that load out over some amount of time, it might be less of a system hit.

I think there might be a reasonable way to do this if the db had two timestamps representing times N and N+1, where +1 is the current time field modulo. You could use one bit of a timestamp field to indicate which base it was relative to. If the actual expiration is < the +1 amount of time, you should be able to do expiration pretty much at your leisure, as long as one complete pass through the db could be done in the rollover time period.

Then again, maybe it only takes a few seconds to run the whole super-expiry, and the hit on the db access times and processor utilization isn't worth considering.

Loren
Subject: Re: RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 08:10:03PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Well, we already have that with the current code. atime updates go into
> the journal which is "light" I/O usage, then periodically the journal
> gets synced to the DB. The code that reads in the journal caches atime
> updates, and then does 1 update per token instead of multiple updates
> per token.

The SQL code doesn't make use of a journal. I've contemplated a journal table that we could do fast inserts into and treat just like the journal file. However, it just hasn't been necessary, and it overly complicates things, so I've left it at KISS. Another option could be a mass update, tok_touch_all. The only problem there is that the SQL code makes sure atimes are only updated when the new value is greater than the existing atime, which wouldn't work well with a multi-row update.
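The guard Michael describes (only ever bump an atime forward) is just a predicate in the UPDATE's WHERE clause; here is a sketch using SQLite with hypothetical column names loosely based on the schema discussed in this bug:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token (id INTEGER, token BLOB, atime INTEGER)")
conn.executemany("INSERT INTO bayes_token VALUES (1, ?, ?)",
                 [(b"aaaaa", 100), (b"bbbbb", 500)])

def tok_touch(conn, uid, token, atime):
    # The "atime < ?" guard ensures we never move an atime backwards;
    # it is exactly this per-row condition that a single blind
    # multi-row update (tok_touch_all) cannot express.
    conn.execute(
        "UPDATE bayes_token SET atime = ? "
        "WHERE id = ? AND token = ? AND atime < ?",
        (atime, uid, token, atime))

tok_touch(conn, 1, b"aaaaa", 300)  # bumps 100 forward to 300
tok_touch(conn, 1, b"bbbbb", 300)  # leaves 500 untouched
```

A mass update would either need the same guard (updating only the subset of rows where it holds) or risk lowering atimes for tokens already touched by a later learn.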
Taking
Committed in r10394.
From Michael Parker 2004-04-12 15:00:
> 1) Implements Sidney's tok_get_all method for SQL and DBM. Right now the SQL
> version will get the tokens from the DB in chunks (100, 50, 25, 5, 1) which
> needs to be benchmarked and tweaked based on what works the best.

For MySQL at least, is there a reason for attempting to cache bayes token queries? Every time the atime of a token is updated, the cache for the entire bayes_token table is cleared, so queries are very rarely actually served from cache. Since these token queries don't benefit from the SQL server cache, there's no point in caching them (they'll be cleared anyway) and no need to worry about blowing away the cache (the reason behind the bunches, I believe).

I've recently timed token queries for about 4600 messages as they were received by my mail server. Replacing the fixed bunch sizes with a while loop that queries up to 100 tokens at a time (I didn't want to exceed any maximum query lengths) significantly decreased the time token queries take for an average message (average of 193 tokens), by about 57%.

Using current bunches:
  2295 messages
  194 tokens per message average
  1.227 seconds per message
  0.00630 seconds per token

Using loop, up to 100 tokens at a time:
  2302 messages
  192 tokens per message average
  0.511 seconds per message
  0.00266 seconds per token

Using the loop, the cache is also cleared on every atime update, as above with bunches. SQL_NO_CACHE could be inserted into the statement to avoid the overhead of the unused cache insertion. I'd imagine other SQL servers would behave similarly, but I'm not familiar with how other servers (Oracle, Postgres, etc) handle caching, specifically what causes a table's cache to be cleared.
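The loop replacement described above might look like this sketch (placeholder-style SQL; the function name and column list are hypothetical, modeled on the schema discussed in this bug):

```python
MAX_TOKENS_PER_QUERY = 100

def token_select_batches(tokens):
    """Yield (sql, params) pairs that fetch up to 100 tokens per SELECT,
    replacing the fixed (100, 50, 25, 5, 1) bunch sizes with a simple loop."""
    for i in range(0, len(tokens), MAX_TOKENS_PER_QUERY):
        batch = tokens[i:i + MAX_TOKENS_PER_QUERY]
        placeholders = ",".join("?" * len(batch))
        # On MySQL, SQL_NO_CACHE could be added after SELECT, since these
        # rows are evicted from the query cache on every atime update anyway.
        sql = ("SELECT token, spam_count, ham_count, atime "
               f"FROM bayes_token WHERE id = ? AND token IN ({placeholders})")
        yield sql, batch
```

For the average 193-token message this issues 2 queries (100 + 93) instead of the 9 the fixed bunch sizes produce, which is consistent with the timing improvement reported above.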
Subject: Re: RFE: Bayes optimizations You have a patch? I'll try it out. There are other reasons for the bunches, not just query cache, in fact very little to do with query cache. Michael
Created attachment 2502 [details] tok_get_all without bunches Patch leaves unused code (the single_sql blocks) in place for testing.
Subject: Re: RFE: Bayes optimizations Indeed, after a little analysis it looks like we might come out ahead (by reducing the number of queries by half or more) doing dynamic versus looping over static queries. However, this bug is not the appropriate place for the discussion. So, if needed I'll open up another bug, but count on something that generates more dynamic queries going into the code soon (as soon as I can code it up and benchmark it).
Subject: Re: RFE: Bayes optimizations

Interestingly enough, I don't see this same speedup. I see about 1% in one part of my benchmark (scanning via spamc/spamd) and a little less than 4% in another (scanning with spamassassin). Increasing the loop size to 200 helps in the spamc/spamd case but hurts in the spamassassin case. My average token count per message is 197. I think more statistical analysis is needed here. I'm going to go ahead and open another bug to track this.

Michael