Bug 3225 - RFE: Bayes optimizations
Summary: RFE: Bayes optimizations
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other
OS: Other
Importance: P5 enhancement
Target Milestone: 3.0.0
Assignee: Michael Parker
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 3208
Reported: 2004-03-28 17:11 UTC by Sidney Markowitz
Modified: 2004-11-01 08:16 UTC
CC: 2 users



Attachments:
  Patches for SHA1 hash as CHAR(5) and tok_get_all tests (patch) - Sidney Markowitz [HasCLA]
  Patch File (patch) - Michael Parker [HasCLA]
  Patch File (patch) - Michael Parker [HasCLA]
  tok_get_all without bunches (patch) - Daryl C. W. O'Shea [HasCLA]

Description Sidney Markowitz 2004-03-28 17:11:52 UTC
Discussions about optimizations for Bayes in the version 3.0 timeframe should be
placed in comments on this ticket, which will also provide a central place to
attach suggested patches, tests, and test results.
Comment 1 Sidney Markowitz 2004-03-28 17:34:50 UTC
Created attachment 1874 [details]
Patches for SHA1 hash as CHAR(5)  and tok_get_all tests

The attached patches against SVN rev 9804

(1) replace the Bayes tokens with the low order 40 bits of their SHA-1,
allowing them to be stored as CHAR(5) BINARY;

(2) implement a tok_get_all method to be used instead of tok_get so that all
tokens in a message are selected with one query to the database.
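
For illustration, a minimal sketch of the single-query approach, assuming a DBI
handle and a table layout that is not necessarily the one in the attached patch:

  use strict;
  use warnings;
  use DBI;

  # Hedged sketch: fetch every token of a message in one SELECT.
  sub tok_get_all_sketch {
    my ($dbh, $id, @tokens) = @_;
    return [] unless @tokens;
    my $placeholders = join(',', ('?') x @tokens);
    my $sth = $dbh->prepare(
      "SELECT token, spam_count, ham_count, atime
         FROM bayes_token
        WHERE id = ? AND token IN ($placeholders)");
    $sth->execute($id, @tokens);
    return $sth->fetchall_arrayref();   # one row per token found
  }

One caveat, noted below: a very long token list can exceed the server's maximum
query size.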

The new code is enabled by two constants in Bayes.pm

 use constant HASH_TOKENS_FOR_DB => 1;
 use constant USE_TOK_GET_ALL_OPTIMIZATION => 1;

If HASH_TOKENS_FOR_DB is 1 the schema should specify that the token field in
the bayes_token table is CHAR(5) BINARY.

This patch defines a new SHA1 function in SHA1.pm called sha1raw that returns
Digest::SHA1::sha1(), i.e., the SHA-1 hash in binary. I did not bother to
implement the slow Perl version, so testing with this requires Digest::SHA1 to
be installed. If we are going to use this, someone will need to add the trivial
code to call the slow Perl version that we already have in SHA1.pm.
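
A hedged sketch of the hashing step, assuming Digest::SHA1 is available;
exactly which five bytes count as the "low order" 40 bits is an assumption
here, not a statement about the patch:

  use strict;
  use warnings;
  use Digest::SHA1 qw(sha1);

  # Raw 20-byte binary SHA-1 digest of a token.
  sub sha1raw { return sha1($_[0]); }

  # Keep only 5 bytes (40 bits) so the hashed token fits CHAR(5) BINARY.
  sub hash_token_sketch {
    my ($token) = @_;
    return substr(sha1raw($token), -5, 5);
  }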

I did some testing using the truncated SHA-1 hash stored as a BIGINT instead of
CHAR(5), but the small improvement in speed did not seem to justify the extra
disk space plus extra code needed to get a 64 bit integer into and out of the
database when perl doesn't have a 64 bit int type.
Comment 2 Sidney Markowitz 2004-03-28 17:37:13 UTC
Note that the tok_get_all in the patch queries all of the tokens extracted from
a message at once without checking if the resulting SELECT is too large for
MySQL. I did not test with Michael Parker's suggestion of querying 25 at a time.
Comment 3 Michael Parker 2004-04-02 17:20:58 UTC
Subject: Re:  RFE: Bayes optimizations

On Sun, Mar 28, 2004 at 05:37:14PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> ------- Additional Comments From sidney@sidney.com  2004-03-28 17:37 -------
> Note that the tok_get_all in the patch queries all of the tokens extracted from
> a message at once without checking if the resulting SELECT is too large for
> MySQL. I did not test with Michael Parker's suggestion of querying 25 at a time.
> 

I'm currently finishing up a fairly large hunk of storage (SQL and
DBM) optimizations/changes.  One is Sidney's tok_get_all with
batching. Another is moving the token_count and newest/oldest token
age values into the bayes_vars table (to avoid some table scans).  The
patch also implements a cache to avoid having to go to the database
for multiple items, removes some dead code, and does general cleanup.

They span a fairly large range, and will require a schema change.  I
hope to have everything finished up this weekend, assuming I don't get
pulled away for something else.

One thing of note: I've got what I consider a fairly decent benchmark
script now that I've been using for my testing. Hopefully I can
package it up for others to use as well.  It makes comparing changes
and different storage backends very easy.

Michael

Comment 4 Michael Parker 2004-04-12 15:00:14 UTC
Created attachment 1890 [details]
Patch File

Here is a version that does several things:

1) Implements Sidney's tok_get_all method for SQL and DBM.  Right now the SQL
version will get the tokens from the DB in chunks (100, 50, 25, 5, 1) which
needs to be benchmarked and tweaked based on what works the best.

2) Removes several full table scans to find the token_count and newest/oldest
token atimes by moving those values into the bayes_vars table.

3) Removes some code that is no longer called.

4) Adds some basic caching to avoid multiple lookups.

This patch does change the SQL database version, so you cannot use it without
wiping your existing data and starting from scratch (for the DB-savvy it is
possible to alter the bayes_vars table to add the new columns, populate them
with the right values, and bump the db version, but I'll leave that as an
exercise for the reader).  I'm hoping to get the backup/restore stuff done
before checking this in, to help folks who are already using this do the
upgrade without too much grief.

Performance-wise, my tests (via the benchmark) show a 2-3 times speedup over
the old code.  Compared to DBM it is about twice as slow for sa-learn
operations and statistically even for scanning.  The I/O requirements should be
much smaller; in my casual testing they were much lower for SQL than for DBM.

I'd love to hear some feedback from folks as to how well this works in your
setup.  Once I get some time I'd like to get folks using the benchmark I made
and hopefully extending it (for instance, I'd love to start doing some
concurrent sa-learns and scans to see how we do there).
Comment 5 Michael Parker 2004-04-12 15:06:28 UTC
Subject: Re:  RFE: Bayes optimizations

*grumble*

I forgot to bump the version number in bayes_pg.sql and
bayes_sqlite.sql; I'm not going to bother with an updated patch at the
moment.  If you're using Postgres or SQLite you will need to update
those by hand for now.

Michael

Comment 6 Sidney Markowitz 2004-04-15 16:17:18 UTC
Michael, I don't want you to forget the other optimizations:

1) Use an INT(11) for username, using the UID instead of a variable character string

2) Express atime as a two byte int with granularity of one day instead of a 4
byte int with granularity of a second. That not only cuts two bytes off of the
atime in every record, but it means that the record is updated for the time no
more than once a day.

3) The more radical change, use CHAR(5) from SHA1 hash for the tokens.

Cutting down the database size by that much has to make a big difference, not to
mention the optimization of using all fixed length records and keys.
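
To make option 2 concrete, a rough sketch of a day-granularity, two-byte atime;
the origin constant and helper names are purely illustrative:

  use strict;
  use warnings;

  use constant ATIME_ORIGIN => 1072915200;   # assumed origin: 2004-01-01 UTC

  # Epoch seconds -> days since origin, packed as an unsigned 16-bit value.
  sub atime_pack   { pack('n', int(($_[0] - ATIME_ORIGIN) / 86400)) }
  sub atime_unpack { ATIME_ORIGIN + unpack('n', $_[0]) * 86400 }

A record would then only need an atime write when the stored day changes, i.e.
at most once per day per token.
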
Comment 7 Michael Parker 2004-04-16 13:03:14 UTC
Subject: Re:  RFE: Bayes optimizations

On Thu, Apr 15, 2004 at 04:17:19PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From sidney@sidney.com  2004-04-15 16:17 -------
> Michael, I don't want you to forget the other optimizations:

I haven't forgotten; I was shooting for some low-hanging fruit and
correcting some bad design decisions (hindsight is 20/20).

> 1) Use an INT(11) for username, using the UID instead of a variable character string
> 

Pretty much done.  The plan, since not everyone will have a system
UID, is to create entries in bayes_vars per username and give them an
id (via sequence).  This requires some additional work on DBs that
don't easily support AUTO_INCREMENT but shouldn't be too bad.
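
A hedged sketch of that lookup, assuming MySQL-style AUTO_INCREMENT and column
names that may not match the final schema:

  use strict;
  use warnings;
  use DBI;

  # Return the integer id for a username, creating a bayes_vars row if needed.
  sub lookup_or_create_userid {
    my ($dbh, $username) = @_;
    my ($id) = $dbh->selectrow_array(
      "SELECT id FROM bayes_vars WHERE username = ?", undef, $username);
    return $id if defined $id;
    $dbh->do("INSERT INTO bayes_vars (username) VALUES (?)", undef, $username);
    return $dbh->last_insert_id(undef, undef, 'bayes_vars', 'id');
  }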

I may also add a separate table and make the mapping query
customizable, to help support folks who want to map IDs to an existing
system or provide a way to limit who can use bayes (Bug 3215).

> 2) Express atime as a two byte int with granularity of one day instead of a 4
> byte int with granularity of a second. That not only cuts two bytes off of the
> atime in every record, but it means that the record is updated for the time no
> more than once a day.

I'm unsure on this one.  Anyone else have any opinions?


> 3) The more radical change, use CHAR(5) from SHA1 hash for the tokens.
> 

Close to done.  I'm going to have to encode this binary data on dump and
backup and then decode it on restore, but I think I've figured out a nice
way to handle that.  It will be a bummer to lose the ability to see the
raw token data, but I suppose if it makes things smaller/faster it
will be worth it.

Anyone know how well Berkeley DB handles binary keys?

I think we'll need to do a 10-fold cross-validation test after we put
this change in to make sure we didn't degrade our results.

Michael

Comment 8 Justin Mason 2004-04-16 13:53:00 UTC
Subject: Re:  RFE: Bayes optimizations 

>> 2) Express atime as a two byte int with granularity of one day instead
>> of a 4 byte int with granularity of a second. That not only cuts two
>> bytes off of the atime in every record, but it means that the record is
>> updated for the time no more than once a day.
>
>I'm unsure on this one.  Anyone else have any opinions?

I'm really unsure it's a good idea -- we did find that the precision
of the atime values was valuable for effective expiry.   (The original
rev of atime was a 2-byte int with granularity of several hours.)

--j.

Comment 9 Sidney Markowitz 2004-04-16 15:49:24 UTC
> we did find that the precision of the
> atime values was valuable for effective expiry.

I must be missing something. How is the time more than a rough guide to deciding
which tokens to expire? Do tokens accumulate so quickly that expiration has to
be performed more often than every two hours? What didn't work when you tried
the two-hour granularity?
Comment 10 Justin Mason 2004-04-23 02:51:18 UTC
Subject: Re:  RFE: Bayes optimizations 

> > we did find that the precision of the atime values was valuable for
> > effective expiry.
> 
> I must be missing something. How is the time more than a rough guide to
> deciding which tokens to expire? Do tokens accumulate so quickly that
> expiration has to be performed more often than every two hours? What
> didn't work when you tried the two-hour granularity?

We found that it adversely affected expiry, causing it to fail because too
many messages were combined into the same "bucket", which hurt expiry's
ability to find a good cut-off.

Now that I think of it, though, we may have been using the *scan* time
instead of the message's origination time (from Received hdrs); this may
not be an issue with the latter...

The best thing to do here is to test it out, using mass-check I think.

--j.

Comment 11 Michael Parker 2004-04-23 09:40:55 UTC
Created attachment 1917 [details]
Patch File

I'm not 100% on this code yet.  I think it's pretty much code complete, but I
want to run some more tests, and it's been a while since I tested anything but
MySQL with the changes.  Also, I'm seeing some cases where the SQL backend is
learning more tokens (like 25) than DBM, and it's causing a slight ripple effect
in my benchmark.  No idea why that is.

Here is my current work in progress, which makes the following changes to Bayes
and the DBM/SQL storage modules.  Most of these changes are to the SQL storage
only, so unless I specify otherwise assume SQL only:

1) Implements a tok_get_all that fetches all of a message's tokens from storage
at one time.  This was for the DBM and SQL modules.  There is still some
possible tweaking on the SQL side to get the optimum number of tokens with each
fetch.  Need to gather some stats on the average number of tokens per msg.

2) The SQL storage now keeps a running tally of the number of tokens, the
newest token age and the oldest token age, rather than doing a full table scan
each time we need the information (practically every time we even look at
the Bayes code); a sketch of this appears at the end of this comment.  This
added some overhead to the learn process (roughly 2 times slower) but scan
time went down a whole bunch (5-6 times faster in my benchmark).

3) Slight short circuit in the expiry_due code: don't bother fetching variables
from the database if bayes_auto_expire is false.  This helps both storage
modules.

4) On the SQL side we now clean up tokens whose spam_count and ham_count are
both 0.  Previously we just ignored them and cleaned them up at expire time.
This slows down processing on forgets because we have to do the deletes.  There
is some room for improvement in this code; see the note in the source.

5) Cache some values from the database (i.e. db_version).  There is probably a
lot more caching that we could do, but when I went down that road performance
went down, so I backed it out in favor of KISS.  I'm certain that future work in
the caching area can improve the speed of the code.

6) Dead code removed.  I think there is more dead code that can be ripped out
(i.e. tok_get) but I wanted to make sure first.

7) We now store all user data by an integer userid.  All columns whose key
included username have been switched over to id.  The bayes_vars table is the
map from username to id.  NOTE: this pretty much requires an
AUTO_INCREMENT/SERIAL column in the DB, which is doable via triggers and
whatnot, but requires more work for the administrator.  Everything should in
theory work out of the box for MySQL and PostgreSQL.

8) Dropped the SQLite table defs from the sql dir.  Honestly, I do not believe
folks are going to use SQLite for this task, and it just turns into a
maintenance nightmare to have to update and test the table defs each time we
make a change.


9) HERE IS THE BIG CHANGE: All tokens are now stored as the lower 40 bits (i.e.
CHAR(5)) of a binary SHA1 hash.  This is for the DBM and SQL storage modules.
Disk space savings for DBM should be minimal; my average token size was ~9
characters, so it saves roughly 4 characters per token.  My actual benchmark
saw a 7% space savings for DBM.  However, on the SQL side, along with the
switch from username to userid, I saw a 40% disk space savings.  This also
allowed for a slight 10% or so speedup in msg scan times for SQL.

10) The database version for DBM and SQL databases has been changed to version
3.  DBM databases should automatically upgrade the first time they are opened
for writing; a quick little sa-learn --sync should handle that.  Due to the
need to change the on-disk table structure for SQL, an automatic upgrade is not
possible.  You'll have to wipe out and re-learn your data to upgrade to the new
version, at least until the backup/restore functionality gets finished.


Feel free to test at will; no promises that it won't wipe out your data, and
all the usual disclaimers apply.  I'd love to hear any feedback.
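
A minimal sketch of the running tally from item 2, with assumed column names
and without the surrounding transaction/locking the real code would need:

  use strict;
  use warnings;
  use DBI;

  # After learning $n_new previously unseen tokens for user $id, bump the
  # per-user counters instead of recomputing them with a full table scan.
  sub update_token_tally_sketch {
    my ($dbh, $id, $n_new, $msg_atime) = @_;
    $dbh->do(q{
      UPDATE bayes_vars
         SET token_count      = token_count + ?,
             newest_token_age = CASE WHEN ? > newest_token_age
                                     THEN ? ELSE newest_token_age END,
             oldest_token_age = CASE WHEN ? < oldest_token_age
                                     THEN ? ELSE oldest_token_age END
       WHERE id = ?
    }, undef, $n_new, $msg_atime, $msg_atime, $msg_atime, $msg_atime, $id);
  }
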
Comment 12 Sidney Markowitz 2004-04-23 14:21:08 UTC
> we may have been using the *scan* time
> instead of the message's origination time

Is there a problem when expiring on a specific time gets rid of too many tokens
and expiring on that atime + 1 gets rid of too few?

Can we just limit the number of items to be expired and select the ones to purge
at random if the atime can't come up with a good number?

Moving the atime discussion back here from sa-dev:

Someone suggested dropping high bits. The problem with that is it implies that
the origin of the atime units (i.e. what time is represented by atime==0) has
to be changed to be relative to the times in the db.

How about this: make the time units and the origin configuration options. The
size of atime would be defined in the schema. The current behavior would be a
granularity of 1 second, an origin of Jan 1, 1970 and a schema that makes atime
a 32-bit unsigned int, which will last for another 33 years or so. Setting the
granularity to 2 hours and an origin of Jan 1, 2004 with a 16-bit unsigned int
will allow for 15 years before anything has to be done.

Now here is an interesting one: a granularity of 8 minutes with 16-bit ints
will overflow after about one year. If it is acceptable to go through the entire
db once a year and adjust all atimes with a new origin that makes the oldest
equal to 0, that would work.

If we have a "super-expiry" operation that resets the origin of all the atimes,
we could leave it up to configuration options to specify atime scale, origin,
word size, and frequency of super-expiry. That would allow a lower volume site
to configure for 2 byte atimes that only have to be checked once a day, while a
very high volume site with a global database could check more often and decide
to either configure for a larger db that uses 32 bit ints or run a
"super-expiry" more often.

I don't see much extra cost for the extra flexibility.

Comments anyone?
Comment 13 Sidney Markowitz 2004-04-23 14:30:39 UTC
I said:
> Make the time units and the origin configuration options

Correction: I meant that the origin would be dynamic, stored in the database,
not just a configuration option. It has to be adjusted by the "super-expiry"
operation.
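
For the sake of discussion, a hedged sketch of what such a "super-expiry"
rebase could look like; the table layout and the atime_origin column are
assumptions, since this was only a proposal:

  use strict;
  use warnings;
  use DBI;

  # Shift all atimes so the oldest token becomes 0, and record the new origin.
  sub super_expire_rebase_sketch {
    my ($dbh, $id) = @_;
    my ($origin) = $dbh->selectrow_array(
      "SELECT atime_origin FROM bayes_vars WHERE id = ?", undef, $id);
    my ($min)    = $dbh->selectrow_array(
      "SELECT MIN(atime) FROM bayes_token WHERE id = ?", undef, $id);
    return unless defined $min && $min > 0;
    $dbh->do("UPDATE bayes_token SET atime = atime - ? WHERE id = ?",
             undef, $min, $id);
    $dbh->do("UPDATE bayes_vars SET atime_origin = ? WHERE id = ?",
             undef, $origin + $min, $id);
  }
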
Comment 14 Daniel Quinlan 2004-04-23 16:35:31 UTC
Subject: Re:  RFE: Bayes optimizations

How much does going from 4 bytes to 3 or 2 really buy us?

4 bytes is easy, hard to screw up, less code, and doesn't take that much
space, does it?

Daniel

Comment 15 Sidney Markowitz 2004-04-23 16:59:40 UTC
I guess that staying with 32-bit ints would not make a whole lot of difference
in the db size, but at least at sonic.net the bottleneck with Bayes was I/O
resource usage. Much of that was from updating the atime fields of every token
that was selected to be used in a Bayes calculation of a message.

I will concede that the bigger factor there is update frequency, which has to do
with the granularity of atime, not its size. So we could stick to 32 bit
standard unix time format and be safe until 2038. If we could have a
configurable scaling factor for atime then it would control the granularity very
simply.

On the other hand, if we had an option for the origin of atime as well as one
for the scale, then sites with suitable volume could configure something that
would work with 16-bit ints, adjust the schema, and save a megabyte or four per
user. For some places it would be worth it. This would not have to have a
"super-expiry" if it is OK to require that the options be set so that the db
will be able to handle atimes for the foreseeable future. Without super-expiry,
I think it is still not much code and hard to screw up.
Comment 16 Sidney Markowitz 2004-04-23 17:11:50 UTC
What about leaving atime in 32-bit unix time format but having a configuration
option to set the granularity used when deciding if a token that has been used
needs to have its atime updated?

That would help keep down the I/O requirements without requiring the database to
be rebuilt if someone wants to change the value of the option.

The code change for this would be smaller. It wouldn't save any space, but it
would help the I/O usage.
Comment 17 Daniel Quinlan 2004-04-23 17:25:43 UTC
Subject: Re:  RFE: Bayes optimizations

> What about leaving atime in 32 bit unix time format but having a
> configuration option to set the granularity used when deciding if a
> token that has been used needs to have its atime updated?

How about a only do it once in N times factor?  Perhaps one of these
(depending on what's easiest and works best).

 - update every N times
 - update if !(current_time % N)
 - update if rand(N) < 1
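
A quick sketch of those three variants in the atime-update path (N and the
function name are hypothetical):

  use strict;
  use warnings;

  my $N = 10;            # hypothetical once-in-N factor
  my $update_count = 0;  # per-process counter for the first variant

  sub should_touch_atime {
    my ($variant) = @_;
    return (++$update_count % $N) == 0 if $variant eq 'every_n';
    return (time() % $N) == 0          if $variant eq 'time_mod';
    return rand($N) < 1                if $variant eq 'random';
    return 1;                          # default: always update
  }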

Comment 18 Justin Mason 2004-04-23 18:54:18 UTC
Subject: Re:  RFE: Bayes optimizations 

OK, I took a look to remind me ;)

http://svn.apache.org/viewcvs.cgi/incubator/spamassassin/trunk/lib/Mail/SpamAssassin/BayesStore.pm?root=Apache-SVN

Note that rev 5112, May 9 2003, was when atime was changed to refer to the
message received time: 'bug 1821: we decided on a new bayes format again.
this time it's token atime is based on message received time.  this
brought up a whole bunch of issues, but mostly required new pack/unpack,
upgrade_db, expire, and a bunch of other changes.  while I was milling
about I fixed a few other random bugs.  I don't quite remember what they
are right now, so just assume everything's peachy.'

Rev 5038 -- 'bugs 1523, 1666, 1775, and 1776: set a new bayes db format for
2.6x, version 1 (2.5x had version 0).  atimes are now unsigned ints (32-bit)
instead of a short (16-bit) so we won't get overflow/expire issues. code is now
included to automatically upgrade to v1 when the db is opened for writing.  v0
is still supported for readonly mode.  bayes tokens are now packed in
little-endian (VAX) format.  this means that the same DB will work on different
endian platforms.'

rev 4722 -- 'fix for bug 1625. we only touch the token when syncing the
journal, and the journal only has updates from scan() whereas tokens are
automatically updated from learn(), so we will likely end up lowering the
atime of the token when we sync. doh. :('


So I think this means we *can* use a 2-byte atime format safely, since the
problems we had with that before were *definitely* caused by using
message-scan and message-learn time as the atime, instead of
message-received time.

--j.

Comment 19 Sidney Markowitz 2004-04-23 18:55:02 UTC
> How about an only-do-it-once-in-N-times factor?

Yes, that's what I meant when I said "granularity" while leaving the format the same.
Comment 20 Theo Van Dinter 2004-04-23 19:58:24 UTC
Subject: Re:  RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 06:54:20PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> So I think this means we *can* use a 2-byte atime format safely, since the
> problems we had with that before were *definitely* caused by using
> message-scan and message-learn time as the atime, instead of
> message-received time.

IIRC, my opinion on this was: yeah, we could save a byte or two per
token, but I'd rather have the full 32-bit time value 1) in case we
want to change the expire algorithm again, 2) in case we want to allow
people to change the granularity of expiry, 3) it requires less coding,
less complexity, less testing, less run-time resources to execute said
code, etc.  Oh, and if we wanted to change any of those, we'd pretty
much have to upgrade the database data.

Keeping it as the full raw 32-bit value gives us the most flexibility
while not requiring a lot of resources.  Doing something like hashing
the tokens would save more space on average than doing something to the
atime value anyway.

<sort of a rant>
I don't think shaving off a byte or two would really buy any I/O time
anyway.  IIRC from testing, the majority of the I/O usage was reading the
data from disk.  Since our I/O usage is relatively random, we're not able
to use caching/prefetching, so the time is from having to read whole disk
blocks to get our data.  Disk blocks are typically 512 bytes in size.
So since our data is smaller than that, we'll always have to read the
whole block anyway, so there's no savings for us to have (for example)
30 bytes of data instead of 32 bytes of data.
</rant>

Comment 21 Theo Van Dinter 2004-04-23 20:10:02 UTC
Subject: Re:  RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 04:59:41PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> resource usage. Much of that was from updating the atime fields of every token
> that was selected to be used in a Bayes calculation of a message.

I can buy that to some degree.  It's the same reason you turn off atime
updating on filesystems for things like proxy server and news server
storage.

However, as with news and proxy servers, they'll typically do expiry
with data from the files themselves -- which for us _is_ the atime update.

Do we have a real breakdown as to where our main I/O usage comes from?  

> I will concede that the bigger factor there is update frequency, which has to do
> with the granularity of atime, not its size. So we could stick to 32 bit
> standard unix time format and be safe until 2038. If we could have a
> configurable scaling factor for atime then it would control the granularity very
> simply.

Well, we already have that with the current code.  atime updates go into
the journal which is "light" I/O usage, then periodically the journal
gets synced to the DB.  The code that reads in the journal caches atime
updates, and then does 1 update per token instead of multiple updates
per token.
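
Roughly, that journal sync amounts to something like the following; the journal
line format and callback shown here are made up for illustration:

  use strict;
  use warnings;

  # Coalesce journalled atime updates so each token is written once,
  # with its newest atime.
  sub sync_journal_atimes_sketch {
    my ($journal_fh, $touch_cb) = @_;
    my %newest;
    while (my $line = <$journal_fh>) {
      chomp $line;
      my ($token, $atime) = split ' ', $line, 2;
      next unless defined $atime;
      $newest{$token} = $atime
        if !exists $newest{$token} || $atime > $newest{$token};
    }
    $touch_cb->($_, $newest{$_}) for keys %newest;   # one update per token
  }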

Comment 22 Loren Wilton 2004-04-23 20:50:27 UTC
Subject: Re:  RFE: Bayes optimizations

> to either configure for a larger db that uses 32 bit ints or run a
> "super-expiry" more often.
>
> Comments anyone?

My only thought is that without careful or clever design, the super expiry
probably has to hit the entire db, or darn near, and this could result in a
huge spike in disk/processor activity at some arbitrary time.

If there were some reasonable/feasible way to spread that load out over some
amount of time it may be less of a system hit.

I think there might be a reasonable way to do this if the db had two
timestamps representing times N and N+1, where +1 was the current time field
modulo.  You could use one bit of a timestamp field to indicate which base
it was relative to.  If the actual expiration is < the +1 amount of time,
you should be able to do expiration pretty much at your leisure, as long as
one complete pass through the db could be done in the rollover time period.

Then again, maybe it only takes a few seconds to run the whole super expiry,
and the hit on the db access times and processor utilization aren't worth
considering.

        Loren

Comment 23 Michael Parker 2004-04-23 21:21:36 UTC
Subject: Re:  RFE: Bayes optimizations

On Fri, Apr 23, 2004 at 08:10:03PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> Well, we already have that with the current code.  atime updates go into
> the journal which is "light" I/O usage, then periodically the journal
> gets synced to the DB.  The code that reads in the journal caches atime
> updates, and then does 1 update per token instead of multiple updates
> per token.
> 

The SQL code doesn't make use of a journal.  I've contemplated a
journal table that we can do fast inserts into and treat just like the
journal file.  However, it just hasn't been necessary and overly
complicates things so I've left it at KISS.

Another option could be a mass update, tok_touch_all.  The only
problem there is that the SQL code makes sure atimes are only updated
if they are greater than the existing atime, which wouldn't work well
with a multi-row update.
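
For reference, one way a conditional multi-row touch could be expressed (a
hedged sketch, not code from the module), keeping the atime comparison in the
WHERE clause:

  use strict;
  use warnings;
  use DBI;

  # Touch many tokens at once, but never lower an existing atime.
  sub tok_touch_all_sketch {
    my ($dbh, $id, $atime, @tokens) = @_;
    return unless @tokens;
    my $in = join(',', ('?') x @tokens);
    $dbh->do("UPDATE bayes_token
                 SET atime = ?
               WHERE id = ? AND token IN ($in) AND atime < ?",
             undef, $atime, $id, @tokens, $atime);
  }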

Comment 24 Michael Parker 2004-04-28 14:20:43 UTC
Taking
Comment 25 Michael Parker 2004-04-28 18:04:13 UTC
Committed in r10394.
Comment 26 Daryl C. W. O'Shea 2004-11-01 16:20:25 UTC
From Michael Parker  2004-04-12 15:00
> 1) Implements Sidney's tok_get_all method for SQL and DBM.  Right now the SQL
> version will get the tokens from the DB in chunks (100, 50, 25, 5, 1) which
> needs to be benchmarked and tweaked based on what works the best.

For MySQL anyway, is there a reason for attempting to cache bayes token queries?
 Every time the atime of a token is updated the cache for the entire bayes_token
table is cleared, so queries are very rarely actually served from cache.

Since these token queries don't benefit from the SQL server cache, there's no
point in caching them (they'll be cleared anyway) and no need to worry about
blowing away the cache (the reason behind bunches I believe).

I've recently timed token queries for about 4600 messages as they were received
by my mail server.  Replacing the fixed bunch sizes with a while loop that
queries up to 100 tokens at a time (I didn't want to exceed any maximum query
lengths) significantly decreased the amount of time the token query takes for an
average message (average of 193 tokens), by about 57%.


Using current bunches:

2295 messages
194 tokens per message average
1.227 seconds per message
0.00630 seconds per token


Using loop, up to 100 tokens at a time:

2302 messages
192 tokens per message average
0.511 seconds per message
0.00266 seconds per token


Using the loop, the cache is also cleared on every atime update, as above with
bunches.  SQL_NO_CACHE could be inserted into the statement to avoid the
overhead of the unused cache insertion.

I'd imagine other SQL servers would behave similarly, but I'm not familiar with
how other servers (Oracle, Postgres, etc.) handle caching, specifically what
causes a table's cache to be cleared.
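
For reference, a hedged sketch of the loop approach described above (up to 100
tokens per query, with SQL_NO_CACHE); this is an illustration, not the attached
patch:

  use strict;
  use warnings;
  use DBI;

  # Fetch token rows in slices of at most 100, bypassing MySQL's query cache.
  sub tok_get_all_loop_sketch {
    my ($dbh, $id, @tokens) = @_;
    my @rows;
    while (my @chunk = splice(@tokens, 0, 100)) {
      my $in  = join(',', ('?') x @chunk);
      my $sth = $dbh->prepare(
        "SELECT SQL_NO_CACHE token, spam_count, ham_count, atime
           FROM bayes_token
          WHERE id = ? AND token IN ($in)");
      $sth->execute($id, @chunk);
      push @rows, @{ $sth->fetchall_arrayref() };
    }
    return \@rows;
  }
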
Comment 27 Michael Parker 2004-11-01 16:34:09 UTC
Subject: Re:  RFE: Bayes optimizations

You have a patch?  I'll try it out.  There are other reasons for the
bunches, not just the query cache; in fact they have very little to do
with the query cache.

Michael
Comment 28 Daryl C. W. O'Shea 2004-11-01 17:16:47 UTC
Created attachment 2502 [details]
tok_get_all without bunches

Patch leaves unused code (the single_sql blocks) in place for testing.
Comment 29 Michael Parker 2004-11-02 08:55:19 UTC
Subject: Re:  RFE: Bayes optimizations

Indeed, after a little analysis it looks like we might come out ahead
(by reducing the number of queries by half or more) generating dynamic
queries versus looping over static ones.

However, this bug is not the appropriate place for the discussion.
So, if needed I'll open up another bug, but count on something that
generates more dynamic queries going into the code soon (as soon as I
can code it up and benchmark it).
Comment 30 Michael Parker 2004-11-04 10:32:49 UTC
Subject: Re:  RFE: Bayes optimizations

Interestingly enough, I don't see this same speedup.  I see about 1%
in one part of my benchmark (scanning via spamc/spamd) and a little less
than 4% in another (scanning with spamassassin).  Increasing the loop
to 200 helps in the spamc/spamd case but hurts in the spamassassin
case.  My average token count per message is 197.  I think more statistical
analysis is needed here.  I'm going to go ahead and open another bug
to track this.

Michael