SA Bugzilla – Bug 3852
two-level system/user configs, improve speed and memory performance
Last modified: 2007-12-12 13:26:32 UTC
OK, here's something that in my opinion we need to do for 3.1.0. Currently, we have this situation: - spamd daemon starts - creates $spamtest Mail::SpamAssassin object - daemon reads system-wide configuration (rules, scores, description, plugins, lots more) into $spamtest->{conf}. this is very heavyweight - clones the current {conf} object to a "backup" using Mail::SpamAssassin::copy_config() - forks N processes: - foreach request: - accepts request from user - gets username - reads config into $spamtest->{conf} - processes request - returns scan results - clones the "backup" into $spamtest->{conf}, overwriting any per-user changes This is very inefficient. Although I haven't yet benchmarked it, the size of the {conf} data structure, with all its rules, scores, et al is huge... and I suspect this is probably the main reason for bug 3839, the "80% of spamd child memory is unshared" bug. I suggest we move to this model instead: - spamd daemon starts - creates $spamtest Mail::SpamAssassin object - daemon reads system-wide configuration (rules, scores, description, plugins, lots more) into $spamtest->{conf}->{system}. - forks N processes: - foreach request: - accepts request from user - gets username - reads config into $spamtest->{conf}->{user}. - processes request - returns scan results - deletes $spamtest->{conf}->{user}. The key change would be: pro: - the "clone" operation and all copying it performs using Storable, would no longer be required - churning through memory should be reduced; I have a feeling this has a lot to do with that unshared-memory bug con: - however, we'll have to change code that *accesses* config items to use another method to get that data. What I'm suggesting is the following. - we implement the "dual-layer" model as above. - Most scalar accesses to $conf->{config_item_name} become accesses to $conf->get(ID). ID is a numeric constant exported by Constants.pm. (why use numeric constants? Simple -- looking up values in a hashref using string hash-keys is slower than looking up values in an arrayref using integer indexes.) - More complex accesses must become double-layer aware, so that they can iterate over hashes and hashes-of-hashes without requiring they be copied first. - there should be another API method with get()-like semantics for plugins, which supports looking up config values using string keys, in a hash, so that plugins have a sensible method to lookup whatever configuration settings they too have set. (reason: we can't know in advance what config settings plugins use, and it'd be very inflexible to require they register ID-code namespace in advance. so let them use strings.) The downside is that reads of the config data would possibly be slower -- but I think we may be able to ameliorate that with judicious caching; and I'll bet that it won't be as slow as the overall time when the Storable copy_config() operation is included.
thinking for 3.1.0
' - Most scalar accesses to $conf->{config_item_name} become accesses to $conf->get(ID). ID is a numeric constant exported by Constants.pm. (why use numeric constants? Simple -- looking up values in a hashref using string hash-keys is slower than looking up values in an arrayref using integer indexes.)' Better idea. we can actually do similar to what PerMsgStatus does, for speed; precompile the accessor functions. when we parse the list of available config items, note the ones that are scalars. foreach item: generate accessor method like so: sub cf_$ITEMNAME { my ($self) = @_; return $self->{sys}{$ITEMNAME}; } in calling code, call like so: my $var = $conf->cf_ITEMNAME(); the benefit -- it allows perl to precalculate the hashes, which gives about a 6-7% speedup. For the $var->$conf->cf_get("ITEMNAME") case, you have: 1. look up "cf_get" in $conf package stash (prehashed) 2. look up "ITEMNAME" in $conf->{sys} hash (this is not prehashed) 3. return value in the new case: 1. look up "cf_get_ITEMNAME" in $conf package stash (prehashed) 2. look up "ITEMNAME" in $conf->{sys} hash (prehashed) 3. return value still thinking about this, though. but it looks like a good speedup... it should also help with RAM, too, I think, since hash keys use a single global refcounted pool of string singletons, so multiple refs to {name_of_config_item} as a hash key are stored as only 1 string internally.
We both independently have been thinking about how to do this. I came up with basically the same design, but the problem always comes up of how to deal with the second user config space. We don't want to ever touch the shared system conf space, so all work has to happen in the user area. So at some point, after setting the system config in place, there needs to be some flag set for the Conf object that says any further config access should go through the secondary user conf. When that happens, both config option setting and retrieving need to go through a new set of functions (as opposed to just accessing the values). Scalar values are easy. Array values need to have the system array copied into the user space, then modified. This handles both adds and deletes (think unwhitelist_from, etc.) Hash values have the same issue as arrays. I'm not sure, but doing the Storable/clone thing to copy arrays and hashes over may be the way to go. Score and rule arrays only ever need additions. Users can't remove scores or rules, so we could pretty easily concat the user and system space together. Since these are the most (IIRC) complex data structures we have in the conf object, we should probably treat them specially. Other issues are: this changes the "API", such that former plugins and such will only ever know how to modify the Conf object directly. I would deal with this by leaving the system config in the Conf object, and the user config somewhere else. That way, worst case, the plugins/etc won't be changing things out in the middle of nowhere.
Subject: Re: better way to have user configs I have no idea if it is an applicable concept in Perl, but I know how I would trivially solve this in C++. A derived class. There would be a base Config class that got populated with the system configuration. Once I got to the (first/next) user configuration, I'd invent a subclass of the main config class and link it to the main config class. The subclass would have identical interfaces tot he base class. I'd then start populating options in to the derived class. All requests for values would be sent to the derived class. If it had a value (a user override) it would supply that. If not it would defer to the base class to return a value. In actual fact this wouldn't even be a base/subclass, it would be a single class definition that could be have linked instances. If an instance was queried and couldn't supply a value, but was linked to another instance, it would pass along the request to that instance. If it couldn't supply a value and didn't have another linked instance, it would either return a knwon default or failure, depending on the general design or the specific value being queried. Obviously you can have N arbitrary user configs in memory at the same time, all treed to the base config. You only need a request distributor shell with an array or hash of pointers to the user config classes so that it can direct the current request to the current user class. Or you can have one user config in memory, populate it at the start of processing for that user, then delete it at the end of processing and create a config for the next user. The nice thing about this is you can treat all requests the same. You can always direct config requests to a user config object. If there is no specific user config, it would contain no values, and would defer to the main config it was linked to. Note that you could also have multiple system config objects and select which one to chain the user objects to. For instance, you could have one per scoreset if you wanted to.
btw, harking back to Theo's last comment -- 'Array values need to have the system array copied into the user space, then modified. This handles both adds and deletes (think unwhitelist_from, etc.) Hash values have the same issue as arrays.' that is very slow when switching users; performing lookups in both hashes is more efficient overall. also, IMO we'd be better off without Storable. there are certainly bugs in that code, I'm almost certain it's behind the hyperthreading hangs -- and I think we're stressing it more than anyone else is. btw I have a patch that implements all this, but since it's part of my McAfee work, it's stuck waiting for approval there. :(
Subject: Re: better way to have user configs On Mon, Mar 21, 2005 at 05:35:58PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > 'Array values need to have the system array copied into the user space, then > modified. This handles both adds and deletes (think unwhitelist_from, etc.) > Hash values have the same issue as arrays.' > > that is very slow when switching users; performing lookups in both hashes is > more efficient overall. How do you handle deletes then? There's no way to mark that something has been taken out of the system array. The hashes aren't so bad, since I don't think we actually delete keys from hashes, but we definitely delete elements from arrays. > also, IMO we'd be better off without Storable. there are certainly bugs in that > code, I'm almost certain it's behind the hyperthreading hangs -- and I think > we're stressing it more than anyone else is. Agreed. > btw I have a patch that implements all this, but since it's part of my McAfee > work, it's stuck waiting for approval there. :( :(
deletes -- "$item{foo} = undef" and test for exists() to see if something is set in a given config level.
Subject: Re: better way to have user configs On Mon, Mar 21, 2005 at 06:23:40PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > deletes -- "$item{foo} = undef" and test for exists() to see if something is set > in a given config level. For hashes, sure (assuming undef isn't a valid value). How about an array though? @sys = ( 1, 2, 3, 4, 5 ); Now I want to delete $sys[2]. How can I specify that in a different array? The same undef trick will not work here for various reasons.
Subject: Re: better way to have user configs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 bugzilla-daemon@bugzilla.spamassassin.org writes: > For hashes, sure (assuming undef isn't a valid value). How about > an array though? > > @sys = ( 1, 2, 3, 4, 5 ); > > Now I want to delete $sys[2]. How can I specify that in a different array? > The same undef trick will not work here for various reasons. as far as I know, $sys[2] = undef; does the trick, right? then if (defined $sys[2]) will return false to indicate deletion, but "foreach" and scalar @sys will still indicate the right number of elements. - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Exmh CVS iD8DBQFCP4a8MJF5cimLx9ARAqM8AJ9uSsznd/pBL9WqGP1vs3Or4fGSQQCgn7ob PGTOqgZTAdTZ0kBg22XvcHw= =XsNw -----END PGP SIGNATURE-----
Subject: Re: better way to have user configs On Mon, Mar 21, 2005 at 06:46:27PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > > @sys = ( 1, 2, 3, 4, 5 ); > > as far as I know, $sys[2] = undef; does the trick, right? > then if (defined $sys[2]) will return false to indicate deletion, > but "foreach" and scalar @sys will still indicate the right number > of elements. Well, @sys is the system array which we don't want to change. So we could do: $user[2] = undef, but that makes $user[0..1] = undef as well, so now it looks like 3 things are deleted. The only way to make the undef thing work is to copy over the original array and then do the undef, but of course then just splicing out the value would be better in that situation.
Subject: Re: better way to have user configs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > > > @sys = ( 1, 2, 3, 4, 5 ); > > > > as far as I know, $sys[2] = undef; does the trick, right? > > then if (defined $sys[2]) will return false to indicate deletion, > > but "foreach" and scalar @sys will still indicate the right number > > of elements. > > Well, @sys is the system array which we don't want to change. sure, I was just talking about "an" array, one that happened to be called @sys. ;) > So we could do: $user[2] = undef, but that makes $user[0..1] = undef > as well, so now it looks like 3 things are deleted. The only way to > make the undef thing work is to copy over the original array and then > do the undef, but of course then just splicing out the value would be > better in that situation. why does it make 0..1 undef? perl -e 'my @foo=qw(a b c d e); $foo[2] = undef; print join(",", @foo);' a,b,,d,e looks good to me! - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Exmh CVS iD8DBQFCP6pxMJF5cimLx9ARAlymAJ9GrqAz+LxsnCaGod7yyD20ec1OfwCggAFS n43eU/5iU75aC08YlWjL1hQ= =/z22 -----END PGP SIGNATURE-----
Subject: Re: better way to have user configs On Mon, Mar 21, 2005 at 09:18:50PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > why does it make 0..1 undef? > perl -e 'my @foo=qw(a b c d e); $foo[2] = undef; print join(",", @foo);' > a,b,,d,e > looks good to me! Right, but the point is to have @sys be untouched, and then have any updates go into a different area. So: @foo = qw(a b c d e); $bar[2] = undef; How cat I get (a,b,d,e) out of that? $bar[0] and $bar[1] are undef as well (as are $bar[4] and $bar[5] for that matter). So we can't do a if(defined($bar[###])) ... We could create a vec and set 0..$# to 1 to start, then set to 0 as appropriate. Then put additions in a new array and concat as necessary.
Subject: Re: better way to have user configs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 bugzilla-daemon@bugzilla.spamassassin.org writes: > Right, but the point is to have @sys be untouched, and then have any updates > go into a different area. So: > > @foo = qw(a b c d e); > $bar[2] = undef; > > How cat I get (a,b,d,e) out of that? $bar[0] and $bar[1] are undef as well > (as are $bar[4] and $bar[5] for that matter). So we can't do a > if(defined($bar[###])) ... We could create a vec and set 0..$# to 1 to > start, then set to 0 as appropriate. Then put additions in a new array and > concat as necessary. I think I need a real-worldish example. As far as I can see, the only array type in Conf is 'bayes_ignore_header', which has no corresponding 'bayes_unignore_header' removal function anyway, so we may be arguing all over the place about a nonexistent problem. (PS: I am suggesting that code outside of Conf.pm that performs lookups will need to change, anyway, in case that's not clear. That's not avoidable in my opinion.) - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Exmh CVS iD8DBQFCQGhsMJF5cimLx9ARAvbzAJwPAWNt8JM+R+uoyHqwq8kocik6oACdEC/s mqw54j4yzTmQ/U+UIhkxvvk= =BV// -----END PGP SIGNATURE-----
Subject: Re: better way to have user configs On Tue, Mar 22, 2005 at 10:49:26AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > I think I need a real-worldish example. As far as I can see, the > only array type in Conf is 'bayes_ignore_header', which has no > corresponding 'bayes_unignore_header' removal function anyway, > so we may be arguing all over the place about a nonexistent > problem. Hrm. I was going to say whitelists, but those seem to be hashes now. Ok never mind. <G> Another issue then is that we'll want to make sure plugins and such know to not do any form of array deletes, since it won't be supported.
No way we should do this in 3.1 without risking too much instability. Moving to 3.2 milestone.
Just thought I'd throw this out there. I haven't fully thought it out yet, but I had an idea about having multiple tiers for configs. The system config gets put onto the first tier, and then user configs are handled on a different tier. Tiers could be easily implemented with arrays, so the system is [0] and the user's is [1] (there could be reasons to have more tiers than that, hence an array instead of just 2 variables -- ie: system config in [0], site config in [1], user config in [2], etc.) So generally, spamassassin/sa-learn/etc just uses the first tier. spamd would create an empty second tier in the children before reading in the user config, then simply throws it out (pop()) when finished. The main issues are: all of the code needs to stop using conf references and change to some type of get/set functions. get() will check the top tier, then work down until it finds something (or not). set() would always use the top tier. deletes to arrays/hashes would happen somehow, and generally arrays and hashes would have to be copied to the higher tier for changes to take effect.
Theo, you're spot on -- the method you describe would work fine. Note that the get/set methods can be created automatically, on-the-fly, as described in http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3852#c2 . That makes things cleaner.
moving off to 3.3.0, I doubt it's going to happen for 3.2.0... feel free to retarget, of course
noting dependency; this would be a good way to improve memory performance (bug 3839)
Created attachment 4047 [details] demo here's an idea of where I'm going with this; basically, this creates accessor methods automatically for the config settings with a 'type' defined. Then calling code instead of using $conf->{foo_bar}, uses $conf->cf_foo_bar().
ok, I should add some comments based on how far I got. There's a branch in SVN now at https://svn.apache.org/repos/asf/spamassassin/branches/jm_bug_3852_two_level_configs which contains an implementation of what was described; however, it's turned out to be a nightmare, particularly for plugins :( Having to understand the new Conf object structure, with two tiers at $conf->{tiers}->[0] and $conf->{tiers}->[1], means that no plugins will do the right thing without code changes. I'd prefer to avoid thatm since it's bad news for third party plugins. Instead, I have an idea, which involves turning the Conf object into a tied object, presenting a hash-like UI in front of two tier hashes. This way, plugins (and other code) can read and write to the Conf object with minimal changes. still need to think more about how this would work, though.
well, the code is progressing; it now handles two-tier configuration for most of the main part of the code, and passes most of the test suite. Unfortunately, this is coming with a hefty speed hit. SVN trunk, mass-check of ~220 msgs: real 1m15.214s user 0m23.093s sys 0m0.332s in the bug_3852 branch: real 1m30.468s user 0m33.762s sys 0m0.272s that's a 20% slowdown -- and mass-check doesn't even switch in/out user configs at all. not good! I don't think it's worth it; even duplicating the entire Conf object around, as we currently do, is faster than the overhead this way involves. I think I'll shelve the project for now. I'll leave the branch there in case anyone wants to take a look.
(In reply to comment #22) > well, the code is progressing; it now handles two-tier > configuration for most of the main part of the code, and passes most > of the test suite. I should point out, that's using the tie magic I was talking about in comment #21.
I'm closing this bug as WONTFIX; it hasn't worked out.