3852 – two-level system/user configs, improve speed and memory performance

Bug 3852 - two-level system/user configs, improve speed and memory performance

Summary: two-level system/user configs, improve speed and memory performance

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries (show other bugs)
Version:	3.0.0
Hardware:	Other other

Importance:	P5 normal
Target Milestone:	3.3.0
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	3839
	Show dependency tree

Reported:	2004-09-30 17:42 UTC by Justin Mason
Modified:	2007-12-12 13:26 UTC (History)
CC List:	0 users

Attachment	Type	Modified	Status	Actions	Submitter/CLA Status
demo	patch			None	Justin Mason
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Justin Mason 2004-09-30 17:42:19 UTC

OK, here's something that in my opinion we need to do for 3.1.0. Currently, we
have this situation:

    - spamd daemon starts
    - creates $spamtest Mail::SpamAssassin object
    - daemon reads system-wide configuration (rules, scores, description,
      plugins, lots more) into $spamtest->{conf}.  this is very heavyweight
    - clones the current {conf} object to a "backup"
      using Mail::SpamAssassin::copy_config()
    - forks N processes:
      - foreach request:
        - accepts request from user
        - gets username
        - reads config into $spamtest->{conf}
        - processes request
        - returns scan results
        - clones the "backup" into $spamtest->{conf}, overwriting any
          per-user changes

This is very inefficient.  Although I haven't yet benchmarked it, the size of
the {conf} data structure, with all its rules, scores, et al is huge... and I
suspect this is probably the main reason for bug 3839, the "80% of spamd child
memory is unshared" bug.

I suggest we move to this model instead:

    - spamd daemon starts
    - creates $spamtest Mail::SpamAssassin object
    - daemon reads system-wide configuration (rules, scores, description,
      plugins, lots more) into $spamtest->{conf}->{system}.
    - forks N processes:
      - foreach request:
        - accepts request from user
        - gets username
        - reads config into $spamtest->{conf}->{user}.
        - processes request
        - returns scan results
        - deletes $spamtest->{conf}->{user}.

The key change would be:

pro:
  - the "clone" operation and all copying it performs using Storable, would no
longer be required
  - churning through memory should be reduced; I have a feeling this has a lot
to do with that unshared-memory bug

con:
  - however, we'll have to change code that *accesses* config items to use
another method to get that data.

What I'm suggesting is the following.

  - we implement the "dual-layer" model as above.

  - Most scalar accesses to $conf->{config_item_name} become accesses to
$conf->get(ID).   ID is a numeric constant exported by Constants.pm. (why use
numeric constants?  Simple -- looking up values in a hashref using string
hash-keys is slower than looking up values in an arrayref using integer indexes.)

  - More complex accesses must become double-layer aware, so that they can
iterate over hashes and hashes-of-hashes without requiring they be copied first.

  - there should be another API method with get()-like semantics for plugins,
which supports looking up config values using string keys, in a hash, so that
plugins have a sensible method to lookup whatever configuration settings they
too have set.  (reason: we can't know in advance what config settings plugins
use, and it'd be very inflexible to require they register ID-code namespace in
advance.  so let them use strings.)


The downside is that reads of the config data would possibly be slower -- but I
think we may be able to ameliorate that with judicious caching; and I'll bet
that it won't be as slow as the overall time when the Storable copy_config()
operation is included.

Comment 1 Justin Mason 2004-09-30 17:42:34 UTC

thinking for 3.1.0

Comment 2 Justin Mason 2004-10-05 16:13:43 UTC

'  - Most scalar accesses to $conf->{config_item_name} become accesses to
$conf->get(ID).   ID is a numeric constant exported by Constants.pm. (why use
numeric constants?  Simple -- looking up values in a hashref using string
hash-keys is slower than looking up values in an arrayref using integer indexes.)'

Better idea.  we can actually do similar to what PerMsgStatus does, for speed;
precompile the accessor functions. when we parse the list of available config
items, note the ones that are scalars.  foreach item: generate accessor method
like so:

        sub cf_$ITEMNAME {
          my ($self) = @_;
          return $self->{sys}{$ITEMNAME};
        }

in calling code, call like so:

        my $var = $conf->cf_ITEMNAME();

the benefit -- it allows perl to precalculate the hashes, which gives about a
6-7% speedup.  For the $var->$conf->cf_get("ITEMNAME") case, you have:

    1. look up "cf_get" in $conf package stash (prehashed)
    2. look up "ITEMNAME" in $conf->{sys} hash (this is not prehashed)
    3. return value

in the new case:

    1. look up "cf_get_ITEMNAME" in $conf package stash (prehashed)
    2. look up "ITEMNAME" in $conf->{sys} hash (prehashed)
    3. return value

still thinking about this, though.  but it looks like a good speedup... it
should also help with RAM, too, I think, since hash keys use a single global
refcounted pool of string singletons, so multiple refs to {name_of_config_item}
as a hash key are stored as only 1 string internally.

Comment 3 Theo Van Dinter 2005-03-12 08:35:26 UTC

We both independently have been thinking about how to do this.  I came up with
basically the same design, but the problem always comes up of how to deal with
the second user config space.  We don't want to ever touch the shared system
conf space, so all work has to happen in the user area.

So at some point, after setting the system config in place, there needs to be
some flag set for the Conf object that says any further config access should go
through the secondary user conf.

When that happens, both config option setting and retrieving need to go through
a new set of functions (as opposed to just accessing the values).

Scalar values are easy.

Array values need to have the system array copied into the user space, then
modified.  This handles both adds and deletes (think unwhitelist_from, etc.)

Hash values have the same issue as arrays.

I'm not sure, but doing the Storable/clone thing to copy arrays and hashes over
may be the way to go.

Score and rule arrays only ever need additions.  Users can't remove scores or
rules, so we could pretty easily concat the user and system space together. 
Since these are the most (IIRC) complex data structures we have in the conf
object, we should probably treat them specially.

Other issues are: this changes the "API", such that former plugins and such will
only ever know how to modify the Conf object directly.  I would deal with this
by leaving the system config in the Conf object, and the user config somewhere
else.  That way, worst case, the plugins/etc won't be changing things out in the
middle of nowhere.

Comment 4 Loren Wilton 2005-03-12 15:12:54 UTC

Subject: Re:  better way to have user configs

I have no idea if it is an applicable concept in Perl, but I know how I
would trivially solve this in C++.  A derived class.

There would be a base Config class that got populated with the system
configuration.
Once I got to the (first/next) user configuration, I'd invent a subclass of
the main config class and link it to the main config class.  The subclass
would have identical interfaces tot he base class.

I'd then start populating options in to the derived class.

All requests for values would be sent to the derived class.  If it had a
value (a user override) it would supply that.  If not it would defer to the
base class to return a value.

In actual fact this wouldn't even be a base/subclass, it would be a single
class definition that could be have linked instances.  If an instance was
queried and couldn't supply a value, but was linked to another instance, it
would pass along the request to that instance.  If it couldn't supply a
value and didn't have another linked instance, it would either return a
knwon default or failure, depending on the general design or the specific
value being queried.

Obviously you can have N arbitrary user configs in memory at the same time,
all treed to the base config.  You only need a request distributor shell
with an array or hash of pointers to the user config classes so that it can
direct the current request to the current user class.  Or you can have one
user config in memory, populate it at the start of processing for that user,
then delete it at the end of processing and create a config for the next
user.

The nice thing about this is you can treat all requests the same.  You can
always direct config requests to a user config object.  If there is no
specific user config, it would contain no values, and would defer to the
main config it was linked to.

Note that you could also have multiple system config objects and select
which one to chain the user objects to.  For instance, you could have one
per scoreset if you wanted to.

Comment 5 Justin Mason 2005-03-21 17:35:58 UTC

btw, harking back to Theo's last comment --

'Array values need to have the system array copied into the user space, then
modified.  This handles both adds and deletes (think unwhitelist_from, etc.)
Hash values have the same issue as arrays.'

that is very slow when switching users; performing lookups in both hashes is
more efficient overall.

also, IMO we'd be better off without Storable.  there are certainly bugs in that
code, I'm almost certain it's behind the hyperthreading hangs --  and I think
we're stressing it more than anyone else is.

btw I have a patch that implements all this, but since it's part of my McAfee
work, it's stuck waiting for approval there. :(

Comment 6 Theo Van Dinter 2005-03-21 18:18:32 UTC

Subject: Re:  better way to have user configs

On Mon, Mar 21, 2005 at 05:35:58PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 'Array values need to have the system array copied into the user space, then
> modified.  This handles both adds and deletes (think unwhitelist_from, etc.)
> Hash values have the same issue as arrays.'
> 
> that is very slow when switching users; performing lookups in both hashes is
> more efficient overall.

How do you handle deletes then?  There's no way to mark that something has
been taken out of the system array.  The hashes aren't so bad, since I don't
think we actually delete keys from hashes, but we definitely delete elements
from arrays.

> also, IMO we'd be better off without Storable.  there are certainly bugs in that
> code, I'm almost certain it's behind the hyperthreading hangs --  and I think
> we're stressing it more than anyone else is.

Agreed.

> btw I have a patch that implements all this, but since it's part of my McAfee
> work, it's stuck waiting for approval there. :(

:(

Comment 7 Justin Mason 2005-03-21 18:23:39 UTC

deletes -- "$item{foo} = undef" and test for exists() to see if something is set
in a given config level.

Comment 8 Theo Van Dinter 2005-03-21 18:37:33 UTC

Subject: Re:  better way to have user configs

On Mon, Mar 21, 2005 at 06:23:40PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> deletes -- "$item{foo} = undef" and test for exists() to see if something is set
> in a given config level.

For hashes, sure (assuming undef isn't a valid value).  How about
an array though?

@sys = ( 1, 2, 3, 4, 5 );

Now I want to delete $sys[2].  How can I specify that in a different array?
The same undef trick will not work here for various reasons.

Comment 9 Auto-Mass-Checker 2005-03-21 18:46:27 UTC

Subject: Re:  better way to have user configs 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


bugzilla-daemon@bugzilla.spamassassin.org writes:
> For hashes, sure (assuming undef isn't a valid value).  How about
> an array though?
> 
> @sys = ( 1, 2, 3, 4, 5 );
> 
> Now I want to delete $sys[2].  How can I specify that in a different array?
> The same undef trick will not work here for various reasons.

as far as I know, $sys[2] = undef; does the trick, right?
then if (defined $sys[2]) will return false to indicate deletion,
but "foreach" and scalar @sys will still indicate the right number
of elements.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCP4a8MJF5cimLx9ARAqM8AJ9uSsznd/pBL9WqGP1vs3Or4fGSQQCgn7ob
PGTOqgZTAdTZ0kBg22XvcHw=
=XsNw
-----END PGP SIGNATURE-----

Comment 10 Theo Van Dinter 2005-03-21 18:55:25 UTC

Subject: Re:  better way to have user configs

On Mon, Mar 21, 2005 at 06:46:27PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> > @sys = ( 1, 2, 3, 4, 5 );
> 
> as far as I know, $sys[2] = undef; does the trick, right?
> then if (defined $sys[2]) will return false to indicate deletion,
> but "foreach" and scalar @sys will still indicate the right number
> of elements.

Well, @sys is the system array which we don't want to change.

So we could do: $user[2] = undef, but that makes $user[0..1] = undef
as well, so now it looks like 3 things are deleted.  The only way to
make the undef thing work is to copy over the original array and then
do the undef, but of course then just splicing out the value would be
better in that situation.

Comment 11 Auto-Mass-Checker 2005-03-21 21:18:50 UTC

Subject: Re:  better way to have user configs 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> > > @sys = ( 1, 2, 3, 4, 5 );
> > 
> > as far as I know, $sys[2] = undef; does the trick, right?
> > then if (defined $sys[2]) will return false to indicate deletion,
> > but "foreach" and scalar @sys will still indicate the right number
> > of elements.
> 
> Well, @sys is the system array which we don't want to change.

sure, I was just talking about "an" array, one that happened
to be called @sys. ;)

> So we could do: $user[2] = undef, but that makes $user[0..1] = undef
> as well, so now it looks like 3 things are deleted.  The only way to
> make the undef thing work is to copy over the original array and then
> do the undef, but of course then just splicing out the value would be
> better in that situation.

why does it make 0..1 undef?

perl -e 'my @foo=qw(a b c d e); $foo[2] = undef; print join(",", @foo);'
a,b,,d,e

looks good to me!

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCP6pxMJF5cimLx9ARAlymAJ9GrqAz+LxsnCaGod7yyD20ec1OfwCggAFS
n43eU/5iU75aC08YlWjL1hQ=
=/z22
-----END PGP SIGNATURE-----

Comment 12 Theo Van Dinter 2005-03-21 22:06:01 UTC

Subject: Re:  better way to have user configs

On Mon, Mar 21, 2005 at 09:18:50PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> why does it make 0..1 undef?
> perl -e 'my @foo=qw(a b c d e); $foo[2] = undef; print join(",", @foo);'
> a,b,,d,e
> looks good to me!

Right, but the point is to have @sys be untouched, and then have any updates
go into a different area.  So:

@foo = qw(a b c d e);
$bar[2] = undef;

How cat I get (a,b,d,e) out of that?  $bar[0] and $bar[1] are undef as well
(as are $bar[4] and $bar[5] for that matter).  So we can't do a
if(defined($bar[###])) ...   We could create a vec and set 0..$# to 1 to
start, then set to 0 as appropriate.  Then put additions in a new array and
concat as necessary.

Comment 13 Auto-Mass-Checker 2005-03-22 10:49:25 UTC

Subject: Re:  better way to have user configs 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

bugzilla-daemon@bugzilla.spamassassin.org writes:
> Right, but the point is to have @sys be untouched, and then have any updates
> go into a different area.  So:
> 
> @foo = qw(a b c d e);
> $bar[2] = undef;
> 
> How cat I get (a,b,d,e) out of that?  $bar[0] and $bar[1] are undef as well
> (as are $bar[4] and $bar[5] for that matter).  So we can't do a
> if(defined($bar[###])) ...   We could create a vec and set 0..$# to 1 to
> start, then set to 0 as appropriate.  Then put additions in a new array and
> concat as necessary.

I think I need a real-worldish example.  As far as I can see, the
only array type in Conf is 'bayes_ignore_header', which has no
corresponding 'bayes_unignore_header' removal function anyway,
so we may be arguing all over the place about a nonexistent
problem.

(PS: I am suggesting that code outside of Conf.pm that performs lookups
will need to change, anyway, in case that's not clear.  That's not
avoidable in my opinion.)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCQGhsMJF5cimLx9ARAvbzAJwPAWNt8JM+R+uoyHqwq8kocik6oACdEC/s
mqw54j4yzTmQ/U+UIhkxvvk=
=BV//
-----END PGP SIGNATURE-----

Comment 14 Theo Van Dinter 2005-03-23 12:18:52 UTC

Subject: Re:  better way to have user configs

On Tue, Mar 22, 2005 at 10:49:26AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I think I need a real-worldish example.  As far as I can see, the
> only array type in Conf is 'bayes_ignore_header', which has no
> corresponding 'bayes_unignore_header' removal function anyway,
> so we may be arguing all over the place about a nonexistent
> problem.

Hrm.  I was going to say whitelists, but those seem to be hashes now.  Ok
never mind. <G>

Another issue then is that we'll want to make sure plugins and such know to
not do any form of array deletes, since it won't be supported.

Comment 15 Daniel Quinlan 2005-04-03 01:46:39 UTC

No way we should do this in 3.1 without risking too much instability.
Moving to 3.2 milestone.

Comment 16 Theo Van Dinter 2006-09-10 07:03:55 UTC

Just thought I'd throw this out there.

I haven't fully thought it out yet, but I had an idea about having multiple
tiers for configs.   The system config gets put onto the first tier, and then
user configs are handled on a different tier.  Tiers could be easily implemented
with arrays, so the system is [0] and the user's is [1] (there could be reasons
to have more tiers than that, hence an array instead of just 2 variables -- ie:
system config in [0], site config in [1], user config in [2], etc.)

So generally, spamassassin/sa-learn/etc just uses the first tier.  spamd would
create an empty second tier in the children before reading in the user config,
then simply throws it out (pop()) when finished.

The main issues are:

all of the code needs to stop using conf references and change to some type of
get/set functions.  get() will check the top tier, then work down until it finds
something (or not).  set() would always use the top tier.  deletes to
arrays/hashes would happen somehow, and generally arrays and hashes would have
to be copied to the higher tier for changes to take effect.

Comment 17 Justin Mason 2006-09-12 09:43:39 UTC

Theo, you're spot on -- the method you describe would work fine.  Note that the
get/set methods can be created automatically, on-the-fly, as described in
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3852#c2 .  That makes
things cleaner.

Comment 18 Justin Mason 2006-12-05 05:51:59 UTC

moving off to 3.3.0, I doubt it's going to happen for 3.2.0... feel free to
retarget, of course

Comment 19 Justin Mason 2007-04-16 05:11:59 UTC

noting dependency; this would be a good way to improve memory performance (bug 3839)

Comment 20 Justin Mason 2007-07-12 15:36:04 UTC

Created attachment 4047 [details]
demo

here's an idea of where I'm going with this; basically, this creates accessor
methods automatically for the config settings with a 'type' defined.

Then calling code instead of using $conf->{foo_bar}, uses $conf->cf_foo_bar().

Comment 21 Justin Mason 2007-08-03 08:34:20 UTC

ok, I should add some comments based on how far I got.

There's a branch in SVN now at
https://svn.apache.org/repos/asf/spamassassin/branches/jm_bug_3852_two_level_configs
which contains an implementation of what was described; however, it's
turned out to be a nightmare, particularly for plugins :(  Having to
understand the new Conf object structure, with two tiers at $conf->{tiers}->[0]
and $conf->{tiers}->[1], means that no plugins will do the right thing
without code changes. I'd prefer to avoid thatm since it's bad news
for third party plugins.

Instead, I have an idea, which involves turning the Conf object into
a tied object, presenting a hash-like UI in front of two tier hashes.
This way, plugins (and other code) can read and write to the Conf
object with minimal changes.  still need to think more about how
this would work, though.

Comment 22 Justin Mason 2007-08-21 06:10:49 UTC

well, the code is progressing; it now handles two-tier
configuration for most of the main part of the code, and passes most
of the test suite.

Unfortunately, this is coming with a hefty speed hit. 
SVN trunk, mass-check of ~220 msgs: 
real    1m15.214s
user    0m23.093s
sys     0m0.332s

in the bug_3852 branch:
real    1m30.468s
user    0m33.762s
sys     0m0.272s

that's a 20% slowdown -- and mass-check doesn't even switch in/out
user configs at all.   not good!

I don't think it's worth it; even duplicating the entire Conf
object around, as we currently do, is faster than the overhead
this way involves.  I think I'll shelve the project for now.

I'll leave the branch there in case anyone wants to take a look.

Comment 23 Justin Mason 2007-08-21 06:11:58 UTC

(In reply to comment #22)
> well, the code is progressing; it now handles two-tier
> configuration for most of the main part of the code, and passes most
> of the test suite.

I should point out, that's using the tie magic I was talking about
in comment #21.

Comment 24 Justin Mason 2007-12-12 13:26:32 UTC

I'm closing this bug as WONTFIX; it hasn't worked out.