Bug 5148 - [patch] make SA work with multiple pyzor servers
Summary: [patch] make SA work with multiple pyzor servers
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: 3.1.6
Hardware: Other other
: P5 enhancement
Target Milestone: 3.2.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-24 12:05 UTC by John Hein
Modified: 2006-12-11 03:42 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
patch for Pyzor.pm to pick max score from multiple pyzor servers patch None John Hein [NoCLA]
Multi-server support (cumulative) patch for Pyzor.pm patch None Matija Nalis [NoCLA]
sum results from multiple pyzor servers patch None John Hein [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description John Hein 2006-10-24 12:05:43 UTC
I have a patch (see attachment) against the pyzor plugin in SpamAssassin 3.1.6.
 It loops over the responses from 'pyzor check' and picks the max.

The server that 'pyzor discover' comes up with is heavily loaded and times out a
LOT.  Milton Cyrus (on the pyzor-users mailing list) added a new pyzor server
earlier this year.  Thanks to Milton's server and this patch, SA is using pyzor
effectively again to contribute to the spam score.

If you do put more than one server in ~/.pyzor/servers, you should
also either lengthen the SA timeout (pyzor_timeout) from the default
5... or shorten the default pyzor timeout of 5.

I did the latter:

% cat ~/.pyzor/servers
82.94.255.100:24441
66.250.40.33:24441

% cat ~/.pyzor/config
[client]
ServersFile = servers
Timeout = 3

So when 66.250.40.33 times out (which it does a lot), it won't cause SA to
invalidate other legit responses from other servers by making SA time out
waiting for the slow server.


This patch has some extra debug in it and was intentionally not
re-indented in order to highlight just the substantive changes.
If it goes into SA,

- fix it to conform to the right indent/coding style and be less hacky

- possibly consider a different algorithm than "pick the max count
  from all responding servers"

- change the default plugin timeout to be a bit more than the default
  pyzor timeout.
Comment 1 John Hein 2006-10-24 12:22:09 UTC
Created attachment 3726 [details]
patch for Pyzor.pm to pick max score from multiple pyzor servers
Comment 2 Matija Nalis 2006-10-27 11:26:37 UTC
Created attachment 3732 [details]
Multi-server support (cumulative) patch for Pyzor.pm

This is my version of patch, which uses multiple servers in cumulative score
mode. 

It is IMHO preffered, as pyzor servers are _not_ synced, and if one is more
used then others then others are effectively ignored with original patch. 

Any user can modify pyzor_max depending on the percieved usage of secondary
server, thus accounting for incresed pyzor scores.

It also resets $pyzor_count to 0 if message is whitelisted as original pyzor
plugin code does (to stop false/invalid reports)
Comment 3 John Hein 2006-10-27 11:43:42 UTC
Yep.  That is one of the "different algorithms" I thought about.

It's hard to say which is better.  In the current state of affairs (two known
public servers which don't sync with each other, one which seldom responds), the
"sum" approach is probably better.

I'll leave it at that.

+1
Comment 4 Justin Mason 2006-12-05 05:37:58 UTC
btw, I'm not a fan of the "sum" approach.  Pyzor is already too time-consuming
as a rule; doubling the potential lookup time is not a good thing.

I'd prefer to apply John's patch -- but is it worth doing this, given this issue:

'It is IMHO preffered, as pyzor servers are _not_ synced, and if one is more
used then others then others are effectively ignored with original patch.'
Comment 5 John Hein 2006-12-05 10:16:23 UTC
Neither patch should consume more time.  They are both designed to accomodate
the situation where you have two servers listed in ~/.pyzor/servers.  The pyzor
software will try to contact both servers regardless.

No, the pyzor servers are not sync'd at this point, although there is some
working going on to do that (see pyzor-users list discussions in the past few
months).  Once that is in place, it reduces the need for having more than one
server in your pyzor config.

In the mean time, this patch is useful, particularly since the "official" server
retrieved via 'pyzor discover' is not reliable.  The timeouts when talking to
that server are probably the main reason pyzor is, "time-consuming".  You can
lower that timeout in the pyzor config (I have done so since it almost always
seems to respond within a second or two or not at all - the five second default
is probably too high).

So to address the "sum vs. max (or even avg)" behavior of the patch...  I don't
think it matters in terms of time.  Summing makes the most sense if most people
are reporting to only one server.  Max or averaging would make the most sense if
most people are reporting to all servers.

At this point, I would think that we should start out summing.  Then as more
people report to N servers, switch to max or avg.  Or make it configurable
(probably not worth the effort).  Of course, it's hard to tell if most people
are reporting to just one server or not.  However, with the current
implementation in SA, if you report to more than one, SA gets confused.

So, I would guess that for most SA users, they either use one server or never
get scores from pyzor.  So assuming that most SA users use one pyzor server
would be the safe bet for now.  And that's why, if I had to choose, I would pick
the summing method at this time (and maybe put some explanatory comments in the
code).

It shouldn't affect whether pyzor takes more time or not.  But you may want to
lower the default timeout for pyzor.  As I said above, 2 seconds is probably
fine, but that's subjective - I haven't run a scientific test.

If you want a better patch (comments, lower default timeout) and don't want to
do much work beyond review/commit, let me know.
Comment 6 Justin Mason 2006-12-05 10:36:04 UTC
ok. if it doesn't effect runtime, that's good.

'If you want a better patch (comments, lower default timeout) and don't want to
do much work beyond review/commit, let me know.'

that would be great. ;)
Comment 7 John Hein 2006-12-05 10:41:47 UTC
p.s. I know I recommended that we raise the default timeout in my original
description.  So part of my last message might seem contradictory.

I'd say it's best to have a default pyzor timeout (in the pyzor config, not SA)
of say, 2 or 3.  This timeout is per server.

So in order not to short circuit that timeout, it's best for the pyzor_timeout
in SA to be something like N * pyzor's timeout + 1  (N because the pyzor lookups
are not threaded, but end to end).

Unfortunately, pyzor code doesn't take this config setting on the command line.
 So SA has no way of knowing pyzor's timeout without reading the pyzor config.

Maybe it's just best to document a recommended setup.  My recommendation would
be 2 or 3 seconds for the pyzor (not SA) timeout and for SA's pyzor_timeout, use
3 or 4 seconds (and document how that interacts with pyzor's config).  This
would accomodate one pyzor server timing out.   Document that bumping up the
overall SA pyzor_timeout might be useful the more pyzor servers you use.

What we really want is to check 1 or 2 very reliable pyzor servers and report to
as many as possible.  You could do this with command line args to pyzor pointing
it at different config/server files.  But now we're beyond the intent of my
original patch (to just get SA working) and may be overcome by events when
people get together and sync pyzor servers (ala DCC perhaps).
Comment 8 John Hein 2006-12-05 10:44:44 UTC
jm wrote:
 > that would be great. ;)

Okay.  I'll come up with a new patch that uses the summing method and adds
comments/docs.  Gimme a day.

Then you can review / clean up style / commit.
Thanks for shepherding this.
Comment 9 John Hein 2006-12-07 00:29:33 UTC
Created attachment 3766 [details]
sum results from multiple pyzor servers

Okay... maybe a couple days.

This patch does the following:

 - sum results from responses from all pyzor servers

 - document timeout better (including refs to Pyzor's own timeout)

 - lower overall Pyzor timeout a bit - my testing shows that a pyzor
   server responds in less than a second (on a T1) or not at all.
Comment 10 Justin Mason 2006-12-08 07:20:40 UTC
looks good: will apply this later, when I get more tuits.

so, just to check -- the pyzor script will check multiple pyzor servers *in
parallel*, right?  not in serial?

just want to make sure I have the facts straight. ;)
Comment 11 John Hein 2006-12-08 09:41:30 UTC
The patch will not enforce whether Pyzor looks up the server in series or parallel.

Pyzor itself is where that is done.  And as of the version I have (using FreeBSD
pkg rev pyzor-0.4.0_4), it is doing the lookups in series.

It's up to the user to put more than one server in his .pyzor/servers file.  If
he does so and multiple servers are regularly not very responsive, he should
reconsider the pyzor server config.

This patch just keeps SA from not working at all if there are more than one
response from 'pyzor check'.  Before this patch, you could get two or more good
responses in less than a second, but the SA plugin would not understand the
response(s), throw up its hand and use _none_ of the legit responses.

IMO, since most people are probably using the official 'pyzor discover'
published server (66.250.40.33) it will typically be just as slow whether you
add another server or not.  Especially with the default timeout settings (5 per
server for pyzor, and 5 total for SA).

I have these two servers in my config:

82.94.255.100:24441
66.250.40.33:24441

The former is almost always responsive (that could change of course).

If both respond when doing 'pyzor check', they do so typically in
a total of half a second or less.  If any one is not responding, you will hit
timeouts.  As I said, I think it's better to set the per-server timeout lower in
pyzor and the overall SA timeout just a bit larger.

This will all probably be much better if and when server syncing is added and
the world gets a few more servers.

We could add to the docs that I updated something like:

"As of this writing, Pyzor communicates with servers in series, not in parallel."

If I find a round tuit, I'll send it along.
Comment 12 Justin Mason 2006-12-11 03:42:43 UTC
ok, applied to 3.2.0

: jm 252...; svnc "bug 5148: fix Pyzor plugin to support lookups against
multiple servers, summing their results.  improve Pyzor docs regarding timeouts,
and lower the default timeout to 3.5 seconds.  thanks to John Hein <jhein at
timing.com>" lib/Mail/SpamAssassin/Plugin/Pyzor.pm
Sending        lib/Mail/SpamAssassin/Plugin/Pyzor.pm
Transmitting file data .
Committed revision 485624.