3096 – [review] improve mass-check so that it can be run from a different directory

Bug 3096 - [review] improve mass-check so that it can be run from a different directory

Summary: [review] improve mass-check so that it can be run from a different directory

Status:	RESOLVED DUPLICATE of bug 2853

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Tools (show other bugs)
Version:	SVN Trunk (Latest Devel Version)
Hardware:	Other other

Importance:	P5 enhancement
Target Milestone:	3.1.0
Assignee:	Duncan Findlay

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2004-02-26 20:16 UTC by Gary Funck
Modified:	2004-05-13 12:14 UTC (History)
CC List:	2 users (show)

Attachment	Type	Modified	Status	Actions	Submitter/CLA Status
Proposed patch	patch			None	Duncan Findlay
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gary Funck 2004-02-26 20:16:40 UTC

[Just in case anyone is thinking about re-orging the programs under 
the 'masses' directory, I'm passing these ideas along. If I had more time, I'd 
take a crack it, but can't do so at the moment.]

I want to run mass-check in much the same fashion that Bob Menschel documents 
in http://www.exit0.us/index.php/BobCorpusTestV2. However, I'd prefer it if I 
could keep my corpus, rules directory, user_prefs directory, and results 
directory in a directory other than SA's build directory. This is mostly 
possilbe but there are a few places where mass-check depends upon being
run within the SA hierarchy:

This is okay, because we can invoke mass-check by its full path under the build 
directory:
use lib "$FindBin::Bin/../lib";
and this is okay because we can override it with the -c option:
use lib "$FindBin::Bin/../lib";

but these are problematic:
$spamtest = new Mail::SpamAssassin ({
  'debug'                               => $opt_debug,
  'rules_filename'                      => $opt_c,
  'userprefs_filename'                  
=> "$FindBin::Bin/spamassassin/user_prefs",
  'site_rules_filename'                 
=> "$FindBin::Bin/spamassassin/local.cf",
  'userstate_dir'                       => "$FindBin::Bin/spamassassin",
because they (apparently) don't have any switches that can be used to
override those settings.

In a separate vein, many of the tools in the mass-check directory that 
calculate hit frequencies, scores and such, often obtain their raw data by 
either parsing reports written by other tools, or by for example reading in a 
file produced by parsing the rules defintioons, via a "use" directive. This all 
works but can make writing other tools somewhat more difficult because it is 
necessary to cut and various routines that parse reports, etc, from other 
program source files, and sometimes this parsing is rather specific to a 
particular program's needs and it has to be modified. Better, I think, would be 
to move the various functions for parsing rules, calculating statistics of 
various sorts, and so on, into an easier to understand and to use, module 
structure.

Comment 1 Gary Funck 2004-02-26 20:27:24 UTC

Subject: RE:  New: RFE: improve mass-check so that it can be run from a different directory, and implement module structure to support 'masses' tools




> and this is okay because we can override it with the -c option:
> use lib "$FindBin::Bin/../lib";

that should be:
$opt_c = "$FindBin::Bin/../rules";

Comment 2 Justin Mason 2004-03-17 09:15:26 UTC

related to bug 2853

Comment 3 Duncan Findlay 2004-03-22 22:11:01 UTC

Created attachment 1867 [details]
Proposed patch

Okay this patch allows mass-check to be run anywhere. If the --dist option is
given, (or it auto-detects, based on ../spamassassin.raw) it assumes it's in
its usual home, masses/. Otherwise, it should pick up the default paths for
stuff (local rules, user_prefs, etc) and it should use
~/.spamassassin/mass-check as userstate, unless overriden.

Instead of output to the traditional spam.log and ham.log, it outputs to a
single file, masses.log. (jm suggested this in bug 2853 comment 2)

The first character of the line is now s, h, N or P depending on whether the
message was correctly marked as spam, correctly marked as ham, a false negative
or false positive. The intent was that this is human readable, and useful for
quick statistics.

On IRC quinlan suggested that this might not be ideal, and perhaps two fields
would be better, or that one field could be s-c, s-m for spam-caught,
spam-missed, etc.

I chose the field the way I did based on the fact that the old scripts
generally expect a one character field, so it required less changes. Since I
intend to re-write the rest of the masses/ stuff, changing it shouldn't be a
problem. Also, I find a two field system confusing to read, I always forget
which is which ;-)

Also, this code is probably not tested as well as it should be. I'm not running
HEAD, so it's tough for me to test (not making excuses, just saying...).

Comment 4 Duncan Findlay 2004-03-22 22:14:21 UTC

implement module stuff is bug 2853.

ccing quinlan since we need to coordinate to prevent the impact on nightly-builds.

Also, assigning to me.

Comment 5 Duncan Findlay 2004-03-27 14:02:44 UTC

Is nobody going to comment on this??

If so, I'm going to have to commit.... ;-)

Comment 6 Daniel Quinlan 2004-03-27 15:55:53 UTC

> Is nobody going to comment on this??

-0.5 not quite ready yet

The patch really needs to fix all of the scripts in masses to handle the
new format, including the ones under rule-qa.

I agree with using a single output file, but I'd like the new format
(the format not the code itself) to be sufficient to handle:

- message classes other than spam and ham (both for the manual
  classification as well as the result determined by SpamAssassin).

- using "unknown" or "none" for unclassified checks (like
  "./mass-check --mbox file.mbox" with no --spam or --ham
  argument or class specification.

I propose the following format:

<manual class> <result class> <score> <id> <rules> <value pairs>

where, for our current code, <manual class> is:

  "spam" | "ham" | "none"

and <result class> is:

  "spam" | "ham"

I'd also change the score to be the precise floating point score while
we're at it (that, at least, should be an easy change).

Comment 7 Duncan Findlay 2004-03-27 18:54:21 UTC

Subject: Re:  [review] improve mass-check so that it can be run from a different directory

> -0.5 not quite ready yet

True.
 
> The patch really needs to fix all of the scripts in masses to handle the
> new format, including the ones under rule-qa.

I plan on re-writing most of the other scripts (I'm part way
there). But as far as rule-qa, I'll probably need some help sorting
through it.

> I propose the following format:
> 
> <manual class> <result class> <score> <id> <rules> <value pairs>
> 
> where, for our current code, <manual class> is:
> 
>   "spam" | "ham" | "none"
> 
> and <result class> is:
> 
>   "spam" | "ham"

I'd prefer that we stick with single characters, since that is what
ArchiveIterator does. (It passes "s" or "h" around instead of "spam"
or "ham") Furthermore, having it fixed width is a good thing imho.

> I'd also change the score to be the precise floating point score while
> we're at it (that, at least, should be an easy change).

Agreed, but all the scripts need to change too. Since I'm rewriting
them to share code, that won't be hard.

Comment 8 Justin Mason 2004-03-28 20:49:21 UTC

+1 on duncan's latest comments.

'> I propose the following format:
> <manual class> <result class> <score> <id> <rules> <value pairs>
> where, for our current code, <manual class> is:
>   "spam" | "ham" | "none"
> and <result class> is:
>   "spam" | "ham"
I'd prefer that we stick with single characters, since that is what
ArchiveIterator does. (It passes "s" or "h" around instead of "spam"
or "ham") Furthermore, having it fixed width is a good thing imho.'

what about:

<manual class><result class> <score> <id> <rules> <value pairs>

with one-letter classes.  That gives us:

hh: manually ham, classed as ham
hs: false positive
sh: false negative
ss: manually spam, classed as spam

That's handy because (a) it's closer to what the academic lit uses (TCR
calculation in particular uses just those classes with pretty much that
nomenclature in its computation), (b) it's very logical and obvious, (c) it fits
in 2 bytes, so fixed width, (d) it fits in one non-whitespace "token" so very
little script modification will be required in rule-qa et al. where
/\S+\s+\S+etc./ is used.

The "no manual classification" type would then be

us: unknown, marked as spam
uh: unknown, marked as ham

like this:

hh 0 ...path... RULES bayes=0.001
hh 0 ...path... RULES bayes=0.001

Comment 9 Duncan Findlay 2004-03-28 22:58:36 UTC

Subject: Re:  [review] improve mass-check so that it can be run from a different directory

> what about:
> 
> <manual class><result class> <score> <id> <rules> <value pairs>
> 
> with one-letter classes.  That gives us:
> 
> hh: manually ham, classed as ham
> hs: false positive
> sh: false negative
> ss: manually spam, classed as spam

The only problem with that is it's slightly harder for backward compat
if we ever need to have multiple character classes. (With a space we
can change a (\w)\s+(\w) to a (\w+)\s+(\w+) and still support the old
format). Not really a big deal, though.

Comment 10 Daniel Quinlan 2004-03-28 23:23:35 UTC

Subject: Re:  [review] improve mass-check so that it can be run from a different directory

bugzilla-daemon  <bugzilla-daemon@bugzilla.spamassassin.org> writes:

> I'd prefer that we stick with single characters, since that is what
> ArchiveIterator does. (It passes "s" or "h" around instead of "spam"
> or "ham") Furthermore, having it fixed width is a good thing imho.'

Changing ArchiveIterator is trivial, so I don't think this is a reason
worth considering.  If I recall correctly, the single character thing is
my fault, I just think it was a bad design design in retrospect.

> with one-letter classes.  That gives us:
>
> hh: manually ham, classed as ham
> hs: false positive
> sh: false negative
> ss: manually spam, classed as spam

I think the classes at least need to be separated by some character to
allow multiple-character classes.  If not whitespace, then a ',' or a
'-' character.

I'm certain the single character stuff is hack crap.  We should drop it.

Daniel

Comment 11 Daniel Quinlan 2004-03-28 23:25:00 UTC

Subject: Re:  [review] improve mass-check so that it can be run from a different directory

> The only problem with that is it's slightly harder for backward compat
> if we ever need to have multiple character classes. (With a space we
> can change a (\w)\s+(\w) to a (\w+)\s+(\w+) and still support the old
> format). Not really a big deal, though.

I agree 100% and I think it is a moderately-sized deal.

Comment 12 Justin Mason 2004-03-29 11:14:42 UTC

ok, I'll let you two decide on it then, I don't really care that much either way ;)

Comment 13 Duncan Findlay 2004-03-29 20:24:50 UTC

I'm going to leave it as single characters but in the form
<manual class> <result class> <score> <id> <rules> ...

I'll also fix the score to be %05.2 rather than %2d

Comment 14 Duncan Findlay 2004-05-13 20:14:16 UTC

I'm fixing this at the same time as 2853, so marking as dupe to consolidate.

*** This bug has been marked as a duplicate of 2853 ***