SA Bugzilla – Bug 3096
[review] improve mass-check so that it can be run from a different directory
Last modified: 2004-05-13 12:14:16 UTC
[Just in case anyone is thinking about re-orging the programs under the 'masses' directory, I'm passing these ideas along. If I had more time, I'd take a crack it, but can't do so at the moment.] I want to run mass-check in much the same fashion that Bob Menschel documents in http://www.exit0.us/index.php/BobCorpusTestV2. However, I'd prefer it if I could keep my corpus, rules directory, user_prefs directory, and results directory in a directory other than SA's build directory. This is mostly possilbe but there are a few places where mass-check depends upon being run within the SA hierarchy: This is okay, because we can invoke mass-check by its full path under the build directory: use lib "$FindBin::Bin/../lib"; and this is okay because we can override it with the -c option: use lib "$FindBin::Bin/../lib"; but these are problematic: $spamtest = new Mail::SpamAssassin ({ 'debug' => $opt_debug, 'rules_filename' => $opt_c, 'userprefs_filename' => "$FindBin::Bin/spamassassin/user_prefs", 'site_rules_filename' => "$FindBin::Bin/spamassassin/local.cf", 'userstate_dir' => "$FindBin::Bin/spamassassin", because they (apparently) don't have any switches that can be used to override those settings. In a separate vein, many of the tools in the mass-check directory that calculate hit frequencies, scores and such, often obtain their raw data by either parsing reports written by other tools, or by for example reading in a file produced by parsing the rules defintioons, via a "use" directive. This all works but can make writing other tools somewhat more difficult because it is necessary to cut and various routines that parse reports, etc, from other program source files, and sometimes this parsing is rather specific to a particular program's needs and it has to be modified. Better, I think, would be to move the various functions for parsing rules, calculating statistics of various sorts, and so on, into an easier to understand and to use, module structure.
Subject: RE: New: RFE: improve mass-check so that it can be run from a different directory, and implement module structure to support 'masses' tools > and this is okay because we can override it with the -c option: > use lib "$FindBin::Bin/../lib"; that should be: $opt_c = "$FindBin::Bin/../rules";
related to bug 2853
Created attachment 1867 [details] Proposed patch Okay this patch allows mass-check to be run anywhere. If the --dist option is given, (or it auto-detects, based on ../spamassassin.raw) it assumes it's in its usual home, masses/. Otherwise, it should pick up the default paths for stuff (local rules, user_prefs, etc) and it should use ~/.spamassassin/mass-check as userstate, unless overriden. Instead of output to the traditional spam.log and ham.log, it outputs to a single file, masses.log. (jm suggested this in bug 2853 comment 2) The first character of the line is now s, h, N or P depending on whether the message was correctly marked as spam, correctly marked as ham, a false negative or false positive. The intent was that this is human readable, and useful for quick statistics. On IRC quinlan suggested that this might not be ideal, and perhaps two fields would be better, or that one field could be s-c, s-m for spam-caught, spam-missed, etc. I chose the field the way I did based on the fact that the old scripts generally expect a one character field, so it required less changes. Since I intend to re-write the rest of the masses/ stuff, changing it shouldn't be a problem. Also, I find a two field system confusing to read, I always forget which is which ;-) Also, this code is probably not tested as well as it should be. I'm not running HEAD, so it's tough for me to test (not making excuses, just saying...).
implement module stuff is bug 2853. ccing quinlan since we need to coordinate to prevent the impact on nightly-builds. Also, assigning to me.
Is nobody going to comment on this?? If so, I'm going to have to commit.... ;-)
> Is nobody going to comment on this?? -0.5 not quite ready yet The patch really needs to fix all of the scripts in masses to handle the new format, including the ones under rule-qa. I agree with using a single output file, but I'd like the new format (the format not the code itself) to be sufficient to handle: - message classes other than spam and ham (both for the manual classification as well as the result determined by SpamAssassin). - using "unknown" or "none" for unclassified checks (like "./mass-check --mbox file.mbox" with no --spam or --ham argument or class specification. I propose the following format: <manual class> <result class> <score> <id> <rules> <value pairs> where, for our current code, <manual class> is: "spam" | "ham" | "none" and <result class> is: "spam" | "ham" I'd also change the score to be the precise floating point score while we're at it (that, at least, should be an easy change).
Subject: Re: [review] improve mass-check so that it can be run from a different directory > -0.5 not quite ready yet True. > The patch really needs to fix all of the scripts in masses to handle the > new format, including the ones under rule-qa. I plan on re-writing most of the other scripts (I'm part way there). But as far as rule-qa, I'll probably need some help sorting through it. > I propose the following format: > > <manual class> <result class> <score> <id> <rules> <value pairs> > > where, for our current code, <manual class> is: > > "spam" | "ham" | "none" > > and <result class> is: > > "spam" | "ham" I'd prefer that we stick with single characters, since that is what ArchiveIterator does. (It passes "s" or "h" around instead of "spam" or "ham") Furthermore, having it fixed width is a good thing imho. > I'd also change the score to be the precise floating point score while > we're at it (that, at least, should be an easy change). Agreed, but all the scripts need to change too. Since I'm rewriting them to share code, that won't be hard.
+1 on duncan's latest comments. '> I propose the following format: > <manual class> <result class> <score> <id> <rules> <value pairs> > where, for our current code, <manual class> is: > "spam" | "ham" | "none" > and <result class> is: > "spam" | "ham" I'd prefer that we stick with single characters, since that is what ArchiveIterator does. (It passes "s" or "h" around instead of "spam" or "ham") Furthermore, having it fixed width is a good thing imho.' what about: <manual class><result class> <score> <id> <rules> <value pairs> with one-letter classes. That gives us: hh: manually ham, classed as ham hs: false positive sh: false negative ss: manually spam, classed as spam That's handy because (a) it's closer to what the academic lit uses (TCR calculation in particular uses just those classes with pretty much that nomenclature in its computation), (b) it's very logical and obvious, (c) it fits in 2 bytes, so fixed width, (d) it fits in one non-whitespace "token" so very little script modification will be required in rule-qa et al. where /\S+\s+\S+etc./ is used. The "no manual classification" type would then be us: unknown, marked as spam uh: unknown, marked as ham like this: hh 0 ...path... RULES bayes=0.001 hh 0 ...path... RULES bayes=0.001
Subject: Re: [review] improve mass-check so that it can be run from a different directory > what about: > > <manual class><result class> <score> <id> <rules> <value pairs> > > with one-letter classes. That gives us: > > hh: manually ham, classed as ham > hs: false positive > sh: false negative > ss: manually spam, classed as spam The only problem with that is it's slightly harder for backward compat if we ever need to have multiple character classes. (With a space we can change a (\w)\s+(\w) to a (\w+)\s+(\w+) and still support the old format). Not really a big deal, though.
Subject: Re: [review] improve mass-check so that it can be run from a different directory bugzilla-daemon <bugzilla-daemon@bugzilla.spamassassin.org> writes: > I'd prefer that we stick with single characters, since that is what > ArchiveIterator does. (It passes "s" or "h" around instead of "spam" > or "ham") Furthermore, having it fixed width is a good thing imho.' Changing ArchiveIterator is trivial, so I don't think this is a reason worth considering. If I recall correctly, the single character thing is my fault, I just think it was a bad design design in retrospect. > with one-letter classes. That gives us: > > hh: manually ham, classed as ham > hs: false positive > sh: false negative > ss: manually spam, classed as spam I think the classes at least need to be separated by some character to allow multiple-character classes. If not whitespace, then a ',' or a '-' character. I'm certain the single character stuff is hack crap. We should drop it. Daniel
Subject: Re: [review] improve mass-check so that it can be run from a different directory > The only problem with that is it's slightly harder for backward compat > if we ever need to have multiple character classes. (With a space we > can change a (\w)\s+(\w) to a (\w+)\s+(\w+) and still support the old > format). Not really a big deal, though. I agree 100% and I think it is a moderately-sized deal.
ok, I'll let you two decide on it then, I don't really care that much either way ;)
I'm going to leave it as single characters but in the form <manual class> <result class> <score> <id> <rules> ... I'll also fix the score to be %05.2 rather than %2d
I'm fixing this at the same time as 2853, so marking as dupe to consolidate. *** This bug has been marked as a duplicate of 2853 ***