SA Bugzilla – Full Text Bug Listing

Summary:          3.2.0 rescoring
Product:          Spamassassin
Component:        Score Generation
Reporter:         Justin Mason <jm>
Assignee:         SpamAssassin Developer Mailing List <dev>
Status:           RESOLVED FIXED
Severity:         blocker
Priority:         P5
CC:               schneecrash+apache
Version:          SVN Trunk (Latest Devel Version)
Target Milestone: 3.2.0
Hardware:         Other
OS:               other
Bug Depends on:   4687, 5271, 5284, 5285
Bug Blocks:       5257, 5110
Attachments:      freqs; new scores
Description -- Justin Mason, 2007-01-03 05:42:59 UTC
just added several bugs blocking this one; they're the bugs in the 3.2.0 queue relating to rules which may affect the results of mass-checks.

I just ran a quick experiment on the zone over the past day, to see what perceptron tweaks work well on a 10% slice of last week's set1 logs, by searching the HAM_PREFERENCE space between [1.0 .. 30.0] in 0.5 increments, and THRESHOLD between [3.0 .. 10.0] in 0.25 increments (using an efficient tessellating algorithm, of course), with the "validate-model" stuff as described on http://wiki.apache.org/spamassassin/RunningPerceptron .

http://taint.org/x/2007/roc-test-set1.png is a ROC graph of the results. (I haven't multiplied the values by 100 to percentify them, so 1.0 == 100%, 0.1 == 10%, 0.01 == 1%, 0.001 == 0.1%, you get the idea.)

http://taint.org/x/2007/roc-test-set1.txt is the raw data for that ROC graph, sorted, in space-separated FP%, FN%, vm-name format. (ignore the "set3" typo; these are all "set1" logs really, since there are no Bayes results.)

interesting to note:

- the perceptron generally conforms nicely to a neat ROC curve, except for a "mirror" curve of occasional way-off results: those are the results where HAM_PREFERENCE==1.0. so we can discard that!

- the "sweet spot", IMO, is around 0.45% FPs, 3.9% FNs, which is vm-set3-8.25-5.1875-100 -- in other words, HAM_PREFERENCE=8.25 THRESHOLD=5.1875. I'll try a runGA with that.

- here's a sample scores file from that vm, http://taint.org/x/2007/roc-test-set1-scores.txt , if you're curious.

Henry, have I gone a bit overboard here? ;) what else should I be trying?

ok, scores for scoreset 3 are checked in... they seem pretty good:

gen-set3-2.0-5.0-100/test --
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67518  99.90%
# Correctly spam:     116723  98.30%
# False positives:        68   0.10%
# False negatives:      2015   1.70%
# TCR(l=50): 21.927608  SpamRecall: 98.303%  SpamPrec: 99.942%
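(a quick note on the metrics in these SUMMARY blocks, since they recur throughout this bug: TCR(l=50) is the standard "total cost ratio", with a false positive costed at 50 times a false negative, and all three summary figures can be recomputed from the raw counts. a minimal Perl sketch, illustrative only -- this is not one of the masses scripts:)

#!/usr/bin/perl -w
# recompute the SUMMARY metrics from the raw mass-check counts
use strict;

# the set 3 figures from the SUMMARY block above
my ($ok_ham, $ok_spam, $fp, $fn) = (67518, 116723, 68, 2015);
my $lambda = 50;                 # the "l=50" in TCR(l=50)

my $nspam  = $ok_spam + $fn;     # total spam in the test set
my $tcr    = $nspam / ($lambda * $fp + $fn);
my $recall = 100 * $ok_spam / $nspam;
my $prec   = 100 * $ok_spam / ($ok_spam + $fp);

printf "TCR(l=%d): %f SpamRecall: %.3f%% SpamPrec: %.3f%%\n",
       $lambda, $tcr, $recall, $prec;
# prints: TCR(l=50): 21.927608 SpamRecall: 98.303% SpamPrec: 99.942%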
However the perceptron has gone pretty haywire for sets 0, 1, and 2, producing seriously crappy results, e.g.

gen-set2-2.0-4.625-100/test --
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67919  99.75%
# Correctly spam:      70874  59.48%
# False positives:       167   0.25%
# False negatives:     48282  40.52%
# TCR(l=50): 2.104040  SpamRecall: 59.480%  SpamPrec: 99.765%

gen-set0-2.0-4.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67479  99.79%
# Correctly spam:      17951  15.10%
# False positives:       145   0.21%
# False negatives:    100935  84.90%
# TCR(l=50): 1.098914  SpamRecall: 15.099%  SpamPrec: 99.199%

gen-set0-2.0-5.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67057  99.37%
# Correctly spam:      42186  35.30%
# False positives:       426   0.63%
# False negatives:     77323  64.70%
# TCR(l=50): 1.211776  SpamRecall: 35.299%  SpamPrec: 99.000%

gen-set1-2.0-4.7-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67146  99.40%
# Correctly spam:      86230  72.41%
# False positives:       404   0.60%
# False negatives:     32853  27.59%
# TCR(l=50): 2.244604  SpamRecall: 72.412%  SpamPrec: 99.534%

gen-set1-2.0-6.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  66978  99.15%
# Correctly spam:      89378  75.06%
# False positives:       572   0.85%
# False negatives:     29705  24.94%
# TCR(l=50): 2.042415  SpamRecall: 75.055%  SpamPrec: 99.364%

gen-set1-2.0-7.0-300/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  66334  98.20%
# Correctly spam:      97985  82.28%
# False positives:      1216   1.80%
# False negatives:     21098  17.72%
# TCR(l=50): 1.454040  SpamRecall: 82.283%  SpamPrec: 98.774%

gen-set1-3.0-5.0-300/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67137  99.39%
# Correctly spam:      81410  68.36%
# False positives:       413   0.61%
# False negatives:     37673  31.64%
# TCR(l=50): 2.041785  SpamRecall: 68.364%  SpamPrec: 99.495%

gen-set2-2.0-4.625-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67919  99.75%
# Correctly spam:      70874  59.48%
# False positives:       167   0.25%
# False negatives:     48282  40.52%
# TCR(l=50): 2.104040  SpamRecall: 59.480%  SpamPrec: 99.765%

by comparison, the existing (3.1.0) scores produce these results on the test set:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67511  99.94%
# Correctly spam:      87965  73.87%
# False positives:        39   0.06%
# False negatives:     31118  26.13%
# TCR(l=50): 3.601155  SpamRecall: 73.869%  SpamPrec: 99.956%

The "scores" files are all very obviously iffy, full of zeroed scores, e.g.

score ACT_NOW_CAPS 2.700         # [0.000..2.700]
score ADVANCE_FEE_2 0.000        # [0.000..2.700]
score ADVANCE_FEE_3 0.000        # [0.000..3.600]
score ADVANCE_FEE_4 3.900        # [0.000..3.900]
score BAD_CREDIT 3.100           # [0.000..3.100]
score BAD_ENC_HEADER 0.000       # [0.000..3.500]
score BANG_GUAR 0.000            # [0.000..2.700]
score BILLION_DOLLARS 2.700      # [0.000..2.700]
score BODY_ENHANCEMENT 0.000     # [0.000..3.300]
score BODY_ENHANCEMENT2 0.000    # [0.000..3.100]
score CUM_SHOT 0.000             # [0.000..2.800]
score DATE_IN_FUTURE_03_06 0.000 # [0.000..3.300]
score DATE_IN_FUTURE_06_12 3.100 # [0.000..3.100]
score DATE_IN_FUTURE_12_24 3.300 # [0.000..3.300]
score DATE_IN_FUTURE_24_48 0.000 # [0.000..3.500]
score DATE_IN_FUTURE_48_96 3.300 # [0.000..3.300]
score DATE_IN_FUTURE_96_XX 3.900 # [0.000..3.900]
score DATE_IN_PAST_03_06 0.000   # [0.000..2.500]
score DATE_IN_PAST_06_12 0.000   # [0.000..2.700]
score DATE_IN_PAST_12_24 0.000   # [0.000..2.500]

maybe the set 1 ruleset really is only capable of hitting 73% of spam, but I doubt it, to be honest (esp. since I've been dogfooding set 1 on my server for a while). I don't know what's going on here -- it may be time to start debugging the perceptron. Has anyone seen Henry recently? ;)
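(quantifying how zero-heavy one of those generated "scores" files is takes one line; an illustrative check, assuming the "score NAME n.nnn # [lo..hi]" line format shown above and at least one score line in the file:)

perl -lne '
    ($n) = /^score\s+\S+\s+(-?[\d.]+)/ or next;   # grab the assigned score
    $t++; $z++ if $n == 0;
    END { printf "%d of %d scores zeroed (%.1f%%)\n", $z||0, $t, 100*($z||0)/$t }
  ' gen-set2-2.0-4.625-100/scores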
aha, I've figured it out. bug in the masses scripts (again)

well, it wasn't just that. still seems broken; a really good rule like RCVD_FORGED_WROTE gets a score of 0. I've fixed the bugs that had marked those rules as immutable with a score of 0, but even so, the perceptron is setting their evolved score to 0...

: jm 135...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/freqs
  9.376  14.6938   0.0000  1.000  1.00  0.00  RCVD_FORGED_WROTE
  9.348  14.6504   0.0000  1.000  1.00  0.00  RCVD_FORGED_WROTE2
  0.000   0.0001   0.0000  1.000  0.51  0.00  T_RCVD_FORGED_WROTE3
: jm 136...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/scores
score RCVD_FORGED_WROTE 0.000 # [0.000..4.162]
score RCVD_FORGED_WROTE2 0.000 # [0.000..4.162]
: exit=0 Mon Feb 12 19:07:29 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 137...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/test
: exit=1 Mon Feb 12 19:07:35 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 138...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/log
: exit=1 Mon Feb 12 19:07:38 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 139...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/make.output
rule T_RCVD_FORGED_WROTE3 no longer exists; ignoring
: exit=0 Mon Feb 12 19:07:43 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 140...; grep RCVD_FORGED_WROTE tmp/*
tmp/ranges.data:0.0 4.5 1 RCVD_FORGED_WROTE2
tmp/ranges.data:0.0 4.5 1 RCVD_FORGED_WROTE
tmp/rules.pl: 'RCVD_FORGED_WROTE2' => {
tmp/rules.pl: 'RCVD_FORGED_WROTE' => {
tmp/rules.pl: 'RCVD_FORGED_WROTE2' => '0',
tmp/rules.pl: 'RCVD_FORGED_WROTE' => '0',
tmp/scores.data:nRCVD_FORGED_WROTE
tmp/scores.data:nRCVD_FORGED_WROTE2

I checked tmp/tests.data, and the test number for RCVD_FORGED_WROTE really does show up in mails in the perceptron input format, too. looks like a perceptron bug.

Are there false negatives with RCVD_FORGED_WROTE in them? I suppose it's plausible (though unlikely) that it gets a score of 0 if it's simply unnecessary for classification.

fwiw, I think I still had some stuff cached; I svn reverted ../rules/50_scores.cf, ran "rm -rf tmp gen-cache", and re-ran "bash ./runGA", and the results (for scoreset 1, just checked in) were a lot better. now for set 0 and set 2...

hmm. still having problems; different ones, this time. Now I'm running into what looks like a perceptron bug for sure. it appears to be giving rules either a score of 0, or the max score allowed for that rule's range...

# Correctly non-spam: 539380  99.72%
# Correctly spam:     657847  68.99%
# False positives:      1523   0.28%
# False negatives:    295698  31.01%

score ACT_NOW_CAPS 0.000         # [0.000..2.160]
score ADVANCE_FEE_2 0.000        # [0.000..2.240]
score ADVANCE_FEE_3 2.960        # [0.000..2.960]
score ADVANCE_FEE_4 3.040        # [0.000..3.040]
score APOSTROPHE_FROM 2.480      # [0.000..2.480]
score AXB_XMID_1212 0.000        # [0.000..3.120]
score AXB_XMID_1510 3.440        # [0.000..3.440]
score AXB_XMID_OEGOESNULL 0.000  # [0.000..3.440]
score AXB_XR_STULDAP 0.000       # [0.000..2.560]
score BAD_CREDIT 0.000           # [0.000..2.480]
score BAD_ENC_HEADER 2.800       # [0.000..2.800]

this happens for both set0 and set2. the FNs listed in false_negatives seem like they should be classifiable just fine -- most already look like they'd have scored >=5 with the SVN scoreset.

oops, didn't mean to do that

so I went back and forth with Henry about this; he suggested tweaking the -l parameter to 0.00002 and lower, but that didn't really help. however, I edited rules/50_scores.cf and changed all the "0" scores in the main gen:mutable section to be "1" -- and lo and behold, that seems to help!

there must be some kind of dependency going on there -- very odd. still investigating...
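(the freqs lines quoted above make this symptom easy to spot mechanically: rules with a perfect spam/ham split coming out with a zero score. a throwaway Perl check along these lines, assuming the seven-column freqs layout shown above -- overall%, spam%, ham%, s/o ratio, rank, score, name:)

#!/usr/bin/perl -w
# flag rules that look strong in a freqs file but got a zero evolved score
# usage: perl check-zeroed-freqs.pl gen-set2-2.0-4.625-100/freqs
use strict;

while (<>) {
    next unless /^\s*[\d.]/;    # skip headers and other non-data lines
    my ($overall, $spam, $ham, $so, $rank, $score, $name) = split;
    next unless defined $name;
    # high spam hitrate, spam-only s/o ratio, yet zero score assigned
    if ($spam > 1.0 && $so >= 0.99 && $score == 0) {
        print "suspicious: $name (spam%=$spam s/o=$so score=$score)\n";
    }
}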
BTW I'm pretty sure the set 0 FP/FN rates are going to be atrocious. current tests look like a 30% FN rate there. (sets 1, 2 and 3 are much more decent, though)

It would be interesting to know if this 30% FN rate corresponds at all closely to image spam. I haven't been running without net tests in a while, but I could well believe that without at least FuzzyOCR and the SARE stock rules, 30% would not be unreasonable.

btw, it's definitely not that the results are poor for scoreset 2 in general. for example, the "pre" result looks like this:

SUMMARY for threshold 5.0:
Correctly non-spam:  538869  99.62%
Correctly spam:      898436  94.22%
False positives:       2034   0.38%
False negatives:      55109   5.78%
TCR(l=50): 6.080933  SpamRecall: 94.221%  SpamPrec: 99.774%

yet the "post" result is *worse*:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67241  99.54%
# Correctly spam:     109483  91.94%
# False positives:       309   0.46%
# False negatives:      9600   8.06%
# TCR(l=50): 4.753812  SpamRecall: 91.938%  SpamPrec: 99.719%

that's actually about the best result I've been able to get for set2 with the perceptron. It would be logical to assume that the perceptron could find results similar to the "pre" set, but it's not doing so... (re that 30% -- much of that may still be a result of this bug, btw)

I also wonder if perhaps we need to do some more corpus cleaning, and verify that we're not doing GIGO (garbage in, garbage out).

best results I've gotten out of the perceptron so far for set 2 have been:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67468  99.88%
# Correctly spam:     112527  94.49%
# False positives:        82   0.12%
# False negatives:      6556   5.51%
# TCR(l=50): 11.175206  SpamRecall: 94.495%  SpamPrec: 99.927%

with intermittent cases where it just goes nuts and gives 30% FNs (with most of the scores zeroed). I can't figure out what needs to be done with the parameters to work around it, nor am I keen to sit here trying sets of params in trial-and-error fashion... I'm going to go back to using the GA, and see if I can get better results out of that.

if anyone wants to try fixing the perceptron problems for set0 and set2, the full logs are on the zone at:

-rw-rw-r--  2 jm  other  1446756927 Feb  7 22:20 /export/home/jm/ftp/spamassassin/masses/spam-full.log
-rw-rw-r--  2 jm  other   413512085 Feb  7 22:15 /export/home/jm/ftp/spamassassin/masses/ham-full.log

Theo -- I'm pretty sure it's not a GIGO logs problem, since the results for set 1 and set 3 are quite good, and the freqs look good too. actually, I'll upload the freqs for set3; they're worth checking out.

Created attachment 3864 [details]
freqs
HELO_LOCALHOST, a bit of a surprise winner ;)
yay for the GA! I resurrected craig-evolve.c and ran it -- here are the test results from its run:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67498  99.92%
# Correctly spam:     115160  96.71%
# False positives:        52   0.08%
# False negatives:      3923   3.29%
# TCR(l=50): 18.255864  SpamRecall: 96.706%  SpamPrec: 99.955%

compare with the best results for the perceptron on the same data set, after a *lot* of futzing with settings:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67468  99.88%
# Correctly spam:     112527  94.49%
# False positives:        82   0.12%
# False negatives:      6556   5.51%
# TCR(l=50): 11.175206  SpamRecall: 94.495%  SpamPrec: 99.927%

on the other hand, it took 8 hours to produce the GA results. ;) but still, a hell of a lot better... and it didn't require any tweaking or manual knob-twiddling; it's just fire and forget. I'll use the GA for the other scoreset (0), and maybe try it again on sets 1 and 3 to see if it can beat the perceptron FP%/FN% rates for those too.

for the record, the FP/FN test results from the set 1 perceptron run were:

Correctly non-spam:  539494  99.74%
Correctly spam:      914430  95.90%
False positives:       1409   0.26%
False negatives:      39115   4.10%
TCR(l=50): 8.703007  SpamRecall: 95.898%  SpamPrec: 99.846%

ok, set 0 scores now in:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  66964  99.13%
# Correctly spam:     110426  92.73%
# False positives:       586   0.87%
# False negatives:      8657   7.27%
# TCR(l=50): 3.137313  SpamRecall: 92.730%  SpamPrec: 99.472%

phew, no 30% FN rate after all. ;) for the record, the gen files are in: gen-set0-2.0-5.0-100-ga

GA results for set 1:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67347  99.70%
# Correctly spam:     114907  96.49%
# False positives:       203   0.30%
# False negatives:      4176   3.51%
# TCR(l=50): 8.312369  SpamRecall: 96.493%  SpamPrec: 99.824%

gen-set1-5.0-5.0-100-ga is the gen dir.

GA results for set 3:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67494  99.92%
# Correctly spam:     117606  98.76%
# False positives:        56   0.08%
# False negatives:      1477   1.24%
# TCR(l=50): 27.842647  SpamRecall: 98.760%  SpamPrec: 99.952%

beats the perceptron's 1.70% FNs nicely ;) gen-set3-5.0-5.0-100-ga is the dir.

(in passing, I used my $400 Dell laptop to produce set 1 last night. it completed the GA run a lot faster than the zone did. The zone doesn't really provide decent CPU power any more.)

so, that's the lot (finally!). to summarise, the results on the test sets are:

set 0
# Correctly non-spam:  66964  99.13%
# Correctly spam:     110426  92.73%
# False positives:       586   0.87%
# False negatives:      8657   7.27%
# TCR(l=50): 3.137313  SpamRecall: 92.730%  SpamPrec: 99.472%

set 1
# Correctly non-spam:  67347  99.70%
# Correctly spam:     114907  96.49%
# False positives:       203   0.30%
# False negatives:      4176   3.51%
# TCR(l=50): 8.312369  SpamRecall: 96.493%  SpamPrec: 99.824%

set 2
# Correctly non-spam:  67498  99.92%
# Correctly spam:     115160  96.71%
# False positives:        52   0.08%
# False negatives:      3923   3.29%
# TCR(l=50): 18.255864  SpamRecall: 96.706%  SpamPrec: 99.955%

set 3
# Correctly non-spam:  67494  99.92%
# Correctly spam:     117606  98.76%
# False positives:        56   0.08%
# False negatives:      1477   1.24%
# TCR(l=50): 27.842647  SpamRecall: 98.760%  SpamPrec: 99.952%

please take a look at 50_scores.cf and see if you can spot any issues. The one thing I can see is that we now have lint failures in trunk, because there are scores in 50_scores.cf for rules from rulesrc. I'm not sure how to solve that...
either:

- (a) stop "score set for unknown rule name" being a lint error that is warned about, or
- (b) go through the rulesrc tree, find the rules that were in the active list (and which therefore now have scores), and mark them with "tflags publish" so they are always published to the active ruleset.

I'm leaning towards (b).

(In reply to comment #20)
> please take a look at 50_scores.cf and see if you can spot any
> issues.

Regarding the RCVD_IN_DNSWL_* rules (disclaimer: I'm involved with the dnswl.org project):

1) RCVD_IN_DNSWL_HI does not have a score. Could it be because it only has very few samples in my corpus, and most likely not very many in other people's corpora? Would it make sense to add a score for _HI?

2) RCVD_IN_DNSWL_LOW received a better score (-0.698) than RCVD_IN_DNSWL_MED (-0.498). This seems counter-intuitive, since *_LOW contains servers which may also emit a certain amount of spam. Would it make sense to leave the scores at -1, -4 and -8 as Theo had them in his sandbox?
http://svn.apache.org/viewvc/spamassassin/rules/trunk/sandbox/felicity/70_dnswl.cf?view=markup

(As a reminder: LOW, MED and HI indicate the trustworthiness assigned to a particular source.)

Created attachment 3868 [details]
new scores

Here's a copy of the new scores (easier to look at as an attachment!)

the logs used are now archived at:

-rw-rw-r--  1 jm  other  105312770 Feb 18 11:08 /home/corpus-rsync/ARCHIVE/3.2.0/rescore-logs-bug5270.tgz

on the zone.

> Regarding the RCVD_IN_DNSWL_* rules (disclaimer: I'm involved with the
> dnswl.org project):
>
> 1) RCVD_IN_DNSWL_HI does not have a score. Could it be because it only has
> very few samples in my corpus and most likely not very many in other
> peoples' corpora? Would it make sense to add a score for _HI?

Hmm, interesting! I think it's using a score of -1.0 (the default for a "nice" rule). Because there are so few hits, the scoring system didn't feel confident enough to assign a score, I think.

> 2) RCVD_IN_DNSWL_LOW received a better score (-0.698) than RCVD_IN_DNSWL_MED
> (-0.498). This seems counter-intuitive, since *_LOW contains servers which
> may also emit a certain amount of spam. Would it make sense to leave the
> scores at -1, -4 and -8 as Theo had them in his sandbox?
> http://svn.apache.org/viewvc/spamassassin/rules/trunk/sandbox/felicity/70_dnswl.cf?view=markup

Yes, it may be worth manually resetting these -- right now, they are not really going to have any effect.
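(for reference, resetting them manually would just be the usual score lines in rules/50_scores.cf -- a sketch of the proposed change, with the values taken from Theo's sandbox file linked above, not a committed diff:)

score RCVD_IN_DNSWL_LOW -1
score RCVD_IN_DNSWL_MED -4
score RCVD_IN_DNSWL_HI  -8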
by the way, similarly, it looks like the IADB rules have also been effectively zeroed, mainly due to their scores being manually set in rulesrc/sandbox/felicity/70_iadb.cf (I failed to spot them!):

0.104  0.0008  0.2872  0.003  0.64  -0.00  RCVD_IN_IADB_LISTED
0.090  0.0000  0.2487  0.000  0.63  -0.00  RCVD_IN_IADB_EPIA
0.092  0.0008  0.2531  0.003  0.63  -0.00  RCVD_IN_IADB_SPF
0.091  0.0008  0.2487  0.003  0.63  -0.00  RCVD_IN_IADB_OPTIN_GT50
0.086  0.0000  0.2383  0.000  0.62  -0.00  RCVD_IN_IADB_SENDERID
0.086  0.0000  0.2369  0.000  0.62  -0.00  RCVD_IN_IADB_EDDB
0.011  0.0000  0.0296  0.000  0.52  -0.00  RCVD_IN_IADB_OPTIN
0.006  0.0008  0.0148  0.054  0.51  -0.00  RCVD_IN_IADB_UT_CPR_MAT
0.005  0.0008  0.0133  0.059  0.51  -0.00  RCVD_IN_IADB_MI_CPR_MAT
0.008  0.0017  0.0178  0.086  0.51  -0.00  RCVD_IN_IADB_RDNS
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_OPTIN_LT50
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_DOPTIN_GT50
0.001  0.0000  0.0015  0.000  0.50  -4.00  RCVD_IN_IADB_DOPTIN
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_DOPTIN_LT50
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UT_CPEAR
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UT_CPR_30
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_LOOSE
0.000  0.0000  0.0000  0.500  0.50  -6.00  RCVD_IN_IADB_ML_DOPTIN
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_DK
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_NOCONTROL
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UNVERIFIED_1
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_MI_CPEAR
0.000  0.0000  0.0000  0.500  0.50  -2.20  RCVD_IN_IADB_VOUCHED
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_GOODMAIL
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_MI_CPR_30
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_OPTOUTONLY
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UNVERIFIED_2
0.000  0.0000  0.0000  0.500  0.50  -8.00  RCVD_IN_IADB_OOO

assuming there are no other negative comments, and I get a chance, I'll try fixing those RCVD_IN_DNSWL scores asap to match what Matthias suggested. (If those values affect accuracy rates noticeably, though, I may tweak them to be closer to zero.)

Do we want to assign "real" scores for those IADB rules, or are the scores from Theo's sandbox file ok? (I assume they're OK.)

I will then generate the STATISTICS files and check those in. Finally, I'll add "tflags publish" to the published rules from rulesrc, as noted in comment 20. Once that's done, this bug can be closed... shout now if you disagree about this plan, of course ;)
> Finally, I'll add "tflags publish" to the published rules from rulesrc as noted
> in comment 20.
actually, maybe it makes more sense to just cut those rules out of the
rulesrc/sandbox files, and move them into a new file in rules/ . hmm.... let's
see how it goes.
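(for context, the "tflags publish" approach is a one-line change per rule in the sandbox files; a hypothetical example -- EXAMPLE_RULE is made up for illustration, not a real rule:)

# in rulesrc/sandbox/<dev>/20_example.cf
body     EXAMPLE_RULE  /example pattern/i
describe EXAMPLE_RULE  hypothetical sandbox rule
tflags   EXAMPLE_RULE  publish   # always publish to the active ruleset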
*** Bug 5343 has been marked as a duplicate of this bug. ***

(In reply to comment #23)
> Finally, I'll add "tflags publish" to the published rules from rulesrc as noted
> in comment 20.

ok, I've now done this. It seems to work OK... however, people with sandbox rules that were "good enough" and included in the evolved scoreset now need to be careful to update rules/50_scores.cf if they change/remove those rules!

(I also fixed a couple of minor issues in the imageinfo meta rules; the evolver had disabled some of the rules the metas relied on.)

as follow-up to my duplicate, bug 5343: after updating to r510010 and a make distclean and full rebuild, make test now reports:

All tests successful, 17 tests skipped.
Files=127, Tests=1807, 6406 wallclock secs (3236.97 cusr + 697.55 csys = 3934.52 CPU)

so, for me, that leaves only #5340 ... thanks.

annoyingly, the basic FP/FN rate when I run fp-fn-statistics *now* for set 3 is 0.1% higher in FNs than when I ran it during the score generation :(

need to figure out wtf is up there before I can twiddle the IADB and DNSWL scores.

> annoyingly, the basic FP/FN rate when I run fp-fn-statistics *now* for set 3 is
> 0.1% higher in FNs than when I ran it during the score generation :(
>
> need to figure out wtf is up there before I can twiddle the IADB and DNSWL scores.
aha, I think I have it. there are certain rules like FM_MORTGAGE6PLUS that hit
enough mail in the nightly-mass-check to be promoted, but didn't hit any (recent
enough?) mail in the rescoring check to get into the perceptron input:
masses/gen-set0-2.0-5.0-100-ga/freqs:  0.000  0.0003  0.0000  1.000  0.51  0.00  FM_MORTGAGE6PLUS
masses/gen-set0-2.0-5.0-100-ga/make.output:rule FM_MORTGAGE6PLUS: immutable and zero due to low hitrate
masses/gen-set0-2.0-5.0-100-ga/make.output:ignoring 'FM_MORTGAGE6PLUS': score and range == 0
so this then got ignored by the perceptron. however, the hits are still in the
logs, and the rule is still in 72_active.cf. after the perceptron completes,
rewrite-scores does not add a line to 50_scores.cf for this rule (because it's
not in the scores file output by perceptron).
when fp-fn-statistics is run, later, parse-rules-for-masses is run in turn,
generating a rules.pl file containing:
  'FM_MORTGAGE6PLUS' => {
    'lang'      => '',
    'score'     => '1',
    'describe'  => 'Looks like a mortgage spam (6+)',
    'tflags'    => '',
    'type'      => 'meta',
    'issubrule' => '0',
    'mutable'   => 1,
    'eval'      => '0',
    'depends'   => [ '__FM_MORTGAGE6PLUS' ],
    'code'      => '(__FM_MORTGAGE6PLUS)'
  },
note: a score of 1! this is because the rule exists in 72_active.cf, but has no
score in 50_scores.cf. fp-fn-statistics then uses that to compute its accuracy
rates.
this then accounts for the difference, I'd say; a few rules like that with
0.0003% hitrates, and scores changing from 0.0 to 1.0, could add up to ~0.1% FN
improvement and ~0.01% additional FPs.
to fix: we need to indicate that these rules were immutable and zeroed, so that
rewrite-scores will add a score of 0 for them to 50_scores.cf after the
perceptron is run.
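(a sketch of the shape of that fix -- illustrative, not the actual rewrite-scores patch: collect the immutable-and-zeroed rule names from make.output, and emit explicit zero score lines for them:)

#!/usr/bin/perl -w
# gather rules that score-range generation zeroed for low hitrate, so they
# can be written to 50_scores.cf as explicit zeroes rather than omitted
# (where parse-rules-for-masses would later default them to a score of 1)
use strict;

my %zeroed;
open my $mk, '<', 'make.output' or die "cannot open make.output: $!";
while (<$mk>) {
    $zeroed{$1} = 1 if /^rule (\S+): immutable and zero due to low hitrate/;
}
close $mk;

print "score $_ 0 # immutable, zeroed due to low hitrate\n"
    for sort keys %zeroed;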
another possible issue: was ALL_TRUSTED supposed to be mutable?

ifplugin Mail::SpamAssassin::Plugin::RelayEval
# <gen:mutable>
score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
...
# </gen:mutable>

# Informational rules about Received header parsing
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001

I'd consider it a whitelisting rule, based on user configuration -- therefore nonmutable.

false alarm; looks like it's always been mutable! (not that I'm sure that's a good idea, but it'd be a separate issue. ;)

I think I've fixed the "low-scoring rule gets 1.0 default score" bug now. Also note a new issue from bug 5110 -- EXTRA_MPART_TYPE was given too high a score; locking that lower. Now to rebuild the FP-FN-stats and STATISTICS files using those scores...

ok; EXTRA_MPART_TYPE set to 1.0, IADB rules fixed (the important ones at least) to the rulesrc scores, and DNSWL scores reinstated. Effects: in all scoresets, the FP% drops and the FN% goes up a tiny amount; in all cases, though, the TCR went up, so I think it's worth it. Here are the final results...

set 0
# Correctly non-spam:  67074  99.30%
# Correctly spam:     109992  92.37%
# False positives:       476   0.70%
# False negatives:      9091   7.63%
# TCR(l=50): 3.620534  SpamRecall: 92.366%  SpamPrec: 99.569%

set 1
# Correctly non-spam:  67391  99.76%
# Correctly spam:     114172  95.88%
# False positives:       159   0.24%
# False negatives:      4911   4.12%
# TCR(l=50): 9.259233  SpamRecall: 95.876%  SpamPrec: 99.861%

set 2
# Correctly non-spam:  67507  99.94%
# Correctly spam:     114861  96.45%
# False positives:        43   0.06%
# False negatives:      4222   3.55%
# TCR(l=50): 18.688481  SpamRecall: 96.455%  SpamPrec: 99.963%

set 3
# Correctly non-spam:  67503  99.93%
# Correctly spam:     117457  98.63%
# False positives:        47   0.07%
# False negatives:      1626   1.37%
# TCR(l=50): 29.950453  SpamRecall: 98.635%  SpamPrec: 99.960%

ok, I may need to get the brown bag out... as noted in bug 5285, we needed to reuse the T_RCVD_IN_PBL_WITH_NJABL_DUL hits as RCVD_IN_PBL hits for the set 1 and set 3 GA runs. I forgot to do this :( As a result, RCVD_IN_PBL was given a score of basically zero.

I'm rerunning the GA for set 1 and set 3 now, with s/T_RCVD_IN_PBL_WITH_NJABL_DUL/RCVD_IN_PBL/g . It looks likely to be a little bit ahead of the current set1 / set3 results... let's see. It'll certainly have a more useful score for RCVD_IN_PBL (although probably less than 1.0).

after re-running set 1 and set 3 with PBL working this time, I get:

set 1:
# Correctly non-spam:  67386  99.76%
# Correctly spam:     114216  95.91%
# False positives:       164   0.24%
# False negatives:      4867   4.09%
# TCR(l=50): 9.113262  SpamRecall: 95.913%  SpamPrec: 99.857%

set 3:
# Correctly non-spam:  67508  99.94%
# Correctly spam:     117293  98.50%
# False positives:        42   0.06%
# False negatives:      1790   1.50%
# TCR(l=50): 30.612596  SpamRecall: 98.497%  SpamPrec: 99.964%

gen-set1-5.0-5.0-100-pblfix and gen-set3-5.0-5.0-100-pblfix are the 2 gen dirs used. PBL scores come out as:

score RCVD_IN_PBL 0 0.509 0 0.905 # n=0 n=2

and I've verified that EXTRA_MPART_TYPE, FM_MORTGAGE6PLUS, and the DNSWL and IADB rules are all working correctly as before. (also, updated /home/corpus-rsync/ARCHIVE/3.2.0/rescore-logs-bug5270.tgz .)

FWIW, ALL_TRUSTED is in a <gen:mutable> section, but is immutable by score-ranges-for-freqs due to "tflags userconf".
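(one note on the four-value score line above: the four values apply to scoresets 0-3 respectively -- 0 = no net tests and no Bayes, 1 = net tests, 2 = Bayes, 3 = net tests plus Bayes -- so

score RCVD_IN_PBL 0 0.509 0 0.905

gives RCVD_IN_PBL a nonzero score only in the two network-enabled scoresets, 1 and 3, which is what you'd expect for a DNSBL rule that can't fire without network tests.)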