SA Bugzilla – Full Text Bug Listing

Summary:          3.2.0 rescoring
Product:          Spamassassin
Component:        Score Generation
Reporter:         Justin Mason <jm>
Assignee:         SpamAssassin Developer Mailing List <dev>
Status:           RESOLVED FIXED
Severity:         blocker
Priority:         P5
CC:               schneecrash+apache
Version:          SVN Trunk (Latest Devel Version)
Target Milestone: 3.2.0
Hardware:         Other
OS:               other
Bug Depends on:   4687, 5271, 5284, 5285
Bug Blocks:       5257, 5110
Attachments:      freqs; new scores
Description -- Justin Mason, 2007-01-03 05:42:59 UTC
just added several bugs blocking this one; they're the bugs in the 3.2.0 queue relating to rules which may affect the results of mass-checks.

I just ran a quick experiment on the zone over the past day, to see what perceptron tweaks work well on a 10% slice of last week's set1 logs, by searching the HAM_PREFERENCE space between [1.0 .. 30.0] in 0.5 increments, and THRESHOLD between [3.0 .. 10.0] in 0.25 increments (using an efficient tessellating algorithm, of course), with the "validate-model" stuff as described on http://wiki.apache.org/spamassassin/RunningPerceptron .

http://taint.org/x/2007/roc-test-set1.png is a ROC graph of the results. (I haven't multiplied the values by 100 to percentify them, so 1.0 == 100%, 0.1 == 10%, 0.01 == 1%, 0.001 == 0.1%, you get the idea.)

http://taint.org/x/2007/roc-test-set1.txt is the raw data for that ROC graph, sorted, in space-separated FP%, FN%, vm-name format. (ignore the "set3" typo; these are all "set1" logs really, since there are no Bayes results.)

interesting to note:

- the perceptron generally conforms nicely to a neat ROC curve, except for a "mirror" curve of occasional way-off results: those are the results where HAM_PREFERENCE==1.0. so we can discard that!

- the "sweet spot", IMO, is around 0.45% FPs, 3.9% FNs, which is vm-set3-8.25-5.1875-100 -- in other words, HAM_PREFERENCE=8.25 THRESHOLD=5.1875. I'll try a runGA with that.

- here's a sample scores file from that vm, http://taint.org/x/2007/roc-test-set1-scores.txt , if you're curious.

Henry, have I gone a bit overboard here? ;) what else should I be trying?

ok, scores for scoreset 3 are checked in... they seem pretty good:

gen-set3-2.0-5.0-100/test --
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67518  99.90%
# Correctly spam:     116723  98.30%
# False positives:        68   0.10%
# False negatives:      2015   1.70%
# TCR(l=50): 21.927608  SpamRecall: 98.303%  SpamPrec: 99.942%
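(a quick note on the metrics in these SUMMARY blocks, since they recur throughout this bug: TCR(l=50) is the standard "total cost ratio", with a false positive costed at 50 times a false negative, and all three summary figures can be recomputed from the raw counts. a minimal Perl sketch, illustrative only -- this is not one of the masses scripts:)

#!/usr/bin/perl -w
# recompute the SUMMARY metrics from the raw mass-check counts
use strict;

# the set 3 figures from the SUMMARY block above
my ($ok_ham, $ok_spam, $fp, $fn) = (67518, 116723, 68, 2015);
my $lambda = 50;                 # the "l=50" in TCR(l=50)

my $nspam  = $ok_spam + $fn;     # total spam in the test set
my $tcr    = $nspam / ($lambda * $fp + $fn);
my $recall = 100 * $ok_spam / $nspam;
my $prec   = 100 * $ok_spam / ($ok_spam + $fp);

printf "TCR(l=%d): %f SpamRecall: %.3f%% SpamPrec: %.3f%%\n",
       $lambda, $tcr, $recall, $prec;
# prints: TCR(l=50): 21.927608 SpamRecall: 98.303% SpamPrec: 99.942%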
However the perceptron has gone pretty haywire for sets 0, 1, and 2, producing seriously crappy results, e.g.

gen-set2-2.0-4.625-100/test --
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67919  99.75%
# Correctly spam:      70874  59.48%
# False positives:       167   0.25%
# False negatives:     48282  40.52%
# TCR(l=50): 2.104040  SpamRecall: 59.480%  SpamPrec: 99.765%

gen-set0-2.0-4.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67479  99.79%
# Correctly spam:      17951  15.10%
# False positives:       145   0.21%
# False negatives:    100935  84.90%
# TCR(l=50): 1.098914  SpamRecall: 15.099%  SpamPrec: 99.199%

gen-set0-2.0-5.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67057  99.37%
# Correctly spam:      42186  35.30%
# False positives:       426   0.63%
# False negatives:     77323  64.70%
# TCR(l=50): 1.211776  SpamRecall: 35.299%  SpamPrec: 99.000%

gen-set1-2.0-4.7-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67146  99.40%
# Correctly spam:      86230  72.41%
# False positives:       404   0.60%
# False negatives:     32853  27.59%
# TCR(l=50): 2.244604  SpamRecall: 72.412%  SpamPrec: 99.534%

gen-set1-2.0-6.0-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  66978  99.15%
# Correctly spam:      89378  75.06%
# False positives:       572   0.85%
# False negatives:     29705  24.94%
# TCR(l=50): 2.042415  SpamRecall: 75.055%  SpamPrec: 99.364%

gen-set1-2.0-7.0-300/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  66334  98.20%
# Correctly spam:      97985  82.28%
# False positives:      1216   1.80%
# False negatives:     21098  17.72%
# TCR(l=50): 1.454040  SpamRecall: 82.283%  SpamPrec: 98.774%

gen-set1-3.0-5.0-300/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67137  99.39%
# Correctly spam:      81410  68.36%
# False positives:       413   0.61%
# False negatives:     37673  31.64%
# TCR(l=50): 2.041785  SpamRecall: 68.364%  SpamPrec: 99.495%

gen-set2-2.0-4.625-100/test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  67919  99.75%
# Correctly spam:      70874  59.48%
# False positives:       167   0.25%
# False negatives:     48282  40.52%
# TCR(l=50): 2.104040  SpamRecall: 59.480%  SpamPrec: 99.765%

by comparison, the existing (3.1.0) scores produce these results on the test set:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67511  99.94%
# Correctly spam:      87965  73.87%
# False positives:        39   0.06%
# False negatives:     31118  26.13%
# TCR(l=50): 3.601155  SpamRecall: 73.869%  SpamPrec: 99.956%

The "scores" files are all very obviously iffy, full of zeroed scores, e.g.

score ACT_NOW_CAPS 2.700         # [0.000..2.700]
score ADVANCE_FEE_2 0.000        # [0.000..2.700]
score ADVANCE_FEE_3 0.000        # [0.000..3.600]
score ADVANCE_FEE_4 3.900        # [0.000..3.900]
score BAD_CREDIT 3.100           # [0.000..3.100]
score BAD_ENC_HEADER 0.000       # [0.000..3.500]
score BANG_GUAR 0.000            # [0.000..2.700]
score BILLION_DOLLARS 2.700      # [0.000..2.700]
score BODY_ENHANCEMENT 0.000     # [0.000..3.300]
score BODY_ENHANCEMENT2 0.000    # [0.000..3.100]
score CUM_SHOT 0.000             # [0.000..2.800]
score DATE_IN_FUTURE_03_06 0.000 # [0.000..3.300]
score DATE_IN_FUTURE_06_12 3.100 # [0.000..3.100]
score DATE_IN_FUTURE_12_24 3.300 # [0.000..3.300]
score DATE_IN_FUTURE_24_48 0.000 # [0.000..3.500]
score DATE_IN_FUTURE_48_96 3.300 # [0.000..3.300]
score DATE_IN_FUTURE_96_XX 3.900 # [0.000..3.900]
score DATE_IN_PAST_03_06 0.000   # [0.000..2.500]
score DATE_IN_PAST_06_12 0.000   # [0.000..2.700]
score DATE_IN_PAST_12_24 0.000   # [0.000..2.500]

maybe the set 1 ruleset really is only capable of hitting 73% of spam, but I doubt it, to be honest (esp. since I've been dogfooding set 1 on my server for a while). I don't know what's going on here -- it may be time to start debugging the perceptron. Has anyone seen Henry recently? ;)
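(quantifying how zero-heavy one of those generated "scores" files is takes one line; an illustrative check, assuming the "score NAME n.nnn # [lo..hi]" line format shown above and at least one score line in the file:)

perl -lne '
    ($n) = /^score\s+\S+\s+(-?[\d.]+)/ or next;   # grab the assigned score
    $t++; $z++ if $n == 0;
    END { printf "%d of %d scores zeroed (%.1f%%)\n", $z||0, $t, 100*($z||0)/$t }
  ' gen-set2-2.0-4.625-100/scores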
aha, I've figured it out. bug in the masses scripts (again)

well, it wasn't just that. still seems broken; a really good rule like RCVD_FORGED_WROTE gets a score of 0. I've fixed the bugs that had marked those rules as immutable with a score of 0, but even so, the perceptron is setting their evolved score to 0...

: jm 135...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/freqs
  9.376  14.6938   0.0000  1.000  1.00  0.00  RCVD_FORGED_WROTE
  9.348  14.6504   0.0000  1.000  1.00  0.00  RCVD_FORGED_WROTE2
  0.000   0.0001   0.0000  1.000  0.51  0.00  T_RCVD_FORGED_WROTE3
: jm 136...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/scores
score RCVD_FORGED_WROTE 0.000 # [0.000..4.162]
score RCVD_FORGED_WROTE2 0.000 # [0.000..4.162]
: exit=0 Mon Feb 12 19:07:29 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 137...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/test
: exit=1 Mon Feb 12 19:07:35 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 138...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/log
: exit=1 Mon Feb 12 19:07:38 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 139...; grep RCVD_FORGED_WROTE gen-set2-2.0-4.625-100/make.output
rule T_RCVD_FORGED_WROTE3 no longer exists; ignoring
: exit=0 Mon Feb 12 19:07:43 GMT 2007; cd /export/home/jm/ftp/spamassassin/masses
: jm 140...; grep RCVD_FORGED_WROTE tmp/*
tmp/ranges.data:0.0 4.5 1 RCVD_FORGED_WROTE2
tmp/ranges.data:0.0 4.5 1 RCVD_FORGED_WROTE
tmp/rules.pl: 'RCVD_FORGED_WROTE2' => {
tmp/rules.pl: 'RCVD_FORGED_WROTE' => {
tmp/rules.pl: 'RCVD_FORGED_WROTE2' => '0',
tmp/rules.pl: 'RCVD_FORGED_WROTE' => '0',
tmp/scores.data:nRCVD_FORGED_WROTE
tmp/scores.data:nRCVD_FORGED_WROTE2

I checked tmp/tests.data, and the test number for RCVD_FORGED_WROTE really does show up in mails in the perceptron input format, too. looks like a perceptron bug.

Are there false negatives with RCVD_FORGED_WROTE in them? I suppose it's plausible (though unlikely) that it gets a score of 0 if it's simply unnecessary for classification.

fwiw, I think I still had some stuff cached; I svn reverted ../rules/50_scores.cf, ran "rm -rf tmp gen-cache", and re-ran "bash ./runGA", and the results (for scoreset 1, just checked in) were a lot better. now for set 0 and set 2...

hmm. still having problems; different ones, this time. Now I'm running into what looks like a perceptron bug for sure. it appears to be giving rules either a score of 0, or the max score allowed for that rule's range...

# Correctly non-spam: 539380  99.72%
# Correctly spam:     657847  68.99%
# False positives:      1523   0.28%
# False negatives:    295698  31.01%

score ACT_NOW_CAPS 0.000         # [0.000..2.160]
score ADVANCE_FEE_2 0.000        # [0.000..2.240]
score ADVANCE_FEE_3 2.960        # [0.000..2.960]
score ADVANCE_FEE_4 3.040        # [0.000..3.040]
score APOSTROPHE_FROM 2.480      # [0.000..2.480]
score AXB_XMID_1212 0.000        # [0.000..3.120]
score AXB_XMID_1510 3.440        # [0.000..3.440]
score AXB_XMID_OEGOESNULL 0.000  # [0.000..3.440]
score AXB_XR_STULDAP 0.000       # [0.000..2.560]
score BAD_CREDIT 0.000           # [0.000..2.480]
score BAD_ENC_HEADER 2.800       # [0.000..2.800]

this happens for both set0 and set2. the FNs listed in false_negatives seem like they should be classifiable just fine -- most already look like they'd have scored >=5 with the SVN scoreset.

oops, didn't mean to do that

so I went back and forth with Henry about this; he suggested tweaking the -l parameter to 0.00002 and lower, but that didn't really help. however, I edited rules/50_scores.cf and changed all the "0" scores in the main gen:mutable section to be "1" -- and lo and behold, that seems to help!

there must be some kind of dependency going on there -- very odd. still investigating...
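(the freqs lines quoted above make this symptom easy to spot mechanically: rules with a perfect spam/ham split coming out with a zero score. a throwaway Perl check along these lines, assuming the seven-column freqs layout shown above -- overall%, spam%, ham%, s/o ratio, rank, score, name:)

#!/usr/bin/perl -w
# flag rules that look strong in a freqs file but got a zero evolved score
# usage: perl check-zeroed-freqs.pl gen-set2-2.0-4.625-100/freqs
use strict;

while (<>) {
    next unless /^\s*[\d.]/;    # skip headers and other non-data lines
    my ($overall, $spam, $ham, $so, $rank, $score, $name) = split;
    next unless defined $name;
    # high spam hitrate, spam-only s/o ratio, yet zero score assigned
    if ($spam > 1.0 && $so >= 0.99 && $score == 0) {
        print "suspicious: $name (spam%=$spam s/o=$so score=$score)\n";
    }
}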
BTW I'm pretty sure the set 0 FP/FN rates are going to be atrocious. current tests look like a 30% FN rate there. (sets 1, 2 and 3 are much more decent, though)

It would be interesting to know if this 30% FN rate corresponds at all closely to image spam. I haven't been running without net tests in a while, but I could well believe that without at least FuzzyOCR and the SARE stock rules, 30% would not be unreasonable.

btw, it's definitely not that the results are poor for scoreset 2 in general. for example, the "pre" result looks like this:

SUMMARY for threshold 5.0:
Correctly non-spam:  538869  99.62%
Correctly spam:      898436  94.22%
False positives:       2034   0.38%
False negatives:      55109   5.78%
TCR(l=50): 6.080933  SpamRecall: 94.221%  SpamPrec: 99.774%

yet the "post" result is *worse*:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67241  99.54%
# Correctly spam:     109483  91.94%
# False positives:       309   0.46%
# False negatives:      9600   8.06%
# TCR(l=50): 4.753812  SpamRecall: 91.938%  SpamPrec: 99.719%

that's actually about the best result I've been able to get for set2 with the perceptron. It would be logical to assume that the perceptron could find results similar to the "pre" set, but it's not doing so... (re that 30% -- much of that may still be a result of this bug, btw)

I also wonder if perhaps we need to do some more corpus cleaning, and verify that we're not doing GIGO (garbage in, garbage out).

best results I've gotten out of the perceptron so far for set 2 have been:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67468  99.88%
# Correctly spam:     112527  94.49%
# False positives:        82   0.12%
# False negatives:      6556   5.51%
# TCR(l=50): 11.175206  SpamRecall: 94.495%  SpamPrec: 99.927%

with intermittent cases where it just goes nuts and gives 30% FNs (with most of the scores zeroed). I can't figure out what needs to be done with the parameters to work around it, nor am I keen to sit here trying sets of params in trial-and-error fashion... I'm going to go back to using the GA, and see if I can get better results out of that.

if anyone wants to try fixing the perceptron problems for set0 and set2, the full logs are on the zone at:

-rw-rw-r--  2 jm  other  1446756927 Feb  7 22:20 /export/home/jm/ftp/spamassassin/masses/spam-full.log
-rw-rw-r--  2 jm  other   413512085 Feb  7 22:15 /export/home/jm/ftp/spamassassin/masses/ham-full.log

Theo -- I'm pretty sure it's not a GIGO logs problem, since the results for set 1 and set 3 are quite good, and the freqs look good too. actually, I'll upload the freqs for set3; they're worth checking out.

Created attachment 3864 [details]
freqs
HELO_LOCALHOST, a bit of a surprise winner ;)
yay for the GA! I resurrected craig-evolve.c and ran it -- here are the test results from its run:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67498  99.92%
# Correctly spam:     115160  96.71%
# False positives:        52   0.08%
# False negatives:      3923   3.29%
# TCR(l=50): 18.255864  SpamRecall: 96.706%  SpamPrec: 99.955%

compare with the best results for the perceptron on the same data set, after a *lot* of futzing with settings:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67468  99.88%
# Correctly spam:     112527  94.49%
# False positives:        82   0.12%
# False negatives:      6556   5.51%
# TCR(l=50): 11.175206  SpamRecall: 94.495%  SpamPrec: 99.927%

on the other hand, it took 8 hours to produce the GA results. ;) but still, a hell of a lot better... and it didn't require any tweaking or manual knob-twiddling; it's just fire and forget. I'll use the GA for the other scoreset (0), and maybe try it again on sets 1 and 3 to see if it can beat the perceptron FP%/FN% rates for those too.

for the record, the FP/FN test results from the set 1 perceptron run were:

Correctly non-spam:  539494  99.74%
Correctly spam:      914430  95.90%
False positives:       1409   0.26%
False negatives:      39115   4.10%
TCR(l=50): 8.703007  SpamRecall: 95.898%  SpamPrec: 99.846%

ok, set 0 scores now in:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  66964  99.13%
# Correctly spam:     110426  92.73%
# False positives:       586   0.87%
# False negatives:      8657   7.27%
# TCR(l=50): 3.137313  SpamRecall: 92.730%  SpamPrec: 99.472%

phew, no 30% FN rate after all. ;) for the record, the gen files are in: gen-set0-2.0-5.0-100-ga

GA results for set 1:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67347  99.70%
# Correctly spam:     114907  96.49%
# False positives:       203   0.30%
# False negatives:      4176   3.51%
# TCR(l=50): 8.312369  SpamRecall: 96.493%  SpamPrec: 99.824%

gen-set1-5.0-5.0-100-ga is the gen dir.

GA results for set 3:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  67494  99.92%
# Correctly spam:     117606  98.76%
# False positives:        56   0.08%
# False negatives:      1477   1.24%
# TCR(l=50): 27.842647  SpamRecall: 98.760%  SpamPrec: 99.952%

beats the perceptron's 1.70% FNs nicely ;) gen-set3-5.0-5.0-100-ga is the dir.

(in passing, I used my $400 Dell laptop to produce set 1 last night. it completed the GA run a lot faster than the zone did. The zone doesn't really provide decent CPU power any more.)

so, that's the lot (finally!). to summarise, the results on the test sets are:

set 0
# Correctly non-spam:  66964  99.13%
# Correctly spam:     110426  92.73%
# False positives:       586   0.87%
# False negatives:      8657   7.27%
# TCR(l=50): 3.137313  SpamRecall: 92.730%  SpamPrec: 99.472%

set 1
# Correctly non-spam:  67347  99.70%
# Correctly spam:     114907  96.49%
# False positives:       203   0.30%
# False negatives:      4176   3.51%
# TCR(l=50): 8.312369  SpamRecall: 96.493%  SpamPrec: 99.824%

set 2
# Correctly non-spam:  67498  99.92%
# Correctly spam:     115160  96.71%
# False positives:        52   0.08%
# False negatives:      3923   3.29%
# TCR(l=50): 18.255864  SpamRecall: 96.706%  SpamPrec: 99.955%

set 3
# Correctly non-spam:  67494  99.92%
# Correctly spam:     117606  98.76%
# False positives:        56   0.08%
# False negatives:      1477   1.24%
# TCR(l=50): 27.842647  SpamRecall: 98.760%  SpamPrec: 99.952%

please take a look at 50_scores.cf and see if you can spot any issues. The one thing I can see is that we now have lint failures in trunk, because there are scores in 50_scores.cf for rules from rulesrc. I'm not sure how to solve that...
either:

- (a) stop "score set for unknown rule name" being a lint error that is warned about, or
- (b) go through the rulesrc tree, find the rules that were in the active list (and which therefore now have scores), and mark them with "tflags publish" so they are always published to the active ruleset.

I'm leaning towards (b).

(In reply to comment #20)
> please take a look at 50_scores.cf and see if you can spot any
> issues.

Regarding the RCVD_IN_DNSWL_* rules (disclaimer: I'm involved with the dnswl.org project):

1) RCVD_IN_DNSWL_HI does not have a score. Could it be because it only has very few samples in my corpus, and most likely not very many in other people's corpora? Would it make sense to add a score for _HI?

2) RCVD_IN_DNSWL_LOW received a better score (-0.698) than RCVD_IN_DNSWL_MED (-0.498). This seems counter-intuitive, since *_LOW contains servers which may also emit a certain amount of spam. Would it make sense to leave the scores at -1, -4 and -8 as Theo had them in his sandbox?
http://svn.apache.org/viewvc/spamassassin/rules/trunk/sandbox/felicity/70_dnswl.cf?view=markup

(As a reminder: LOW, MED and HI indicate the trustworthiness assigned to a particular source.)

Created attachment 3868 [details]
new scores

Here's a copy of the new scores (easier to look at as an attachment!)

the logs used are now archived at:

-rw-rw-r--  1 jm  other  105312770 Feb 18 11:08 /home/corpus-rsync/ARCHIVE/3.2.0/rescore-logs-bug5270.tgz

on the zone.

> Regarding the RCVD_IN_DNSWL_* rules (disclaimer: I'm involved with the
> dnswl.org project):
>
> 1) RCVD_IN_DNSWL_HI does not have a score. Could it be because it only has
> very few samples in my corpus and most likely not very many in other
> peoples' corpora? Would it make sense to add a score for _HI?

Hmm, interesting! I think it's using a score of -1.0 (the default for a "nice" rule). Because there are so few hits, the scoring system didn't feel confident enough to assign a score, I think.

> 2) RCVD_IN_DNSWL_LOW received a better score (-0.698) than RCVD_IN_DNSWL_MED
> (-0.498). This seems counter-intuitive, since *_LOW contains servers which
> may also emit a certain amount of spam. Would it make sense to leave the
> scores at -1, -4 and -8 as Theo had them in his sandbox?
> http://svn.apache.org/viewvc/spamassassin/rules/trunk/sandbox/felicity/70_dnswl.cf?view=markup

Yes, it may be worth manually resetting these -- right now, they are not really going to have any effect.
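(for reference, resetting them manually would just be the usual score lines in rules/50_scores.cf -- a sketch of the proposed change, with the values taken from Theo's sandbox file linked above, not a committed diff:)

score RCVD_IN_DNSWL_LOW -1
score RCVD_IN_DNSWL_MED -4
score RCVD_IN_DNSWL_HI  -8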
by the way, similarly, it looks like the IADB rules have also been effectively zeroed, mainly due to their scores being manually set in rulesrc/sandbox/felicity/70_iadb.cf (I failed to spot them!):

0.104  0.0008  0.2872  0.003  0.64  -0.00  RCVD_IN_IADB_LISTED
0.090  0.0000  0.2487  0.000  0.63  -0.00  RCVD_IN_IADB_EPIA
0.092  0.0008  0.2531  0.003  0.63  -0.00  RCVD_IN_IADB_SPF
0.091  0.0008  0.2487  0.003  0.63  -0.00  RCVD_IN_IADB_OPTIN_GT50
0.086  0.0000  0.2383  0.000  0.62  -0.00  RCVD_IN_IADB_SENDERID
0.086  0.0000  0.2369  0.000  0.62  -0.00  RCVD_IN_IADB_EDDB
0.011  0.0000  0.0296  0.000  0.52  -0.00  RCVD_IN_IADB_OPTIN
0.006  0.0008  0.0148  0.054  0.51  -0.00  RCVD_IN_IADB_UT_CPR_MAT
0.005  0.0008  0.0133  0.059  0.51  -0.00  RCVD_IN_IADB_MI_CPR_MAT
0.008  0.0017  0.0178  0.086  0.51  -0.00  RCVD_IN_IADB_RDNS
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_OPTIN_LT50
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_DOPTIN_GT50
0.001  0.0000  0.0015  0.000  0.50  -4.00  RCVD_IN_IADB_DOPTIN
0.001  0.0000  0.0015  0.000  0.50  -0.00  RCVD_IN_IADB_DOPTIN_LT50
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UT_CPEAR
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UT_CPR_30
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_LOOSE
0.000  0.0000  0.0000  0.500  0.50  -6.00  RCVD_IN_IADB_ML_DOPTIN
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_DK
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_NOCONTROL
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UNVERIFIED_1
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_MI_CPEAR
0.000  0.0000  0.0000  0.500  0.50  -2.20  RCVD_IN_IADB_VOUCHED
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_GOODMAIL
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_MI_CPR_30
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_OPTOUTONLY
0.000  0.0000  0.0000  0.500  0.50  -0.00  RCVD_IN_IADB_UNVERIFIED_2
0.000  0.0000  0.0000  0.500  0.50  -8.00  RCVD_IN_IADB_OOO

assuming there are no other negative comments, and I get a chance, I'll try fixing those RCVD_IN_DNSWL scores asap to match what Matthias suggested. (If those values affect accuracy rates noticeably, though, I may tweak them to be closer to zero.)

Do we want to assign "real" scores for those IADB rules, or are the scores from Theo's sandbox file ok? (I assume they're OK.)

I will then generate the STATISTICS files and check those in. Finally, I'll add "tflags publish" to the published rules from rulesrc, as noted in comment 20. Once that's done, this bug can be closed... shout now if you disagree about this plan, of course ;)
> Finally, I'll add "tflags publish" to the published rules from rulesrc as noted
> in comment 20.
actually, maybe it makes more sense to just cut those rules out of the
rulesrc/sandbox files, and move them into a new file in rules/ . hmm.... let's
see how it goes.
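(for context, the "tflags publish" approach is a one-line change per rule in the sandbox files; a hypothetical example -- EXAMPLE_RULE is made up for illustration, not a real rule:)

# in rulesrc/sandbox/<dev>/20_example.cf
body     EXAMPLE_RULE  /example pattern/i
describe EXAMPLE_RULE  hypothetical sandbox rule
tflags   EXAMPLE_RULE  publish   # always publish to the active ruleset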
*** Bug 5343 has been marked as a duplicate of this bug. ***

(In reply to comment #23)
> Finally, I'll add "tflags publish" to the published rules from rulesrc as noted
> in comment 20.

ok, I've now done this. It seems to work OK... however, people with sandbox rules that were "good enough" and included in the evolved scoreset now need to be careful to update rules/50_scores.cf if they change/remove those rules!

(I also fixed a couple of minor issues in the imageinfo meta rules; the evolver had disabled some of the rules the metas relied on.)

as follow-up to my duplicate, bug 5343: after updating to r510010 and a make distclean and full rebuild, make test now reports:

All tests successful, 17 tests skipped.
Files=127, Tests=1807, 6406 wallclock secs (3236.97 cusr + 697.55 csys = 3934.52 CPU)

so, for me, that leaves only #5340 ... thanks.

annoyingly, the basic FP/FN rate when I run fp-fn-statistics *now* for set 3 is 0.1% higher in FNs than when I ran it during the score generation :(

need to figure out wtf is up there before I can twiddle the IADB and DNSWL scores.

> annoyingly, the basic FP/FN rate when I run fp-fn-statistics *now* for set 3 is
> 0.1% higher in FNs than when I ran it during the score generation :(
>
> need to figure out wtf is up there before I can twiddle the IADB and DNSWL scores.
aha, I think I have it. there are certain rules like FM_MORTGAGE6PLUS that hit
enough mail in the nightly-mass-check to be promoted, but didn't hit any (recent
enough?) mail in the rescoring check to get into the perceptron input:
masses/gen-set0-2.0-5.0-100-ga/freqs:  0.000  0.0003  0.0000  1.000  0.51  0.00  FM_MORTGAGE6PLUS
masses/gen-set0-2.0-5.0-100-ga/make.output:rule FM_MORTGAGE6PLUS: immutable and zero due to low hitrate
masses/gen-set0-2.0-5.0-100-ga/make.output:ignoring 'FM_MORTGAGE6PLUS': score and range == 0
so this then got ignored by the perceptron. however, the hits are still in the
logs, and the rule is still in 72_active.cf. after the perceptron completes,
rewrite-scores does not add a line to 50_scores.cf for this rule (because it's
not in the scores file output by perceptron).
when fp-fn-statistics is run, later, parse-rules-for-masses is run in turn,
generating a rules.pl file containing:
  'FM_MORTGAGE6PLUS' => {
    'lang'      => '',
    'score'     => '1',
    'describe'  => 'Looks like a mortgage spam (6+)',
    'tflags'    => '',
    'type'      => 'meta',
    'issubrule' => '0',
    'mutable'   => 1,
    'eval'      => '0',
    'depends'   => [ '__FM_MORTGAGE6PLUS' ],
    'code'      => '(__FM_MORTGAGE6PLUS)'
  },
note: a score of 1! this is because the rule exists in 72_active.cf, but has no
score in 50_scores.cf. fp-fn-statistics then uses that to compute its accuracy
rates.
this then accounts for the difference, I'd say; a few rules like that with
0.0003% hitrates, and scores changing from 0.0 to 1.0, could add up to ~0.1% FN
improvement and ~0.01% additional FPs.
to fix: we need to indicate that these rules were immutable and zeroed, so that
rewrite-scores will add a score of 0 for them to 50_scores.cf after the
perceptron is run.
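(a sketch of the shape of that fix -- illustrative, not the actual rewrite-scores patch: collect the immutable-and-zeroed rule names from make.output, and emit explicit zero score lines for them:)

#!/usr/bin/perl -w
# gather rules that score-range generation zeroed for low hitrate, so they
# can be written to 50_scores.cf as explicit zeroes rather than omitted
# (where parse-rules-for-masses would later default them to a score of 1)
use strict;

my %zeroed;
open my $mk, '<', 'make.output' or die "cannot open make.output: $!";
while (<$mk>) {
    $zeroed{$1} = 1 if /^rule (\S+): immutable and zero due to low hitrate/;
}
close $mk;

print "score $_ 0 # immutable, zeroed due to low hitrate\n"
    for sort keys %zeroed;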
another possible issue: was ALL_TRUSTED supposed to be mutable?

ifplugin Mail::SpamAssassin::Plugin::RelayEval
# <gen:mutable>
score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
...
# </gen:mutable>

# Informational rules about Received header parsing
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001

I'd consider it a whitelisting rule, based on user configuration -- therefore nonmutable.

false alarm; looks like it's always been mutable! (not that I'm sure that's a good idea, but it'd be a separate issue. ;)

I think I've fixed the "low-scoring rule gets 1.0 default score" bug now. Also note a new issue from bug 5110 -- EXTRA_MPART_TYPE was given too high a score; locking that lower. Now to rebuild the FP-FN-stats and STATISTICS files using those scores...

ok; EXTRA_MPART_TYPE set to 1.0, IADB rules fixed (the important ones at least) to the rulesrc scores, and DNSWL scores reinstated. Effects: in all scoresets, the FP% drops and the FN% goes up a tiny amount; in all cases, though, the TCR went up, so I think it's worth it. Here are the final results...

set 0
# Correctly non-spam:  67074  99.30%
# Correctly spam:     109992  92.37%
# False positives:       476   0.70%
# False negatives:      9091   7.63%
# TCR(l=50): 3.620534  SpamRecall: 92.366%  SpamPrec: 99.569%

set 1
# Correctly non-spam:  67391  99.76%
# Correctly spam:     114172  95.88%
# False positives:       159   0.24%
# False negatives:      4911   4.12%
# TCR(l=50): 9.259233  SpamRecall: 95.876%  SpamPrec: 99.861%

set 2
# Correctly non-spam:  67507  99.94%
# Correctly spam:     114861  96.45%
# False positives:        43   0.06%
# False negatives:      4222   3.55%
# TCR(l=50): 18.688481  SpamRecall: 96.455%  SpamPrec: 99.963%

set 3
# Correctly non-spam:  67503  99.93%
# Correctly spam:     117457  98.63%
# False positives:        47   0.07%
# False negatives:      1626   1.37%
# TCR(l=50): 29.950453  SpamRecall: 98.635%  SpamPrec: 99.960%

ok, I may need to get the brown bag out... as noted in bug 5285, we needed to reuse the T_RCVD_IN_PBL_WITH_NJABL_DUL hits as RCVD_IN_PBL hits for the set 1 and set 3 GA runs. I forgot to do this :( As a result, RCVD_IN_PBL was given a score of basically zero.

I'm rerunning the GA for set 1 and set 3 now, with s/T_RCVD_IN_PBL_WITH_NJABL_DUL/RCVD_IN_PBL/g . It looks likely to be a little bit ahead of the current set1 / set3 results... let's see. It'll certainly have a more useful score for RCVD_IN_PBL (although probably less than 1.0).

after re-running set 1 and set 3 with PBL working this time, I get:

set 1:
# Correctly non-spam:  67386  99.76%
# Correctly spam:     114216  95.91%
# False positives:       164   0.24%
# False negatives:      4867   4.09%
# TCR(l=50): 9.113262  SpamRecall: 95.913%  SpamPrec: 99.857%

set 3:
# Correctly non-spam:  67508  99.94%
# Correctly spam:     117293  98.50%
# False positives:        42   0.06%
# False negatives:      1790   1.50%
# TCR(l=50): 30.612596  SpamRecall: 98.497%  SpamPrec: 99.964%

gen-set1-5.0-5.0-100-pblfix and gen-set3-5.0-5.0-100-pblfix are the 2 gen dirs used. PBL scores come out as:

score RCVD_IN_PBL 0 0.509 0 0.905 # n=0 n=2

and I've verified that EXTRA_MPART_TYPE, FM_MORTGAGE6PLUS, and the DNSWL and IADB rules are all working correctly as before. (also, updated /home/corpus-rsync/ARCHIVE/3.2.0/rescore-logs-bug5270.tgz .)

FWIW, ALL_TRUSTED is in a <gen:mutable> section, but is immutable by score-ranges-for-freqs due to "tflags userconf".
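(one note on the four-value score line above: the four values apply to scoresets 0-3 respectively -- 0 = no net tests and no Bayes, 1 = net tests, 2 = Bayes, 3 = net tests plus Bayes -- so

score RCVD_IN_PBL 0 0.509 0 0.905

gives RCVD_IN_PBL a nonzero score only in the two network-enabled scoresets, 1 and 3, which is what you'd expect for a DNSBL rule that can't fire without network tests.)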