SA Bugzilla – Bug 5376
RFE: generate a "SpamAssassin Challenge" score-generation test
Last modified: 2022-03-06 16:25:29 UTC
Talking to Henry, he suggested that we should define an RFP for the SpamAssassin score-generation tool (ie. something like the GA or the perceptron). Thinking about this, I thought -- why not go bigger than that?

The Netflix Prize http://www.netflixprize.com/ is a machine-learning challenge which 'seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.' Basically, they provide a training set of data and some basic rules ( http://www.netflixprize.com/rules ). Third parties can then come up with algorithms and implementations based on that data, and provide a set of "scores" which Netflix then runs on a hidden "test set" to estimate their accuracy. Netflix then publishes the results and awards a prize to the best-performing team.

We could do something similar for SpamAssassin... probably not with a prize at the end of it, but still, I'd say there'd be some interest.

(Our rules would include stuff like score ranges: ie. low-hit-rate rules and high-FP rules cannot have scores over N; rules that are forgeable cannot have negative scores; etc.)
ok, I wrote up a spec of what we currently require: http://wiki.apache.org/spamassassin/SpamAssassinChallenge -- does this look like a reasonable idea? Duncan?
Should there be something in there about code that can be used without any sort of patent issues and available under the Apache License?
I think our current "generate scores once a millennium" method is not really functional, for several reasons, but mainly that it takes too long between updates and the scores are overly tuned for our messages and not necessarily what other people see (despite having 42 rsync accounts, we have few (8 (but really more like 6.5) at current count) people doing nightly runs, and I'm ~50% of the total results).

Also, the last thing I saw about the perceptron, and why it didn't work for 3.2, was that it needed more diverse data to score with. Perhaps we just need better rules?

Anyway, coming up with a new scoring algorithm isn't bad, I just don't think it's going to make a lot of difference without other changes as well, such as more frequent runs, more corpus runs/results from the masses, and/or a way for individual locations to easily update their own scores.
(In reply to comment #3)
> I think our current "generate scores once a millennium" method is not really
> functional for several reasons, but mainly that it takes too long between
> updates and the scores are overly tuned for our messages and not necessarily
> what other people see (despite having 42 rsync accounts, we have few (8 (but
> really more like 6.5) at current count) people doing nightly runs, and I'm ~50%
> of the total results).
>
> Also, the last thing I saw about the perceptron, and why it didn't work for 3.2,
> was that it needed more diverse data to score with. Perhaps we just need better
> rules?

well, the GA did pretty well, so we came to the conclusion that the data was ok -- just not the kind of data the perceptron's gradient-descent algorithm did well with.

> Anyway, coming up with a new scoring algorithm isn't bad, I just don't think
> it's going to make a lot of difference without other changes as well, such as
> more frequent runs, more corpus runs/results from the masses, and/or a way for
> individual locations to easily update their own scores.

sure. The benefit of this, however, is that we can define the problem and then *other people* will do it for us ;)

Michael, good point.
Unfortunately it's been a while since I've looked at this stuff. (Actually, it's been like 3 months... which is hardly a while, but it's been a busy 3 months...) In no particular order:

- BAYES_* are marked as immutable right now (IIRC). This really limits optimization in score sets 2 and 3.

- Score ranges need to be better defined. (Perhaps require that entries fit in current score ranges?) If we don't clearly define/restrict score range, the best submission will probably be the one with the least restricted scores. Score ranges prevent scores from being over-optimized to our data set. Splitting our data set into training and test sets doesn't really catch this over-optimization, since both are part of our data set that has unique characteristics. (I'm sure there are technical terms for this, I just don't remember what they are...)

- I already have a copy of the test set (if I can find it). Does that make me ineligible? :-)

- By requiring scores in the current format, we are eliminating a whole class of scoring systems. For example, suppose I wanted to try a decision tree system to detect spam based on SpamAssassin rules (this would obviously work very poorly); it would be impossible to convert this into a set of scores.

- The LR experiments Steve and I did relied on a logistic decision rule (i.e. a message is spam if 1 / (1 + exp(-(scores * rule_hits))) > probability threshold). This is easy to convert into traditional SpamAssassin scores using algebra, but other systems may not be.

- If we scrap the requirement for output to be in terms of current SpamAssassin scores, our score-ranges problem becomes more significant -- score ranges don't mean anything if we're not talking about traditional SpamAssassin scores. Ask me if this isn't clear -- it's tricky to explain.

- Our evaluation criteria is currently undefined. We need a clear, single measurement to decide on a winner. (In our research, we used TCR on the test set with lambda = 50 as our "goal" criteria.) Depending on how/if we resolve the previous point, we need to set a threshold value (for example 5.0) as our sole test point.

- Do you think people are actually going to be interested in this enough in order to devote a good chunk of time toward it? I hope so...

Makes me think I should have submitted a talk to ApacheCon... it'd be a great way to kick off this contest.
Oh, also, should we have any requirements on runtime? How automated does the process need to be when it is submitted? Our experiments in LR, for example, contain a fairly time-consuming manual step right now :-) This sort of thing could probably be worked out after we select a system, but guidelines might be good here.
(In reply to comment #5)
> - BAYES_* are marked as immutable right now (IIRC). This really limits
> optimization in score sets 2 and 3.

I think we do want to keep them immutable, though -- when we made them mutable, it caused lots of user confusion and FAQs. It makes the rescoring job a little harder, but dealing with user complaints is a real pain!

> - Score ranges need to be better defined. (Perhaps require that entries fit in
> current score ranges?)

I dunno, I don't think the current ones are all that great anyway ;) I'd bet people could do better. We've certainly tweaked them repeatedly ourselves in the past... ;)

I take your point about wanting to block submissions with totally unrestricted scores though; I tried to do that with what's in the current doc. Can you suggest wording that goes a bit further?

> If we don't clearly define/restrict score range, the best
> submission will probably be the one with the least restricted scores. Score
> ranges prevent scores from being over-optimized to our data set. Splitting our
> data set into training and test sets doesn't really catch this
> over-optimization, since both are part of our data set that has unique
> characteristics. (I'm sure there are technical terms for this, I just don't
> remember what they are...)

Maybe we'd need to use different submitter accounts for the train and test sets? So the test set is made up of mail from an entirely different submitter or set of submitters? We could do that...

> - I already have a copy of the test set (if I can find it). Does that make me
> ineligible? :-)

Just don't look at it ;)

Seriously though, I don't know how we could avoid that problem. By definition, any member of the SpamAssassin PMC can look at the test set if they want, due to how the project governance and oversight works. All we can do is work on trust, I think.

> - By requiring scores in the current format, we are eliminating a whole class of
> scoring systems. For example, suppose I wanted to try a decision tree system to
> detect spam based on SpamAssassin rules (this would obviously work very poorly),
> it would be impossible to convert this into a set of scores.

I think this is acceptable. I don't think we would want to replace the current scoring system with a more complex system built around Bayesian probability combining, decision trees, or anything else, without a *lot* more analysis and discussion, whereas this "Challenge" is more likely to produce something like the perceptron: a drop-in replacement for the offline rescoring component (I hope).

> - Our evaluation criteria is currently undefined. We need a clear, single
> measurement to decide on a winner. (In our research, we used TCR on the test set
> with lambda = 50 as our "goal" criteria.) Depending on how/if we resolve the
> previous point, we need to set a threshold value (for example 5.0) as our sole
> test point.

Well, I had a thought during the 3.2.0 scoring. We actually have two constraints:

- results have to be below a certain FP% rate (around 0.4%) if at all possible (we wound up going over this for set 0, but we had no alternative :( )
- and as low an FN% rate as possible, given the above.

So in other words, that's two constraints: TCR *and* a cut-off point for the FP% rate. Does that sound ok? And for purposes of the challenge -- obviously, it'd need to be better than what we currently get, too. ;)

> - Do you think people are actually going to be interested in this enough in
> order to devote a good chunk of time toward it? I hope so...

It'd make a good project, I think! I've received several emails in the past about the idea, including Gordon's most recently, and it'd be a natural way for machine-learning students to get Google Summer of Code funding.

> Makes me think I should have submitted a talk to ApacheCon... it'd be a great
> way to kick off this contest.

yeah definitely...

> Oh, also, should we have any requirements on runtime? How automated the
> process needs to be when it is submitted. Our experiments in LR, for example,
> contain a fairly time consuming manual step right now :-) This sort of thing
> could probably be worked out after we select a system, but guidelines might
> be good here.

Good point. It'd need to be "fire and forget" automated, as much as the GA is right now. Hand-tweaking is hard work, time-consuming, and error-prone in my experience...
Re: Bayes rules

No, they should not be immutable. If you want, we can require them to be "sane" for some definition of sane. There's no compelling reason for them to be the exact values they are.

Re: Specifying score ranges

This is pretty tricky; I'm not sure what we can do here. The problem is that score ranges are inherently non-mathematical, really more of a shot in the dark, and there's no real way to evaluate them. Having a different corpus form the test set is probably a better real-world test of the algorithm, but it also adds a whole lot of luck (I think). If we were evaluating *algorithms* rather than submitted sets of scores, we could try something like cross-fold validation but split into folds based on corpus, as you suggest. I don't know if it would work in practice (or even in theory, for that matter).

Re: Evaluation

I'm not entirely sure what you're trying to say. Specifying a max FP rate and minimizing FNs given that rate is not the same as optimizing TCR. TCR builds in a relative cost of FPs and FNs (namely lambda), and is probably a simpler criterion. I don't think we'll see really high FPs if we are trying to obtain optimal TCR with lambda = 50. (Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and FN = 0% is equivalent to FP = 0% and FN = 20% (assuming a 50-50 ham/spam mix).)

Re: Fire and forget

In developing these algorithms, it's much easier to get a process down first, then automate it later. If we require a fully automated system before we even see/evaluate the results, we might be wasting effort.
(In reply to comment #8)
> Re: Bayes rules.
> No, they should not be immutable. If you want, we can require them to be "sane"
> for some definition of sane. There's no compelling reason for them to be the
> exact values they are.

Possibly not the exact values they are right now. But I think we'll have to agree to disagree on whether they should be immutable; I really would prefer that they are. Other developers' thoughts would be welcome here...

> Re: Specifying Score ranges
> This is pretty tricky, I'm not sure what we can do here. The problem is that
> score ranges are inherently non-mathematical, really more of a shot in the dark,
> and there's no real way to evaluate them. Having a different corpus form the
> test set is probably a better real-world test of the algorithm, but it also adds
> a whole lot of luck (I think). If we were evaluating *algorithms* rather than
> submitted sets of scores, we could try something like a cross-fold validation
> but split into folds based on corpus as you suggest. I don't know if it would
> work in practice (or even in theory for that matter).

maybe that would be a good additional test step. I agree 10fcv is really the best way to check out an algorithm's workability...

> Re: Evaluation
> I'm not entirely sure what you're trying to say. Specifying a max FP rate and
> minimizing FNs given that rate is not the same as minimizing TCR. TCR builds in
> a relative cost of FPs and FNs (namely lambda), and is probably a simpler
> criteria. I don't think we'll see really high FPs if we are trying to obtain
> optimal TCR with lambda = 50.
>
> (Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and FN =
> 0% is equivalent to FP = 0% and FN = 20% (assuming 50-50 ham/spam mix).)

ah, here's another problem with TCR I'd forgotten about: it varies based on the size of the corpora:

: jm 372...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 1000 -ham 1000
# TCR(l=50): 33.500000
: jm 373...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 2000 -ham 2000
# TCR(l=50): 66.833333
: jm 374...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 3000 -ham 3000
# TCR(l=50): 100.166667

(I can't believe I forgot about that!) I think we should avoid it in general; if we can't compare results because the corpus changes size, that's a bad thing. I know it's nice to have a single figure, but I haven't found a good one yet; nothing that's as comprehensible or easy to use as FP%/FN%.

> Re: Fire and Forget
> In development of these algorithms, it's much easier to get a process down then
> automate it later. If we don't want to see/evaluate the results before requiring
> people to have a fully automated system, we might be wasting effort.

ok, maybe so.
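The corpus-size dependence shown by those runs falls straight out of the TCR definition when the number of misclassifications is held fixed. A minimal Python sketch of the plain textbook formula (note this is *not* the actual masses/fp-fn-to-tcr script, whose output differs slightly from the raw formula):

```python
# Sketch of the TCR corpus-size issue: with the misclassification
# *counts* held fixed, TCR = n_spam / (n_fn + lambda * n_fp) grows
# linearly as the corpus grows, so figures from differently-sized
# corpora are not comparable.

def tcr(n_spam, n_fn, n_fp, lam=50):
    """Total Cost Ratio: cost of handling spam without vs. with a filter."""
    return n_spam / (n_fn + lam * n_fp)

for n in (1000, 2000, 3000):
    # same 5 FNs and 0.5 FPs, bigger corpus -> bigger TCR
    print(n, tcr(n, 5, 0.5))
```

If the FN/FP inputs were instead interpreted as rates (scaled with the corpus), TCR would stay constant, which is the crux of the disagreement in the following comments.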
Re: Bayes immutability

All I'm really trying to say is that during scoring runs, we should be changing the BAYES_* scores. We can manually make them "sane" if necessary, but we need to change the values periodically to reflect the changing importance of the Bayes rules relative to other rules. (In our LR research, we would have given BAYES_99 a score of >6 if we could have, so realistically a score of 4.5 would be fair for BAYES_99. The best way to determine what it should be is using a scoring mechanism.) I can agree to disagree.

Re: TCR

TCR = number of spam / (number of FNs + lambda * number of FPs)

If you remember what TCR represents, it makes sense that TCR depends on the relative ratio of ham/spam in the corpus. (If you don't, see http://wiki.spamassassin.org/TotalCostRatio -- but the wiki page really complicates the calculation.) It's still fine for ranking different algorithms on the same corpus, but if you're upset about it, I propose the following new measurement. Let's call it the Findlay measurement:

F(lambda) = 1 / (FN% + FP% * lambda)

(Actually, as defined above, it's a function.) This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the "undesirable" properties you mentioned. (Ok, don't call it the Findlay measurement... it's a stupid name, and it's a fairly trivial derivation...)
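The balanced-corpus equivalence claimed here is easy to check: with nspam = nham = N and FN, FP as proportions, the counts are FN*N and FP*N, so TCR = N / (FN*N + lambda*FP*N) = 1 / (FN + lambda*FP) = F(lambda). A quick numerical sketch (the 5% FN / 0.4% FP rates are hypothetical, chosen only for illustration):

```python
def tcr(n_spam, n_fn, n_fp, lam=50):
    # TCR = number of spam / (number of FNs + lambda * number of FPs)
    return n_spam / (n_fn + lam * n_fp)

def findlay(fn, fp, lam=50):
    # F(lambda) = 1 / (FN + FP * lambda), FN/FP as proportions
    return 1 / (fn + fp * lam)

n = 10_000                 # balanced corpus: n spam messages, n ham messages
fn, fp = 0.05, 0.004       # hypothetical 5% FN rate, 0.4% FP rate

# on a balanced corpus the two measures coincide
assert abs(tcr(n, fn * n, fp * n) - findlay(fn, fp)) < 1e-9
```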
From a user point of view Bayes doesn't need to be immutable, but it does need to be monotonic, and the 99 rank needs to be a major fraction of the spam score, and 00 should be pretty much the negative of that. So having bayes_99 be 6.5 or 4.5 or 8 is fine, but having it be 3.2 or 2.6 isn't fine. Of course from a computational point of view a score range restriction like that is probably worse to deal with than a fixed score, since it guarantees you an infinite number of possible solutions. But from a human point of view it is a nice concept.
here's another machine-learning "Challenge" -- http://challenge.spock.com/pages/learn_more $50k prize on this one. I doubt we could match that ;)
(In reply to comment #10)
> I propose the following new
> measurement. Let's call it the Findlay measurement:
>
> F(lambda) = 1 / (FN% + FP% * lambda)
>
> (Actually as defined above it's a function.)
>
> This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
> "undesirable" properties you mentioned.

it's not bad, but it needs work I'm afraid. For example, take a look at this:

fn=3   fp=1.5   f=0.0128205128205128
fn=18  fp=1.2   f=0.0128205128205128

in other words, if the FP% in this case dropped from 1.5% to 1.2%, that would be enough to disguise the FN% rate jumping from 3% to *18%*. ;)
That's exactly right, one FP% is worth 50 FN%, if you set lambda = 50. If you don't like that tradeoff, pick a different value for lambda. :-) I agree it gets ugly when you have that many FPs...
I should point out also that where I wrote "FN%" and "FP%" I really meant the proportion of FN/FPs and not percentage, so you need to divide the percentage by 100 before sticking it in the formula (or multiply the end result by 100). You end up with F=1.28, not F=0.0128.
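With that percentage-to-proportion correction, F(lambda) can be sketched as below; it reproduces both the 1.28 figure and the earlier masking example (fn=3/fp=1.5 vs. fn=18/fp=1.2 scoring identically):

```python
def findlay(fn_pct, fp_pct, lam=50):
    # F(lambda) = 1 / (FN + FP * lambda), with FN/FP as proportions,
    # so percentage inputs are divided by 100 first
    return 1 / (fn_pct / 100 + (fp_pct / 100) * lam)

# the masking problem: a small FP improvement hides a 6x FN regression
assert abs(findlay(3, 1.5) - findlay(18, 1.2)) < 1e-9

print(round(findlay(3, 1.5), 2))   # 1.28, not 0.0128
```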
Btw, the CEAS 2007 contest used the following metric:

  Filters will be evaluated using the lam() metric. Lam() calculates the
  average of a filter's false-negative and false-positive rates, but
  performs the calculation in logit space. The exact formula used for the
  competition will be:

    FPrate = (#ham-errors + 0.5) / (#ham + 0.5)
    FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
    lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)

  where

    logit(p) = log(p / (1-p))
    invlogit(x) = exp(x) / (1 + exp(x))

  The winner will be the filter with the lowest lam() score.

Here is an additional comment from Gordon V. Cormack (on the ceas-challenge@eight.pairlist.net ML):

  Lam is equivalent to the diagnostic odds ratio, which is used extensively in the diagnostic test literature. Intuitively, it rewards a fixed-factor improvement in either the spam or the ham misclassification rate by the same amount. So improving from .001 to .0009 fp gives the same score gain as from .01 to .009 fn. Lam is *almost* the same thing as the geometric mean of the two rates, which is perhaps more intuitive for most.

  Lam is discussed in the TREC 2005 proceedings http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05/ and DOR in the medical literature http://www.bmj.com/cgi/content/full/323/7305/157

  In a nutshell, we are looking for a threshold-independent measure that roughly captures the overall discriminative ability of a filter. It turns out that lam works pretty well for this. Better than, say, "accuracy", which is wildly threshold-dependent; optimizing accuracy may be totally inconsistent with good filtering. While lam does not explicitly reward low fp more than low fn, it does not actively discourage it. Other measures -- like ROC Area Under the Curve -- could have been used, but that would have required filters to return scores rather than categorical ham/spam judgements, and we considered this logistically unrealistic.
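For reference, here's a direct Python transcription of the lam() formulas quoted above (a sketch; the function and variable names are mine, not from the CEAS tooling):

```python
from math import log, exp

def logit(p):
    return log(p / (1 - p))

def invlogit(x):
    return exp(x) / (1 + exp(x))

def lam(ham_errors, n_ham, spam_errors, n_spam):
    # the 0.5 terms smooth the rates away from exactly 0 or 1,
    # so even a perfect filter yields a finite logit
    fp_rate = (ham_errors + 0.5) / (n_ham + 0.5)
    fn_rate = (spam_errors + 0.5) / (n_spam + 0.5)
    # average of the two error rates, taken in logit space
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2)
```

Note that lam() is symmetric in the two error rates: swapping FP and FN counts (on equal-sized corpora) gives the same score, a point that comes up later in the thread.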
lam sounds promising -- let's look into that. btw there's another issue with the F(lambda) idea -- if FP and FN are 0 -- ie there were no misclassifications, which can happen on small test corpora -- it yields a division by zero.
> if FP and FN are 0 -- ie there were no misclassifications [...] > -- it yields a division by zero That's an idea! Implement it and spammers will start producing perfectly classifiable spam in an attempt to crash the filters with divide by zero errors. We can foil them by leaving in a small bug to make perfection impossible, but stay as close to perfect as we want!
Re: divide by 0

That's a computational issue, not a theoretical one. If there are no FPs or FNs, then it makes sense for TCR (or F) to be infinite. You just need to remember what TCR is: a ratio between the costs of dealing with spam without and with a spam filter. (http://wiki.spamassassin.org/TotalCostRatio)
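If a finite figure were wanted anyway, one computational workaround would be to smooth the counts the same way the CEAS lam() rates add 0.5 to each count. A sketch of that idea (my own variant, not something the thread settled on):

```python
def tcr_smoothed(n_spam, n_fn, n_fp, lam=50):
    # add-0.5 smoothing (borrowed from the CEAS lam() rate formulas)
    # keeps the denominator positive even for a perfect run with
    # no FNs and no FPs, at the cost of a slight downward bias
    return n_spam / ((n_fn + 0.5) + lam * (n_fp + 0.5))

print(tcr_smoothed(1000, 0, 0))  # finite, where plain TCR would divide by zero
```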
(In reply to comment #19)
> Re: divide by 0
> That's a computational issue, not a theoretical one. If there are no FPs or FNs,
> then it makes sense for TCR (or F) to be infinite.

it may make sense from the POV of TCR as an abstract concept, but it doesn't make much sense for us to use it if it breaks like that. ;)
lam() avoids the problem noted in comment 13, but has another problem; it doesn't have any concept of an FP being worse than an FN. This means that e.g. fp%=1.5 fn%=5 and fp%=5 fn%=1.5 have the same score, whereas the latter is much worse for our users.
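The symmetry follows directly from lam() averaging the two logits, so swapping the FP and FN rates cannot change the score. A self-contained numerical check of the example rates above:

```python
from math import log, exp

def logit(p):
    return log(p / (1 - p))

def invlogit(x):
    return exp(x) / (1 + exp(x))

def lam(fp_rate, fn_rate):
    # averaging in logit space is symmetric in its two arguments
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2)

# fp%=1.5 / fn%=5 and fp%=5 / fn%=1.5 get identical lam() scores,
# even though the second is far worse for ham
assert abs(lam(0.015, 0.05) - lam(0.05, 0.015)) < 1e-12
```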
here's a version of lam() with a lambda calculation...

#!/usr/bin/perl
# http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376#c16

my ($lambda, $fppc, $fnpc, $nspam, $nham) = @ARGV;

my $fprate = ((($fppc * $nham) / 100) + 0.5) / ($nham + 0.5);
my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);

sub logit {
  my $p = shift;
  return log($p / (1 - $p));
}

sub invlogit {
  my $x = shift;
  return exp($x) / (1 + exp($x));
}

# weight the FP logit by lambda instead of averaging them equally
my $llam = invlogit((($lambda * logit($fprate)) + logit($fnrate)) / ($lambda + 1));
print "Llam(l=$lambda, fp=$fppc, fn=$fnpc, ns=$nspam, nh=$nham): $llam\n";

some results:

Llam(l=10, fp=1, fn=5, ns=10000, nh=10000): 0.011653428823489
Llam(l=10, fp=2, fn=5, ns=10000, nh=10000): 0.0218100293576761
Llam(l=10, fp=5, fn=1, ns=10000, nh=10000): 0.0433911604792563

so it avoids the problem with the original lam(). however:

Llam(l=10, fp=1.5, fn=5, ns=10000, nh=10000): 0.0168115579465237
Llam(l=10, fp=1.0, fn=20, ns=10000, nh=10000): 0.0134020941849576

I think this is a problem. IMO a 20% FN rate / 1.0% FP rate should not score better than 5% FN / 1.5% FP. It's a good drop in the FP rate, sure -- but a filter with 20% FNs is unusable. :(

so:

- TCR: varies widely based on the size of the corpora
- F(): good FP rates can mask terrible FN rates
- lam(): treats FPs and FNs as equal, no concept of lambda
- Llam(): again, good FP rates can mask terrible FN rates

we still don't have a good single-figure metric imo.
I think we should go ahead with using FP% and FN% -- a double-figure metric -- instead of any single figure metric. I don't think any of the single-figure metrics perform well enough in all situations to match FP%/FN%'s expressiveness.
Closing ancient stale bug.