SA Bugzilla – Bug 1050
proposal: allow transforms for tests
Last modified: 2004-11-30 10:10:25 UTC
Proposal to add a new type of test: word sequence. Body tests generally use whitespace between words as a delimiter. For a word sequence test, we could apply one or more tr/// operations (different ones for different rules) before running the test. Perhaps the tr/// operations could be generalized, but there may be some issues, such as internationalization, that would need to be worked out.

The tests would work something like the current spam phrase engine:

    tr/A-Za-z/ /cs;
    tr/A-Z/a-z/;
    for (split) { test internals }

except that the tr/// sequence would be specific to each rule, and the split could then just operate on the second argument of the tr///cs operation (in this case, spaces).

Now for rules. We have a few ways to go here, but I suspect sticking with Perl regular expressions would be the best bet, since that's what we're used to writing. Then this is really just a transform done before "body" tests. For example, take a simple word sequence test where the string is long enough that we don't really care about whitespace or other intervening characters.

original:

    body USE_IDENTITY /using your identity/i

new:

    body USE_IDENTITY tr/A-Za-z//cd; /usingyouridentity/i

Another, more cautious version would use non-letters as delimiters to arrive at this:

    body USE_IDENTITY tr/A-Za-z/ /cs; /using your identity/i

If certain transforms become common, we could use aliases, for example:

    body USE_IDENTITY EN_SPACE; /using your identity/i

We could also allow s/// substitutions, but that would have more overhead and we should avoid it if possible; tr/// is much faster. Even with this proposal as it stands, we would have to make a copy of each line being tested to avoid changing it.

In case it isn't obvious why I am proposing this, see bug 1047.
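To make the two transforms concrete, here is a minimal standalone Perl sketch of what a per-rule transform followed by a plain regexp might look like. The helper names and the sample line are made up for illustration and are not part of SpamAssassin; the only things taken from the proposal are the two tr/// operations and the two regexps.

```perl
use strict;
use warnings;

# Hypothetical helpers, not SpamAssassin API. Each one works on its own
# copy of the line (the argument is copied out of @_), so the original
# text stays intact for every other rule.

# tr/A-Za-z/ /cs: replace each run of non-letters with a single space.
sub squeeze_nonletters {
    my ($copy) = @_;
    $copy =~ tr/A-Za-z/ /cs;
    return $copy;
}

# tr/A-Za-z//cd: delete every non-letter.
sub delete_nonletters {
    my ($copy) = @_;
    $copy =~ tr/A-Za-z//cd;
    return $copy;
}

# Made-up sample line with punctuation between the words.
my $line = "Stop others from using -- your *identity* now";

print "cautious rule hit\n"
    if squeeze_nonletters($line) =~ /using your identity/i;
print "aggressive rule hit\n"
    if delete_nonletters($line) =~ /usingyouridentity/i;

# The untransformed rule does not fire on this line.
print "plain rule hit\n" if $line =~ /using your identity/i;
```

Note that the cautious version keeps word boundaries, so it would not match text where the letters of a single word are split apart (for example "u.s.i.n.g"), while the delete-everything version would.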
*** Bug 1047 has been marked as a duplicate of this bug. ***
I just want to add my vote of support for coming up with some way around the non-whitespace problem. I'm starting to see a lot more spam with every word in the subject in single quotes, sometimes even doubled single quotes. Here's an example:

    Subject: ''Free'' Shipping 'offer' - Inkjet ''Cartridges'' - up to 80% off

Unfortunately, I'm not familiar enough with SA to suggest which alternative might be the best.
Not sure it'll be necessary; Bayes will catch it nicely, I should think. Our tokenizer can cope with that just fine.
It seems like it might be useful to have some way to test word sequences. Bayes doesn't do that. Most of our body tests amount to word sequences, and all the business about .{0,5} and the like is there to allow intermediate words or to get around quotes and similar noise. Transforms are rather fast, and having some tests eliminate non-word characters would not be too expensive. My original scheme may be a bit excessive, but I think some solution may be useful.
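As a rough illustration of the difference between the .{0,5} style and a transform, the Perl sketch below runs both against the quoted subject reported earlier in this bug and against a second, made-up subject. The helper names and both regexps are hypothetical, written only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical gap-style rule: allow up to five arbitrary characters
# between the words.
my $gap_re = qr/free.{0,5}shipping.{0,5}offer/i;

# Hypothetical transform-style rule: a plain phrase, tested after every
# run of non-letters has been squeezed to one space.
my $phrase_re = qr/free shipping offer/i;

sub gap_hit {
    my ($text) = @_;
    return $text =~ $gap_re ? 1 : 0;
}

sub transform_hit {
    my ($copy) = @_;             # copy, so the caller's text is unchanged
    $copy =~ tr/A-Za-z/ /cs;
    return $copy =~ $phrase_re ? 1 : 0;
}

my @subjects = (
    "''Free'' Shipping 'offer' - Inkjet ''Cartridges'' - up to 80% off",
    "Free of shipping offer",    # made-up subject with an extra word
);

for my $s (@subjects) {
    printf "gap=%d transform=%d  %s\n", gap_hit($s), transform_hit($s), $s;
}
```

Both rules fire on the quoted subject. Only the gap-style rule fires on the second one, because .{0,5} accepts letters as well as punctuation, whereas the transform only discards non-letters.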
OK, I see your point. I'd suggest that a new test type would be better, e.g.

    phrase USE_IDENTITY using your identity

and skipping words would be

    phrase USE_IDENTITY using 1WORD identity

or similar. This would be handy for future use of a lexer for speed, and is a bit more readable...

...thinks... hmm, maybe not. This has no way of doing alternatives like /our readers (?:to|with)/i, so we're pretty much stuck with regexps. However, I still think making a new "rule type" for this is easiest.
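For what it's worth, a "phrase" rule of this kind could be compiled down to an ordinary regexp. The Perl sketch below is one possible reading, not an agreed design: compile_phrase is a hypothetical helper, words are assumed to be runs of ASCII letters, and NWORD is assumed to mean exactly N intervening words (N of at least 1).

```perl
use strict;
use warnings;

# Hypothetical compiler for a "phrase" rule body such as
# "using 1WORD identity". Assumptions: a word is a run of ASCII letters,
# words are separated by one or more non-letters, and NWORD stands for
# exactly N arbitrary words.
sub compile_phrase {
    my ($spec) = @_;
    my $sep  = '[^A-Za-z]+';
    my $word = '[A-Za-z]+';
    my @parts;
    for my $tok (split ' ', $spec) {
        if ($tok =~ /^([1-9]\d*)WORD$/) {
            # N wildcard words joined by separators
            push @parts, join($sep, ($word) x $1);
        } else {
            # literal word, with regexp metacharacters escaped
            push @parts, quotemeta($tok);
        }
    }
    my $re = join($sep, @parts);
    # Anchor on word boundaries so the phrase cannot start or end mid-word.
    return qr/(?<![A-Za-z])$re(?![A-Za-z])/i;
}

my $plain = compile_phrase("using your identity");
my $skip  = compile_phrase("using 1WORD identity");

print "plain hit\n" if "Using 'your' identity!" =~ $plain;
print "skip hit\n"  if "using *my* identity"    =~ $skip;
```

As noted above, a syntax like this still has no way to express alternatives, which is the main argument for staying with regexps.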
See bug 1207 as a suggestion on how to deal with the unpleasantness described in comment 2. BTW, why isn't comment numbering active in this Bugzilla instance?
See also bug 1002 for a proposed workaround (summary: add a new backslash-entity for "whitespace or other word-sep chars", which we expand before passing the pattern to the regexp engine).
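A minimal Perl sketch of that kind of expansion step follows. The entity name \_ and the character class it expands to are placeholders chosen for this example; the actual spelling proposed in bug 1002 is not given here, and expand_entities is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumed expansion for the hypothetical \_ entity: one or more
# characters that are neither ASCII letters nor digits.
my $WORD_SEP = '[^A-Za-z0-9]+';

# Walk the rule source one backslash escape at a time, so that an escaped
# backslash followed by "_" (written \\_) is left alone.
sub expand_entities {
    my ($pattern) = @_;
    $pattern =~ s/\\(.)/$1 eq '_' ? $WORD_SEP : "\\$1"/gse;
    return $pattern;
}

# Hypothetical rule source as it might appear in a rules file.
my $rule_src = 'using\_your\_identity';
my $expanded = expand_entities($rule_src);
my $re       = qr/$expanded/i;

print "expanded: $expanded\n";
print "hit\n" if "using -- your 'identity'" =~ $re;
```

The expansion happens once, when the rule is compiled, so it adds no per-message cost beyond what the expanded regexp itself costs.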
*** Bug 3454 has been marked as a duplicate of this bug. ***
It seems this isn't really an issue... Bayes seems to handle most of these situations, and it seems like it would provide little gain.