1050 – proposal: allow transforms for tests

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries
Version:	SVN Trunk (Latest Devel Version)
Hardware:	Other other

Importance:	P3 enhancement
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

Duplicates (2):	1047 3454
Depends on:	1002

Reported:	2002-10-02 23:46 UTC by Daniel Quinlan
Modified:	2004-11-30 10:10 UTC
CC List:	2 users

Description Daniel Quinlan 2002-10-02 23:46:53 UTC

Proposal to add a new type of test: word sequence.  Body tests generally use
whitespace between words as a delimiter.  For a word sequence test, we could
apply one or more tr/// operations (different ones for different rules) before
running the test.  Perhaps the tr/// operations could be generalized, but some
issues, such as internationalization, would need to be worked out.

The tests would work something like the current spam phrase engine:

  tr/A-Za-z/ /cs;
  tr/A-Z/a-z/;
  for (split) {
    # test internals go here
  }

Except the tr/// sequence would be specific to each rule, and the split
could then operate on the second argument of the tr///cs operation (in this
case, a space).
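
A minimal sketch of the transform-then-split flow described above, for
illustration only.  The helper name words_for_rule is hypothetical and is not
part of SpamAssassin; the two tr/// operations are the ones shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (not SpamAssassin code): normalize a copy of the line,
# then return the words that a per-word test would iterate over.
sub words_for_rule {
    my ($line) = @_;
    (my $copy = $line) =~ tr/A-Za-z/ /cs;   # each run of non-letters -> one space
    $copy =~ tr/A-Z/a-z/;                   # lowercase
    return split ' ', $copy;                # split on spaces, skipping a leading one
}

print join('|', words_for_rule('Using-your *identity*!')), "\n";
# prints "using|your|identity"
```

Splitting on ' ' works here because a space is the replacement character of
the tr///cs step, so every delimiter has already been collapsed to it.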

Now for rules.  We have a few ways to go here, but I suspect sticking with
Perl regular expressions would be the best bet since that's what we're used
to writing.  Then this is really just a transform done before "body" tests.

For example, take a simple word sequence test where the string is long
enough that we don't really care about whitespace or other intervening
characters.

original:

body USE_IDENTITY           /using your identity/i

new:

body USE_IDENTITY           tr/A-Za-z//cd; /usingyouridentity/i

Another, more cautious version would use non-letters as delimiters to arrive
at this:

body USE_IDENTITY           tr/A-Za-z/ /cs; /using your identity/i
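
To make the difference between the two variants concrete, here is a small
sketch in plain Perl.  The function names match_deleted and match_squeezed are
made up for this illustration; only the tr/// operations and patterns come
from the rules above.

```perl
use strict;
use warnings;

# Aggressive variant: delete every non-letter, then match the joined words.
sub match_deleted {
    my ($line) = @_;
    (my $copy = $line) =~ tr/A-Za-z//cd;
    return $copy =~ /usingyouridentity/i ? 1 : 0;
}

# Cautious variant: squeeze each run of non-letters into a single space,
# so word boundaries survive the transform.
sub match_squeezed {
    my ($line) = @_;
    (my $copy = $line) =~ tr/A-Za-z/ /cs;
    return $copy =~ /using your identity/i ? 1 : 0;
}

for my $line ("u.s.i.n.g your identity", "using 'your' -- identity") {
    printf "%-26s deleted=%d squeezed=%d\n",
        $line, match_deleted($line), match_squeezed($line);
}
```

The first sample is caught only by the delete variant, because the squeeze
variant turns "u.s.i.n.g" into five separate one-letter words.  The second
sample is caught by both.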

If certain transforms become common, we could use aliases, for example:

body USE_IDENTITY           EN_SPACE; /using your identity/i
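
One possible way to implement such aliases is a lookup table of named
transforms.  This is only a sketch: the table, the run_rule helper, and the
EN_DELETE alias are assumptions made for illustration; only the EN_SPACE name
appears in this proposal.

```perl
use strict;
use warnings;

# Hypothetical alias table: alias name -> code ref returning a transformed copy.
my %TRANSFORM = (
    EN_SPACE  => sub { (my $c = $_[0]) =~ tr/A-Za-z/ /cs; $c },  # squeeze non-letters
    EN_DELETE => sub { (my $c = $_[0]) =~ tr/A-Za-z//cd;  $c },  # assumed second alias
);

# Apply the named transform to a copy of the line, then run the rule's regexp.
sub run_rule {
    my ($alias, $re, $line) = @_;
    my $transform = $TRANSFORM{$alias} or die "unknown transform alias: $alias";
    return $transform->($line) =~ $re ? 1 : 0;
}

print run_rule('EN_SPACE', qr/using your identity/i, 'Using... YOUR "identity"'), "\n";
# prints "1"
```

Code refs are used because tr/// builds its translation table at compile
time, so the character lists cannot simply be interpolated from a string.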

We could also allow s/// substitutions, but that would have more overhead
and we should avoid it if possible.  tr/// is much faster.  Already with this
proposal, we would have to make a copy of each line being tested to avoid
changing it.
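
The copy mentioned above can be sketched in two ways.  The first idiom works
on any Perl 5; the second relies on the /r modifier, which only exists in
Perl 5.14 and later and so postdates this proposal.  Variable names are
illustrative.

```perl
use strict;
use warnings;
use 5.014;   # only needed for the tr///r form below

my $line = q{Using *your* identity};

# Classic idiom: assign to a copy, then transform the copy in place.
(my $copy = $line) =~ tr/A-Za-z/ /cs;

# Perl 5.14+: /r leaves $line alone and returns the transformed string.
my $copy_r = $line =~ tr/A-Za-z/ /csr;

print "$line\n$copy\n$copy_r\n";
```

Either way, $line itself is left untouched for the tests that follow.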

In case it isn't obvious why I am proposing this, see bug 1047.

Comment 1 Daniel Quinlan 2002-10-02 23:49:23 UTC

*** Bug 1047 has been marked as a duplicate of this bug. ***

Comment 2 DJ Atkinson 2002-11-12 13:56:13 UTC

I just want to add my vote of support for coming up with some way around the
non-whitespace problem.  I'm starting to see a lot more spam with every word
in the subject in single quotes, sometimes even doubled single quotes.  Here's
an example:

Subject: ''Free'' Shipping 'offer' - Inkjet ''Cartridges'' - up to 80% off

Unfortunately, I'm not familiar enough with SA to suggest which alternative 
might be the best.
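
For illustration, this is what the squeeze transform from the description
would do to that Subject line.  It is a plain-Perl sketch, not SpamAssassin
code, and the rule pattern is a made-up example.

```perl
use strict;
use warnings;

my $subject = q{''Free'' Shipping 'offer' - Inkjet ''Cartridges'' - up to 80% off};

# Copy the header value, then squeeze every run of non-letters to one space.
(my $normalized = $subject) =~ tr/A-Za-z/ /cs;

my $rule = qr/free shipping offer/i;   # hypothetical rule pattern
printf "raw: %s, normalized: %s\n",
    ($subject    =~ $rule ? 'hit' : 'miss'),
    ($normalized =~ $rule ? 'hit' : 'miss');
# prints "raw: miss, normalized: hit"
```

Note that this particular transform also discards digits, so "80%" vanishes
from the normalized text.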

Comment 3 Justin Mason 2002-11-12 14:31:46 UTC

Not sure if it'll be necessary -- Bayes will catch it nicely,
I should think.  Our tokenizer can cope with that just fine.

Comment 4 Daniel Quinlan 2002-11-12 14:35:06 UTC

It seems like it might be useful to have some way to test word sequences.
Bayes doesn't do that.  Most of our body tests amount to word sequences
and all the business about .{0,5} and such is to allow intermediate words
or get around quotes and such.

Transforms are rather fast, and having some tests eliminate non-word characters
would not be too expensive.

My original scheme may be a bit excessive, but I think some solution may be
useful.

Comment 5 Justin Mason 2002-11-12 15:40:16 UTC

OK, I see your point.

I'd suggest that a new test type would be better;
e.g.

   phrase USE_IDENTITY    using your identity

and skipping words would be

   phrase USE_IDENTITY    using 1WORD identity

or similar.  This would be handy for future use of a lexer for speed,
and is a bit more readable...

...thinks... hmm, maybe not.  This has no way of doing alternatives
like /our readers (?:to|with)/i, so we're pretty much stuck with
regexps.

However I still think making a new "rule type" for this is easiest.
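
As a rough sketch of how such a "phrase" rule could be compiled down to a
regexp: the phrase_to_regex helper and its choice of separator pattern are
assumptions, and only the 1WORD token and the example phrases come from this
comment.

```perl
use strict;
use warnings;

# Hypothetical compiler: literal words are quoted, 1WORD matches any single
# run of letters, and words may be separated by any run of non-letters.
sub phrase_to_regex {
    my ($phrase) = @_;
    my @parts = map { $_ eq '1WORD' ? '[A-Za-z]+' : quotemeta($_) }
                split ' ', $phrase;
    my $pattern = join '[^A-Za-z]+', @parts;
    return qr/\b$pattern\b/i;
}

my $re = phrase_to_regex('using 1WORD identity');
print "using 'your' identity" =~ $re ? "hit\n" : "miss\n";   # prints "hit"
print "using identity"        =~ $re ? "hit\n" : "miss\n";   # prints "miss"
```

As noted above, a word list like this cannot express alternatives such as
(?:to|with) without further syntax.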

Comment 6 Rob McMillin 2002-11-12 16:04:24 UTC

See bug 1207 as a suggestion on how to deal with the unpleasantness described 
in comment 2.

BTW, why isn't comment numbering active in this Bugzilla instance?

Comment 7 Justin Mason 2002-12-13 10:46:07 UTC

See also bug 1002 for a proposed workaround (summary: add a new
backslash-entity for "whitespace or other word-sep chars",
which we expand before passing the pattern to the regexp engine).
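
A sketch of what that expansion step might look like.  The entity name \_ and
the character class it expands to are placeholders invented for this example;
they are not taken from bug 1002.

```perl
use strict;
use warnings;

# Hypothetical expansion: replace the placeholder entity \_ with a character
# class before the pattern is compiled.  This naive version does not handle
# an escaped backslash directly before the underscore.
sub expand_wordsep {
    my ($pattern) = @_;
    (my $expanded = $pattern) =~ s/\\_/[^A-Za-z0-9]+/g;
    return qr/$expanded/i;
}

my $re = expand_wordsep('using\_your\_identity');
print "using 'your' -- identity" =~ $re ? "hit\n" : "miss\n";   # prints "hit"
```

Unlike a tr/// transform, this approach leaves the message text alone and
changes only the rule's pattern.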

Comment 8 Daniel Quinlan 2004-06-01 10:53:22 UTC

*** Bug 3454 has been marked as a duplicate of this bug. ***

Comment 9 Duncan Findlay 2004-11-30 19:10:25 UTC

It seems this isn't really an issue... Bayes seems to handle most of these
situations, and this change seems like it would provide little gain.