Bug 6674 – new rules for Polish users

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Translations and Languages
Version:	unspecified
Hardware:	All All

Importance:	P2 enhancement
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

Reported:	2011-10-13 12:56 UTC by Lemat
Modified:	2011-10-27 02:05 UTC
CC List:	4 users

Description Lemat 2011-10-13 12:56:50 UTC

In Poland there are spammers who send "an invitation to receive spam". They often refer to our legal system, which disallows UCE but allows UBE (something like the CAN-SPAM Act). Those spammers think that if they include those references in the signature of the spam, the spam will be less spammy. The following rules will catch those references.
These rules have been working for me for ~3 years with 1 false positive.
The score of 99 works for me.
Each ".{1,8}" stands in for a non-ASCII character (byte value above 127): ĄąĆćĘęŁłŃńÓóŚśŻżŹź (AaCcEeLlNnOoSsZzZz with diacritics).

body LEMAT_1            /Zgodnie z Ustaw.{1,8}z dnia 26.08.2002 r. \(Dz. U. nr 144, poz.1204\)/i
describe LEMAT_1        Pytanie o zgode na otrzymywanie spamu
score LEMAT_1           99

body LEMAT_2            /wiadomo.{1,8}t.{1,8}wys.{1,8}ali.{1,8}my na og.{1,8}lnodost.{1,8}pny adres e-mailowy/i
describe LEMAT_2        Handlarz adresow email
score LEMAT_2           99

body LEMAT_3            /Je.{1,8}eli nie .{1,8}ycz.{1,8}sobie Pa.{1,8}stwo otrzymywania podobnych informacji/i
describe LEMAT_3        Opt-out spam
score LEMAT_3           99

body LEMAT_5            /zgodnie z Ustaw.{1,8}z dnia 18 lipca 2002 r. o .{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}/i
describe LEMAT_5        Pytanie o zgode na otrzymywanie spamu
score LEMAT_5           99

body LEMAT_6            /Sekcja Informacji Ekonomicznej INSTYTUTU PROMOCJI EKSPORTU I KOOPERACJI bazy danych: FIRMY W POLSCE, FIRMY EUROPY/i
describe LEMAT_6        Spammer IPEIK
score LEMAT_6           99

body LEMAT_7            /\(Dz.\s?U. z 2002r, nr 144 poz. 1204 ustawy z dnia 18 lipca 2002 r.\) oraz dyrektyw.{1,8}UOKiK/i
describe LEMAT_7        Pytanie o zgode na otrzymywanie spamu
score LEMAT_7           99

body LEMAT_8            /Dopuszczalne jest przes.{1,8}anie na adres e-mail pytania.{1,8}czy adresat zgadza si.{1,8}na otrzymywanie drog.{1,8}elektroniczn.{1,8}informacji handlowej/i
describe LEMAT_8        Pytanie o zgode na otrzymywanie spamu
score LEMAT_8           99

body LEMAT_9            /materia.{1,8}y z konferencji nt. Bezpiecze.{1,8}stwa w Sieci Internet, Warszawa 14 marca 2006/i
describe LEMAT_9        Pytanie o zgode na otrzymywanie spamu
score LEMAT_9           99

body LEMAT_10           /Pa.{1,8}stwa adres e-mail pochodzi z og.{1,8}lnie dost.{1,8}pnych .{1,8}r.{1,8}de/i
describe LEMAT_10       Harvester adresow email
score LEMAT_10          99

body LEMAT_26           /Niniejsza wiadomo.{1,8}nie jest informacj.{1,8}handlow.{1,8}, a jedynie zapytaniem o zgod.{1,8}na przesy.{1,8}anie informacji handlowych drog.{1,8}elektroniczn/i
describe LEMAT_26       Pytanie o zgode na otrzymywanie spamu
score LEMAT_26          99

body LEMAT_27           /ustaw.{1,8}z dnia 18 lipca 2002\s*r.{1,8}o.{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.\s?U. z (9 wrze.{1,8}nia )?2002\s*r. Nr 144, poz 1204/i
describe LEMAT_27       Pytanie o zgode na otrzymywanie spamu
score LEMAT_27          99

body LEMAT_29           /Pa.{1,8}stwa dane teleadresowe otrzymali.{1,8}my z bazy HBI Polska sp. z o. o./i
describe LEMAT_29       Paser adresow email
score LEMAT_29          99

body LEMAT_31           /ustaw.{1,8}z dnia 18 lipca 2002\s*r.{1,8}o.{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.\s?U. Nr 144 z 9 wrze.{1,8}nia 2002\s*r., poz 1204/i
describe LEMAT_31       Pytanie o zgode na otrzymywanie spamu
score LEMAT_31          99

body LEMAT_34           /Pa.{1,8}stwa adres mejlowy pobrali.{1,8}my z og.{1,8}lnie dost.{1,8}pnych serwis.{1,8}w internetowych/i
describe LEMAT_34       Harvester adresow email
score LEMAT_34          99

body LEMAT_35           /W zwi.{1,8}zku z art. 10 ustawy z dnia 18 lipca 2002 r. o .{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.U. nr 144, poz. 1204\)/i
describe LEMAT_35       Pytanie o zgode na otrzymywanie spamu
score LEMAT_35          99

body LEMAT_36           /Ustaw.{1,8}z dnia 18.07.2002 r. o.{1,8}wiadczeniu Us.{1,8}ug Drog.{1,8}Elektroniczn.{1,8}\(Dz. U. 2002, nr 144, poz. 1204\)/i
describe LEMAT_36       Pytanie o zgode na otrzymywanie spamu
score LEMAT_36          99


The above rules are for Poland only. For other countries they will just waste CPU cycles.
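
The ".{1,8}" placeholder used in the rules above can be checked outside SpamAssassin. The following is a minimal Python sketch, not part of the submitted ruleset. It assumes the rule sees the message body as raw bytes in whatever charset the sender used, and that Python's re module treats this particular pattern the same way Perl does. PHRASE_ASCII, variants and rule_hits are names made up for the illustration.

```python
import re

# LEMAT_3 from the report, compiled as a bytes pattern so "." consumes one byte.
LEMAT_3 = re.compile(
    rb"Je.{1,8}eli nie .{1,8}ycz.{1,8}sobie Pa.{1,8}stwo "
    rb"otrzymywania podobnych informacji",
    re.IGNORECASE,
)

PHRASE = "Jeżeli nie życzą sobie Państwo otrzymywania podobnych informacji"
# Hypothetical variant where the sender dropped the diacritics.
PHRASE_ASCII = "Jezeli nie zycza sobie Panstwo otrzymywania podobnych informacji"


def rule_hits(raw_body: bytes) -> bool:
    """Return True if the LEMAT_3 pattern is found anywhere in the raw body."""
    return LEMAT_3.search(raw_body) is not None


# The same sentence as it might arrive in different charsets.
variants = {
    "utf-8": PHRASE.encode("utf-8"),
    "iso-8859-2": PHRASE.encode("iso-8859-2"),
    "cp1250": PHRASE.encode("cp1250"),
    "ascii": PHRASE_ASCII.encode("ascii"),
}

for name, raw in variants.items():
    print(name, rule_hits(raw))
```

All four variants hit: each Polish letter takes one byte in ISO-8859-2 and windows-1250, two bytes in UTF-8, and one byte when the diacritics are dropped, which is all within the 1 to 8 range. Note that in this rule the placeholder after "ycz" also has to swallow the space that follows the accented letter.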

Comment 1 Darxus 2011-10-13 16:34:50 UTC

Seems like these should be added to the default rule set?  Lots of language specific rules in there already, mostly English.

Lemat, we could always use more non-English corpora for rule checking and score generation via masscheck:  http://wiki.apache.org/spamassassin/NightlyMassCheck

Comment 2 AXB 2011-10-13 16:51:22 UTC

(In reply to comment #1)
> Seems like these should be added to the default rule set?  Lots of language
> specific rules in there already, mostly English.

IMO, such specific regional rules should be a separate OPTIONAL rule set under

http://wiki.apache.org/spamassassin/CustomRulesets

So anybody who needs them can add them.
In the default ruleset they would add little value for most users.

Ideally some volunteer would start language based sa-update channels.. ideally.

Without a masscheck / Polish corpus the scores stand no chance anyway, so CustomRulesets is the ideal place to go.

Comment 3 Darxus 2011-10-13 16:59:04 UTC

So you actually think it's appropriate for spamassassin to only be useful in English?  That seems pretty wrong to me.

Although it's a good point that without any Polish data coming in through masscheck we couldn't use these in the default set.

Comment 4 AXB 2011-10-13 17:09:41 UTC

(In reply to comment #3)
> So you actually think it's appropriate for spamassassin to only be useful in
> English?  That seems pretty wrong to me.

When 99% of the spam is in English I don't see the problem.
or do we want to impose 100 language rulesets on ppl who don't need them?

> Although it's a good point that without any Polish data coming in through
> masscheck we couldn't use these in the default set.

SA is a framework with a basic set of rules. These work for most ppl for pretty decent default spam detection.
Those who need more can find rules via third party sa-update channels or download links in the Wiki.

Same applies to obscure RBLs, etc.

Comment 5 Lemat 2011-10-13 18:28:54 UTC

1) I can start maintaining these rules as CustomRuleset, I see that "Polish Language Ruleset" is empty and Status is "?". I just need to know what shall I do.

2) I'm thinking that maybe SA rules could be packaged with country-specific customrulesets and the postmaster would decide which rulesets are used, something like:

preload_rulesets pl de gr
in local.cf

or maybe SA could detect a language (which is not trivial) and load appropriate customruleset.

I see: Greek, German are active. Romanian is marked as active, but it is empty.

3) Problem with NightlyMassCheck. I like to set up my mail servers to reject all spam before DATA, during the SMTP session. Therefore I reject most English spam from zombies, leaving some 419ers and lots of Polish spam, which makes that 99% not exactly true. I have /var/amavis/quarantine with the remaining pile of spam, and the majority of it is Polish spam. I'm not allowed to peek at other users' mailboxes and I do not receive much spam myself. Altogether this makes NightlyMassCheck unusable for me.

Therefore my scores are like GTUBE/EICAR.

Comment 6 D. Stussy 2011-10-13 19:00:21 UTC

I concur:  A custom ruleset with its own update channel is the way to go for all non-English languages (with very limited exceptions).

The most common exception being when English-speaking users are spammed in other languages.  As I'm in the U.S., I've seen spam in Spanish, German, Chinese, and Japanese, but not other languages.  There already exist Chinese, French, German, Greek, and Japanese SA channels, and a prior Polish ruleset from 2005.

Perhaps the custom ruleset web page should be subdivided into two specific sections:  Languages other than English, and "other" collections.  I also note that we could add "SA channel" and PGP key fields to the list, where available (or leave that for the "more information" link).

http://wiki.apache.org/spamassassin/ContributingNewRules should also point to the custom ruleset page.  It doesn't.

Comment 7 Darxus 2011-10-13 19:14:46 UTC

(In reply to comment #5)
> 1) I can start maintaining these rules as CustomRuleset, I see that "Polish
> Language Ruleset" is empty and Status is "?". I just need to know what shall I
> do.

That would probably be great.  

At the top of the wiki page, click "Login", then "you can create one now" to create a wiki account, then email your wiki username to dev@spamassassin.apache.org to request write access.  

The Polish language rule set is here:
http://svn.apache.org/repos/asf/spamassassin/branches/3.1/rules/25_body_tests_pl.cf

It looks like the download link was broken by somebody disabling the ability to attach files to the wiki.  It might be best to just copy the entire contents of that 25_body_tests_pl.cf file onto that wiki page and start editing from there.  

Let us know if you need any help.

> 2) I'm thinking that maybe SA rules could be packaged with country-specific
> customrulesets and the postmaster would decide which rulesets are used,
> something like:
> 
> preload_rulesets pl de gr
> in local.cf

Sounds nice to me.

> 3) Problem with NightlyMassCheck. I like to set up my mail servers to reject
> all spam before DATA, during the SMTP session. Therefore I reject most English
> spam from zombies, leaving some 419ers and lots of Polish spam, which makes
> that 99% not exactly true. I have /var/amavis/quarantine with the remaining
> pile of spam, and the majority of it is Polish spam. I'm not allowed to peek at
> other users' mailboxes and I do not receive much spam myself. Altogether this
> makes NightlyMassCheck unusable for me.

I believe your situation is not uncommon for people who are contributing via masscheck.  Some people only provide non-spam.  So your data would still be useful to us.

Comment 8 Darxus 2011-10-13 19:18:13 UTC

Just so others don't need to dig up these links:
http://wiki.apache.org/spamassassin/CustomRulesets lists a "Polish Language Ruleset" at http://wiki.apache.org/spamassassin/BodyTestsPl .

(In reply to comment #6)
> Perhaps the custom ruleset web page should be subdivided into two specific
> sections:  Languages other than English, and "other" collections.  I also note

Looks like it.  Go for it.

Comment 9 Adam Katz 2011-10-13 21:15:04 UTC

(In reply to comment #5)
> 1) I can start maintaining these rules as CustomRuleset, I see that "Polish
> Language Ruleset" is empty and Status is "?". I just need to know what shall I
> do.

Segregating rulesets by language is generally a bad idea because it limits visibility (FPs get minimized and ignored) and it becomes impossible to maintain.  There is nothing wrong with this approach if it is not part of the main project, say as an sa-update channel.

> or maybe SA could detect a language (which is not trivial) and load
> appropriate customruleset.
> 
> I see: Greek, German are active. Romanian is marked as active, but it is
> empty.

Language detection with TextCat is awful.  It's better than nothing, but it is frequently wrong.

> 2) I'm thinking that maybe SA rules could be packaged with country-specific
> customrulesets and the postmaster would decide which rulesets are used,
> something like:
> 
> preload_rulesets pl de gr
> in local.cf

I believe language-specific rulesets are already possible in SA via locale support (though note you can currently only have one locale).  Though I've never tried it, you can conceivably write rules like this:

  lang pl body PL_FOO /\btawerna\b/i

PL_FOO would then only be run if the system locale is Polish.  This is currently only used for "describe" lines.

However, I'd rather see this implemented as channels.

If we wanted to get more specific, I'd say the channels should be vetted through mass-check (as my channels are), so that rules good enough to be mainstream can be automatically promoted.  It should be noted that the current ruleqa system with its current corpora is not at all set up to properly evaluate rule efficacy for Polish language mail and would do an awful job.

Comment 10 Darxus 2011-10-26 22:18:02 UTC

Lemat added a "Polish Language Ruleset 2" to http://wiki.apache.org/spamassassin/CustomRulesets
I expect that's as far as this will go.  Closing.

(In reply to comment #4)
> When 99% of the spam is in English I don't see the problem.
> or do we want to impose 100 language rulesets on ppl who don't need them?

I believe the majority of the spam I receive that SA misses is not English.

Comment 11 AXB 2011-10-26 22:28:44 UTC

(In reply to comment #10)
> Lemat added a "Polish Language Ruleset 2" to
> http://wiki.apache.org/spamassassin/CustomRulesets
> I expect that's as far as this will go.  Closing.
> 
> (In reply to comment #4)
> > When 99% of the spam is in English I don't see the problem.
> > or do we want to impose 100 language rulesets on ppl who don't need them?
> 
> I believe the majority of the spam I receive that SA misses is not English.

OUCH! This is NOT "pretty" and should be removed

http://lemat.priv.pl/pliki/sa_body_test_pl.cf

header LEMAT_CHIKOR     eval:check_rbl_txt('chikor.rbl.tld', 'chikor.rbl.tld.')
describe LEMAT_CHIKOR   chikor
score LEMAT_CHIKOR      5

uridnssub       URIBL_TLD2        dynamic.rbl.tld.       A      127.0.0.2
body            URIBL_TLD2        eval:check_uridnsbl('URIBL_TLD2')
describe        URIBL_TLD2        Contains an URL listed in the dynamic.rbl.tld blocklist
tflags          URIBL_TLD2        net
reuse           URIBL_TLD2

uridnssub       URIBL_TLD3        chikor.rbl.tld.       A       127.0.0.3
body            URIBL_TLD3        eval:check_uridnsbl('URIBL_TLD3')
describe        URIBL_TLD3        Contains an URL listed in the chikor.rbl.tld (China) blocklist
tflags          URIBL_TLD3        net
reuse           URIBL_TLD3

uridnssub       URIBL_TLD4        chikor.rbl.tld.       A       127.0.0.4
body            URIBL_TLD4        eval:check_uridnsbl('URIBL_TLD4')
describe        URIBL_TLD4        Contains an URL listed in the chikor.rbl.tld (Korea) blocklist
tflags          URIBL_TLD4        net
reuse           URIBL_TLD4

uridnssub       URIBL_TLD5        chikor.rbl.tld.       A       127.0.0.5
body            URIBL_TLD5        eval:check_uridnsbl('URIBL_TLD5')
describe        URIBL_TLD5        Contains an URL listed in the chikor.rbl.tld (Misc) blocklist
tflags          URIBL_TLD5        net
reuse           URIBL_TLD5

score URIBL_TLD2 2.0
score URIBL_TLD3 4.0
score URIBL_TLD4 4.0
score URIBL_TLD5 10.0

Comment 12 Darxus 2011-10-26 22:38:19 UTC

# Section requires local rbldnsd below with zones from http://lemat.priv.pl/pliki/tld.gz
# Below ¼ odpalonego First section requires the file locally rbldnsd http://lemat.priv.pl/pliki/tld.gz

Huh.

Comment 13 Lemat 2011-10-26 22:41:29 UTC

(In reply to comment #11)

> OUCH! This is NOT "pretty" and should be removed

please elaborate...

Comment 14 Karsten Bräckelmann 2011-10-26 22:54:15 UTC

> > OUCH! This is NOT "pretty" and should be removed
> 
> please elaborate...

(a) It violates SA conventions and best-practices by using ridiculously high scores. In a scoring system like SA, no single rule should score above the default threshold.

(b) The URI DNSBL lookups will fail with this rule-set out of the box, since it requires a local rbldnsd.

I strongly suggest wrapping that part in an "if(0) ... endif" block by default, and having the admin explicitly enable it, IFF the local rbldnsd has been set up. With some additional, verbose explanation.

Comment 15 Lemat 2011-10-27 00:09:18 UTC

OK. Let me explain something using an example.

While testing 419 emails SA accumulates score from rules like: LOTTO_AGENT+MONEY_FRAUD_3+ADVANCE_FEE_3_NEW_MONEY+ADVANCE_FEE_4_NEW+... (many more). And the cumulative score is usually above kill level. And this is exactly what I expect from SA - to kill.

For Polish spam almost none of the standard SA rules will match. Below is an example from the most recent Polish spam:

BAYES_99+SPF_PASS+HTML_MESSAGE+MIME_HTML_ONLY+MISSING_MID+FORGED_OUTLOOK_HTML+LEMAT_27

Therefore if I want the spam in Polish emails to be killed, I have to set the score like the EICAR/GTUBE tests. I have only one bullet (rule) to kill and I want this bullet to kill, not to wound.

If you want me to set SCORE=1 then my rules will be wasting CPU cycles because the cumulative score will be much less than $sa_tag2_level_deflt, not to mention $sa_kill_level_deflt (amavis variables).

I have just commented out the scores and rbl.tld rules. And I believe I gave enough explanation of how the file should be used.

Comment 16 Karsten Bräckelmann 2011-10-27 00:49:55 UTC

(In reply to comment #15)
> OK. Let me explain something using an example.
> 
> While testing 419 emails SA accumulates score from rules like:
> LOTTO_AGENT+MONEY_FRAUD_3+ADVANCE_FEE_3_NEW_MONEY+ADVANCE_FEE_4_NEW+... (many
> more). And the cumulative score is usually above kill level. And this is
> exactly what I expect from SA - to kill.

The operative word here is "cumulative". Many rules, not a single one. Precisely what SA and a scoring system in general is about.
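
To make the "cumulative" point concrete, here is a small Python sketch of additive scoring applied to the rule hits Lemat listed in comment 15. The rule names come from that comment and the threshold of 5 is the SA default mentioned in this thread. Every per-rule score below, including the 3.0 for LEMAT_27, is a made-up illustrative value, not what SA ships, and the helper names are hypothetical.

```python
# Rule hits quoted in comment 15 for a recent Polish spam message.
HITS = [
    "BAYES_99", "SPF_PASS", "HTML_MESSAGE", "MIME_HTML_ONLY",
    "MISSING_MID", "FORGED_OUTLOOK_HTML", "LEMAT_27",
]

# Hypothetical scores, chosen only for illustration.
ILLUSTRATIVE_SCORES = {
    "BAYES_99": 3.5,
    "SPF_PASS": 0.0,
    "HTML_MESSAGE": 0.0,
    "MIME_HTML_ONLY": 0.7,
    "MISSING_MID": 0.5,
    "FORGED_OUTLOOK_HTML": 0.0,
    "LEMAT_27": 3.0,
}

REQUIRED_SCORE = 5.0  # SA default threshold, as stated in this thread


def total_score(hits, scores):
    """Sum the scores of all rules that hit; unknown rules count as 0."""
    return sum(scores.get(name, 0.0) for name in hits)


def is_spam(hits, scores, required=REQUIRED_SCORE):
    """Classify as spam once the cumulative score reaches the threshold."""
    return total_score(hits, scores) >= required


without_lemat = [name for name in HITS if name != "LEMAT_27"]
print(is_spam(without_lemat, ILLUSTRATIVE_SCORES))  # False
print(is_spam(HITS, ILLUSTRATIVE_SCORES))           # True
```

With these assumed numbers a moderate score on the Polish rule is enough to tip the message over the threshold, without any single rule carrying the whole decision.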

> Therefore if I want the spam in Polish emails to be killed, I have to set the
> score like the EICAR/GTUBE tests. I have only one bullet (rule) to kill and I
> want this bullet to kill, not to wound.

GTUBE has a score of 1000 -- for the one reason to counter *any* other rules. It is a *test-point*, not a rule for production to catch spam. Fortunately, you are wrong and did not set the score like for GTUBE.

> If you want me to set SCORE=1 then my rules will be wasting CPU cycles because
> the cumulative score will be much less than $sa_tag2_level_deflt, not to
> mention $sa_kill_level_deflt (amavis variables).

I don't think you understand why amavis even has more than one such level...

And no one told you to set the scores to 1. We told you scores of 5 or even 10 definitely are bad. (Unless deliberately set by the admin.)

Moreover, the most important point was the uridnsbl rules, and its requirement for a local rbldnsd. Especially regarding all your rather strict (read safe) body rules as mentioned in your original report comment 0, IMHO it likely is safe to use a score >1. Though 20 is not.


> I have just commented out the scores and rbl.tld rules. And I believe I gave
> enough explanation of how the file should be used.

Thanks!

Comment 17 Karsten Bräckelmann 2011-10-27 01:18:46 UTC

Lemat, parts of my previous comment 16 may sound harsher than intended, sorry. That was based on my initial misunderstanding of your last paragraph. I did remove my rant and adjusted the other comments, after I checked your custom rule-set again.

Your contribution, especially the commonly used Polish phrases in spam to make it look legit, is much appreciated. And since the audience of bugzilla (and dev@) is rather limited, you might even want to announce your Polish rule-set to the users@ list, providing a link to the wiki page. That should reach more users and admins interested in Polish-specific rules -- and might get you some feedback to refine the rules further.

Comment 18 Darxus 2011-10-27 01:42:00 UTC

I think in the absence of enough other rules to accumulate to push an email over the threshold, it makes plenty of sense to use a blacklist with a single rule that alone is over the threshold.  Not an ideal situation, but if it's your only way to effectively block spam, and you can do it without causing a problematic false positive rate, go for it.


Lemat changing the status of this bug back to "fixed" when posting comment #13 was weird.

Comment 19 Karsten Bräckelmann 2011-10-27 02:05:06 UTC

(In reply to comment #18)
> I think in the absence of enough other rules to accumulate to push an email
> over the threshold, it makes plenty of sense to use a blacklist with a single
> rule that alone is over the threshold.  Not an ideal situation, but if it's
> your only way to effectively block spam, and you can do it without causing a
> problematic false positive rate, go for it.

If $admin does it on his server, and knows the blacklist, sure. If you are publishing rules for others, your responsibilities are much greater.

Also, again, this is about SCORING, thus pushing the score above the SA default threshold of 5. Aiming at 15 or 20 is something different.

classified spam != SMTP reject
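
The distinction between classifying and rejecting can be sketched as a mapping from score to action. This Python sketch assumes the usual amavis arrangement, in which mail at or above $sa_tag2_level_deflt is marked as spam and mail at or above $sa_kill_level_deflt is quarantined or refused. The function name, the action labels, and the levels 5.0 and 15.0 are illustrative assumptions, not values from anyone's configuration.

```python
def amavis_action(score, tag2_level=5.0, kill_level=15.0):
    """Map a cumulative SA score to a coarse handling decision.

    The default levels are hypothetical stand-ins for
    $sa_tag2_level_deflt and $sa_kill_level_deflt.
    """
    if score >= kill_level:
        return "block"  # quarantine or refuse the message
    if score >= tag2_level:
        return "tag"    # deliver, but marked as spam
    return "pass"       # deliver untouched


for score in (2.0, 7.7, 99.0):
    print(score, amavis_action(score))
```

A single rule scored at 99 skips the middle band entirely, so one false positive goes straight to the blocking action instead of merely being tagged.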

> Lemat changing the status of this bug back to "fixed" when posting comment #13
> was weird.

Dunno how that happened, but keeping a bug open in the browser and reloading (without shift) is prone to keep the drop-down boxes' state -- and thus reverting changes with the next comment.