SA Bugzilla – Bug 6674
new rules for polish users
Last modified: 2011-10-27 02:05:06 UTC
In Poland there are spammers who send "an invitation to receive spam". They often refer to our legal system, which disallows UCE but allows UBE (something like the CAN-SPAM Act). These spammers think that including such references in the signature makes the spam less spammy. The following rules catch those references. They have been working for me for ~3 years with 1 false positive. The score of 99 works for me. The ".{1,8}" stands in for the non-ASCII (code > 127) characters ĄąĆćĘęŁłŃńÓóŚśŻżŹź (AaCcEeLlNnOoSsZzZz with diacritics).

body LEMAT_1 /Zgodnie z Ustaw.{1,8}z dnia 26.08.2002 r. \(Dz. U. nr 144, poz.1204\)/i
describe LEMAT_1 Pytanie o zgode na otrzymywanie spamu
score LEMAT_1 99

body LEMAT_2 /wiadomo.{1,8}t.{1,8}wys.{1,8}ali.{1,8}my na og.{1,8}lnodost.{1,8}pny adres e-mailowy/i
describe LEMAT_2 Handlarz adresow email
score LEMAT_2 99

body LEMAT_3 /Je.{1,8}eli nie .{1,8}ycz.{1,8}sobie Pa.{1,8}stwo otrzymywania podobnych informacji/i
describe LEMAT_3 Opt-out spam
score LEMAT_3 99

body LEMAT_5 /zgodnie z Ustaw.{1,8}z dnia 18 lipca 2002 r. o .{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}/i
describe LEMAT_5 Pytanie o zgode na otrzymywanie spamu
score LEMAT_5 99

body LEMAT_6 /Sekcja Informacji Ekonomicznej INSTYTUTU PROMOCJI EKSPORTU I KOOPERACJI bazy danych: FIRMY W POLSCE, FIRMY EUROPY/i
describe LEMAT_6 Spammer IPEIK
score LEMAT_6 99

body LEMAT_7 /\(Dz.\s?U. z 2002r, nr 144 poz. 1204 ustawy z dnia 18 lipca 2002 r.\) oraz dyrektyw.{1,8}UOKiK/i
describe LEMAT_7 Pytanie o zgode na otrzymywanie spamu
score LEMAT_7 99

body LEMAT_8 /Dopuszczalne jest przes.{1,8}anie na adres e-mail pytania.{1,8}czy adresat zgadza si.{1,8}na otrzymywanie drog.{1,8}elektroniczn.{1,8}informacji handlowej/i
describe LEMAT_8 Pytanie o zgode na otrzymywanie spamu
score LEMAT_8 99

body LEMAT_9 /materia.{1,8}y z konferencji nt. Bezpiecze.{1,8}stwa w Sieci Internet, Warszawa 14 marca 2006/i
describe LEMAT_9 Pytanie o zgode na otrzymywanie spamu
score LEMAT_9 99

body LEMAT_10 /Pa.{1,8}stwa adres e-mail pochodzi z og.{1,8}lnie dost.{1,8}pnych .{1,8}r.{1,8}de/i
describe LEMAT_10 Harvester adresow email
score LEMAT_10 99

body LEMAT_26 /Niniejsza wiadomo.{1,8}nie jest informacj.{1,8}handlow.{1,8}, a jedynie zapytaniem o zgod.{1,8}na przesy.{1,8}anie informacji handlowych drog.{1,8}elektroniczn/i
describe LEMAT_26 Pytanie o zgode na otrzymywanie spamu
score LEMAT_26 99

body LEMAT_27 /ustaw.{1,8}z dnia 18 lipca 2002\s*r.{1,8}o.{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.\s?U. z (9 wrze.{1,8}snia )?2002\s*r. Nr 144, poz 1204/i
describe LEMAT_27 Pytanie o zgode na otrzymywanie spamu
score LEMAT_27 99

body LEMAT_29 /Pa.{1,8}stwa dane teleadresowe otrzymali.{1,8}my z bazy HBI Polska sp. z o. o./i
describe LEMAT_29 Paser adresow email
score LEMAT_29 99

body LEMAT_31 /ustaw.{1,8}z dnia 18 lipca 2002\s*r.{1,8}o.{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.\s?U. Nr 144 z 9 wrze.{1,8}snia 2002\s*r., poz 1204/i
describe LEMAT_31 Pytanie o zgode na otrzymywanie spamu
score LEMAT_31 99

body LEMAT_34 /Pa.{1,8}stwa adres mejlowy pobrali.{1,8}my z og.{1,8}lnie dost.{1,8}pnych serwis.{1,8}w internetowych/i
describe LEMAT_34 Harvester adresow email
score LEMAT_34 99

body LEMAT_35 /W zwi.{1,8}zku z art. 10 ustawy z dnia 18 lipca 2002 r. o .{1,8}wiadczeniu us.{1,8}ug drog.{1,8}elektroniczn.{1,8}\(Dz.U. nr 144, poz. 1204\)/i
describe LEMAT_35 Pytanie o zgode na otrzymywanie spamu
score LEMAT_35 99

body LEMAT_36 /Ustaw.{1,8}z dnia 18.07.2002 r. o.{1,8}wiadczeniu Us.{1,8}ug Drog.{1,8}Elektroniczn.{1,8}\(Dz. U. 2002, nr 144, poz. 1204\)/i
describe LEMAT_36 Pytanie o zgode na otrzymywanie spamu
score LEMAT_36 99

The above rules are for Poland only. For other countries they will just waste CPU cycles.
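The idea that ".{1,8}" absorbs a Polish letter whatever its encoding can be tried outside SpamAssassin. Below is a minimal sketch in Python, assuming the pattern is applied to raw bytes. The LEMAT_10 pattern is copied from the rules above; the sample sentences and the helper name `matches` are made up for illustration, and Python's `re` module only approximates how SpamAssassin's Perl regexes see a rendered message body.

```python
import re

# LEMAT_10 from the rules above, compiled as a bytes pattern so it can be
# tried against differently encoded input. re.IGNORECASE mirrors the /i flag.
LEMAT_10 = re.compile(
    rb"Pa.{1,8}stwa adres e-mail pochodzi z og.{1,8}lnie dost.{1,8}pnych .{1,8}r.{1,8}de",
    re.IGNORECASE,
)

# Hypothetical spam sentence written with proper Polish diacritics.
sample = "Państwa adres e-mail pochodzi z ogólnie dostępnych źródeł."

# The same sentence with the diacritics stripped, as some senders write it.
ascii_only = "Panstwa adres e-mail pochodzi z ogolnie dostepnych zrodel."


def matches(text: str, encoding: str) -> bool:
    """Encode the text and report whether the rule pattern matches the bytes."""
    return LEMAT_10.search(text.encode(encoding)) is not None


# One byte per letter (ISO-8859-2, CP1250) or two bytes (UTF-8): each case
# fits inside the 1..8 wildcard window.
for enc in ("utf-8", "iso-8859-2", "cp1250"):
    print(enc, matches(sample, enc))

# A plain ASCII letter also satisfies ".{1,8}", so the stripped text matches too.
print("ascii", matches(ascii_only, "ascii"))
```

The trade-off is that ".{1,8}" accepts any one to eight characters in that position, not only the intended letter, so the wildcard is looser than an explicit character class would be.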
Seems like these should be added to the default rule set? There are lots of language-specific rules in there already, mostly English. Lemat, we could always use more non-English corpora for rule checking and score generation via masscheck: http://wiki.apache.org/spamassassin/NightlyMassCheck
(In reply to comment #1)
> Seems like these should be added to the default rule set? Lots of language
> specific rules in there already, mostly English.

IMO, such specific regional rules should be a separate, OPTIONAL rule set under http://wiki.apache.org/spamassassin/CustomRulesets so anybody who needs them can add them. In the default ruleset they would add little value for most users. Ideally some volunteer would start language-based sa-update channels... ideally. Without a masscheck / Polish corpus the scores stand no chance anyway, so CustomRulesets is the ideal place to go.
So you actually think it's appropriate for spamassassin to only be useful in English? That seems pretty wrong to me. Although it's a good point that without any Polish data coming in through masscheck we couldn't use these in the default set.
(In reply to comment #3)
> So you actually think it's appropriate for spamassassin to only be useful in
> English? That seems pretty wrong to me.

When 99% of the spam is in English I don't see the problem.
or do we want to impose 100 language rulesets on ppl who don't need them?

> Although it's a good point that without any Polish data coming in through
> masscheck we couldn't use these in the default set.

SA is a framework with a basic set of rules. These work for most ppl for pretty decent default spam detection. Those who need more can find rules via third-party sa-update channels or download links in the Wiki. Same applies to obscure RBLs, etc.
1) I can start maintaining these rules as a CustomRuleset. I see that "Polish Language Ruleset" is empty and its Status is "?". I just need to know what I should do.

2) I'm thinking that maybe SA rules could be packaged with country-specific custom rulesets and the postmaster would decide which rulesets are used, something like:

preload_rulesets pl de gr

in local.cf, or maybe SA could detect a language (which is not trivial) and load the appropriate custom ruleset.

I see: Greek and German are active. Romanian is marked as active, but it is empty.

3) Problem with NightlyMassCheck. I like to set up my mail servers to reject all spam before DATA, while still in the SMTP session. Therefore I reject most English spam from zombies, leaving some 419ers and lots of Polish spam, which makes that 99% not exactly true. I have /var/amavis/quarantine with the remaining pile of spam, and the majority of it is Polish spam. I'm not allowed to peek at other users' mailboxes and I do not receive much spam myself. Altogether this makes NightlyMassCheck unusable for me. Therefore my scores are like GTUBE/EICAR.
I concur: A custom ruleset with its own update channel is the way to go for all non-English languages (with very limited exceptions). The most common exception is when English-speaking users are spammed in other languages. As I'm in the U.S., I've seen spam in Spanish, German, Chinese, and Japanese, but not other languages. There already exist Chinese, French, German, Greek, and Japanese SA channels, and a prior Polish ruleset from 2005. Perhaps the custom ruleset web page should be subdivided into two specific sections: Languages other than English, and "other" collections. I also note that we could add "SA channel" and PGP key fields to the list, where available (or leave that for the "more information" link). http://wiki.apache.org/spamassassin/ContributingNewRules should also point to the custom ruleset page. It doesn't.
(In reply to comment #5)
> 1) I can start maintaining these rules as CustomRuleset, I see that "Polish
> Language Ruleset" is empty and Status is "?". I just need to know what shall I
> do.

That would probably be great. At the top of the wiki page, click "Login", then "you can create one now" to create a wiki account, then email your wiki username to dev@spamassassin.apache.org to request write access.

The Polish language rule set is here: http://svn.apache.org/repos/asf/spamassassin/branches/3.1/rules/25_body_tests_pl.cf

It looks like the download link was broken by somebody disabling the ability to attach files to the wiki. It might be best to just copy the entire contents of that 25_body_tests_pl.cf file onto that wiki page and start editing from there. Let us know if you need any help.

> 2) I'm thinking that maybe SA rules could be packaged with country-specific
> customrulesets and the postmaster would decide which rulesets are used,
> something like:
>
> preload_rulesets pl de gr
> in local.cf

Sounds nice to me.

> 3) Problem with NightlyMassCheck. I do like to setup my mailservers to reject
> all spam before DATA while in smtp session. Therefore I reject most english
> spam from zombies leaving some 419ers and lots of polish spam - making those
> 99% not-exactly-true. I have /var/amavis/quarantine with the remaining pile of
> spam and there is majority of polish spam. I'm not allowed to peek at other
> users mailboxes and I do not receive much spam myself. All together makes
> NightlyMassCheck unusable for me.

I believe your situation is not uncommon for people who are contributing via masscheck. Some people only provide non-spam. So your data would still be useful to us.
Just so others don't need to dig up these links: http://wiki.apache.org/spamassassin/CustomRulesets lists a "Polish Language Ruleset" at http://wiki.apache.org/spamassassin/BodyTestsPl .

(In reply to comment #6)
> Perhaps the custom ruleset web page should be subdivided into two specific
> sections: Languages other than English, and "other" collections. I also note

Looks like it. Go for it.
(In reply to comment #5)
> 1) I can start maintaining these rules as CustomRuleset, I see that "Polish
> Language Ruleset" is empty and Status is "?". I just need to know what shall I
> do.

Segregating rulesets by language is generally a bad idea because it limits visibility (FPs get minimized and ignored) and it becomes impossible to maintain. There is nothing wrong with this approach if it is not part of the main project, say as an sa-update channel.

> or maybe SA could detect a language (which is not trivial) and load
> appropriate customruleset.
>
> I see: Greek, German are active. Romanian is marked as active, but it is
> empty.

Language detection with TextCat is awful. It's better than nothing, but it is frequently wrong.

> 2) I'm thinking that maybe SA rules could be packaged with country-specific
> customrulesets and the postmaster would decide which rulesets are used,
> something like:
>
> preload_rulesets pl de gr
> in local.cf

I believe language-specific rulesets are already possible in SA via locale support (though note you can currently only have one locale). Though I've never tried it, you can conceivably write rules like this:

lang pl body PL_FOO /\btawerna\b/i

PL_FOO would then only be run if the system locale is Polish. This is currently only used for "describe" lines.

However, I'd rather see this implemented as channels. If we wanted to get more specific, I'd say the channels should be vetted through mass-check (as my channels are), so that rules good enough to be mainstream can be automatically promoted. It should be noted that the current ruleqa system with its current corpora is not at all set up to properly evaluate rule efficacy for Polish-language mail and would do an awful job.
Lemat added a "Polish Language Ruleset 2" to http://wiki.apache.org/spamassassin/CustomRulesets
I expect that's as far as this will go. Closing.

(In reply to comment #4)
> When 99% of the spam is in English I don't see the problem.
> or do we want to impose 100 language rulesets on ppl who don't need them?

I believe the majority of the spam I receive that SA misses is not English.
(In reply to comment #10)
> Lemat added a "Polish Language Ruleset 2" to
> http://wiki.apache.org/spamassassin/CustomRulesets
> I expect that's as far as this will go. Closing.
>
> (In reply to comment #4)
> > When 99% of the spam is in English I don't see the problem.
> > or do we want to impose 100 language rulesets on ppl who don't need them?
>
> I believe the majority of the spam I receive that SA misses is not English.

OUCH! This is NOT "pretty" and should be removed.

http://lemat.priv.pl/pliki/sa_body_test_pl.cf

header LEMAT_CHIKOR eval:check_rbl_txt('chikor.rbl.tld', 'chikor.rbl.tld.')
describe LEMAT_CHIKOR chikor
score LEMAT_CHIKOR 5

uridnssub URIBL_TLD2 dynamic.rbl.tld. A 127.0.0.2
body URIBL_TLD2 eval:check_uridnsbl('URIBL_TLD2')
describe URIBL_TLD2 Contains an URL listed in the dynamic.rbl.tld blocklist
tflags URIBL_TLD2 net
reuse URIBL_TLD2

uridnssub URIBL_TLD3 chikor.rbl.tld. A 127.0.0.3
body URIBL_TLD3 eval:check_uridnsbl('URIBL_TLD3')
describe URIBL_TLD3 Contains an URL listed in the chikor.rbl.tld (China) blocklist
tflags URIBL_TLD3 net
reuse URIBL_TLD3

uridnssub URIBL_TLD4 chikor.rbl.tld. A 127.0.0.4
body URIBL_TLD4 eval:check_uridnsbl('URIBL_TLD4')
describe URIBL_TLD4 Contains an URL listed in the chikor.rbl.tld (Korea) blocklist
tflags URIBL_TLD4 net
reuse URIBL_TLD4

uridnssub URIBL_TLD5 chikor.rbl.tld. A 127.0.0.5
body URIBL_TLD5 eval:check_uridnsbl('URIBL_TLD5')
describe URIBL_TLD5 Contains an URL listed in the chikor.rbl.tld (Misc) blocklist
tflags URIBL_TLD5 net
reuse URIBL_TLD5

score URIBL_TLD2 2.0
score URIBL_TLD3 4.0
score URIBL_TLD4 4.0
score URIBL_TLD5 10.0

# The section below requires a locally running rbldnsd with zones from http://lemat.priv.pl/pliki/tld.gz

The first section requires a local rbldnsd with the file http://lemat.priv.pl/pliki/tld.gz

Huh.
(In reply to comment #11)
> OUCH! This is NOT "pretty" and should be removed

please elaborate...
> > OUCH! This is NOT "pretty" and should be removed > > please elaborate... (a) It violates SA conventions and best-practices by using ridiculously high scores. In a scoring system like SA, no single rule should score above the default threshold. (b) The URI DNSBL lookups will fail with this rule-set out of the box, since it requires a local rbldnsd. I strongly suggest to wrap that part in an "if(0) ... endif" block by default, and have the admin explicitly enable it, IFF the local rbldnsd has been set up. With some additional, verbose explanation.
OK. Let me explain with an example.

While testing 419 emails, SA accumulates score from rules like:
LOTTO_AGENT+MONEY_FRAUD_3+ADVANCE_FEE_3_NEW_MONEY+ADVANCE_FEE_4_NEW+... (many more). The cumulative score is usually above the kill level, and this is exactly what I expect from SA: to kill.

For Polish spam almost none of the standard SA rules will match. Below is an example from the most recent Polish spam:
BAYES_99+SPF_PASS+HTML_MESSAGE+MIME_HTML_ONLY+MISSING_MID+FORGED_OUTLOOK_HTML+LEMAT_27

Therefore, if I want the spam in Polish emails to be killed, I have to set the score like the EICAR/GTUBE tests. I have only one bullet (rule) to kill and I want this bullet to kill, not to wound.

If you want me to set SCORE=1 then my rules will be wasting CPU cycles, because the cumulative score will be much less than $sa_tag2_level_deflt, not to mention $sa_kill_level_deflt (amavis variables).

I have just commented out the scores and the rbl.tld rules. And I believe I gave enough explanation of how the file should be used.
(In reply to comment #15)
> ok. Let me explain something using example.
>
> While testing 419 emails SA accumulates score from rules like:
> LOTTO_AGENT+MONEY_FRAUD_3+ADVANCE_FEE_3_NEW_MONEY+ADVANCE_FEE_4_NEW+... (many
> more). And the cumulative score is usually above kill level. And this is
> exactly what I expect from SA - to kill.

The operative word here is "cumulative". Many rules, not a single one. Precisely what SA, and a scoring system in general, is about.

> Therefore if I want the spam in polish emails to be killed - I have to set the
> score like EICAR/GTUBE tests. I have only one bullet (rule) to kill and I want
> this bullet to kill, not to wound.

GTUBE has a score of 1000 -- for the one reason to counter *any* other rules. It is a *test-point*, not a rule for production to catch spam. Fortunately, you are wrong and did not set the score like for GTUBE.

> If you want me to set SCORE=1 then my rules will be wasting CPU cycles because
> cumulative score will be much less than $sa_tag2_level_deflt not to mention
> $sa_kill_level_deflt (amavis variables)

I don't think you understand why amavis even has more than one such level... And no one told you to set the scores to 1. We told you scores of 5 or even 10 definitely are bad. (Unless deliberately set by the admin.) Moreover, the most important point was the uridnsbl rules, and their requirement for a local rbldnsd.

Especially regarding all your rather strict (read safe) body rules as mentioned in your original report comment 0, IMHO it likely is safe to use a score >1. Though 20 is not.

> I have just comented out the scores and rbl.tld rules. And I believe I gave
> enough explanation how the file should be used.

Thanks!
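The "cumulative" argument comes down to simple arithmetic, which the following sketch spells out in Python. The rule names are the ones quoted in this thread, but every per-rule score in the table is a made-up placeholder (real values depend on the SpamAssassin version and the local configuration), and `total_score` and `score_needed` are hypothetical helper names. Only the threshold of 5 is taken from the discussion.

```python
# Placeholder scores for illustration only; they are NOT the scores shipped
# with SpamAssassin. Only the rule names come from the thread.
HYPOTHETICAL_SCORES = {
    "BAYES_99": 2.0,
    "SPF_PASS": 0.0,
    "HTML_MESSAGE": 0.0,
    "MIME_HTML_ONLY": 0.5,
    "MISSING_MID": 0.5,
    "FORGED_OUTLOOK_HTML": 0.0,
}

REQUIRED_SCORE = 5.0  # the SA default threshold mentioned in the thread


def total_score(hits, scores):
    """Sum the scores of all rules that hit; unknown rules count as 0."""
    return sum(scores.get(rule, 0.0) for rule in hits)


def score_needed(hits, scores, threshold=REQUIRED_SCORE):
    """Points one extra rule must add for the message to reach the threshold."""
    return max(0.0, threshold - total_score(hits, scores))


# Hits from the Polish spam example, leaving out the custom LEMAT_27 rule.
other_hits = list(HYPOTHETICAL_SCORES)
base = total_score(other_hits, HYPOTHETICAL_SCORES)
needed = score_needed(other_hits, HYPOTHETICAL_SCORES)
print(f"score from the other rules: {base}")
print(f"LEMAT_27 must add at least: {needed}")

# With a score of 99 the single rule decides the outcome on its own,
# whatever the other rules say.
with_99 = dict(HYPOTHETICAL_SCORES, LEMAT_27=99.0)
print(total_score(other_hits + ["LEMAT_27"], with_99) >= REQUIRED_SCORE)
```

The placeholder numbers prove nothing about real mail. Substituting the scores from an actual spam report shows how much a single custom rule really has to contribute on a given installation.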
Lemat, parts of my previous comment 16 may sound harsher than intended, sorry. That was based on my initial misunderstanding of your last paragraph. I did remove my rant and adjusted the other comments after I checked your custom rule-set again. Your contribution, especially the commonly used Polish phrases in spam to make it look legit, is much appreciated. And since the audience of bugzilla (and dev@) is rather limited, you might even want to announce your Polish rule-set to the users@ list, providing a link to the wiki page. That should reach more users and admins interested in Polish-specific rules -- and might get you some feedback to refine the rules further.
I think in the absence of enough other rules to accumulate to push an email over the threshold, it makes plenty of sense to use a blacklist with a single rule that alone is over the threshold. Not an ideal situation, but if it's your only way to effectively block spam, and you can do it without causing a problematic false positive rate, go for it. Lemat changing the status of this bug back to "fixed" when posting comment #13 was weird.
(In reply to comment #18)
> I think in the absence of enough other rules to accumulate to push an email
> over the threshold, it makes plenty of sense to use a blacklist with a single
> rule that alone is over the threshold. Not an ideal situation, but if it's
> your only way to effectively block spam, and you can do it without causing a
> problematic false positive rate, go for it.

If $admin does it on his server, and knows the blacklist, sure. If you are publishing rules for others, your responsibilities are much greater.

Also, again, this is about SCORING, thus pushing the score above the SA default threshold of 5. Aiming at 15 or 20 is something different.

classified spam != SMTP reject

> Lemat changing the status of this bug back to "fixed" when posting comment #13
> was weird.

Dunno how that happened, but keeping a bug open in the browser and reloading (without shift) is prone to keep the drop-down boxes' state -- and thus reverting changes with the next comment.
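The distinction "classified spam != SMTP reject" matches the two separate amavis levels named earlier in the thread ($sa_tag2_level_deflt and $sa_kill_level_deflt). Here is a minimal Python sketch of that two-threshold idea. The numeric levels and the function name `disposition` are assumptions for illustration, not amavis defaults, and the returned labels simplify what an actual amavis policy may do.

```python
def disposition(score, tag2_level=5.0, kill_level=15.0):
    """Map a cumulative spam score onto a coarse action.

    tag2_level and kill_level stand in for amavis' $sa_tag2_level_deflt and
    $sa_kill_level_deflt; the default values used here are hypothetical.
    """
    if kill_level < tag2_level:
        raise ValueError("kill_level should not be below tag2_level")
    if score >= kill_level:
        return "block"         # e.g. quarantine or reject, depending on policy
    if score >= tag2_level:
        return "mark as spam"  # still delivered, but flagged for the recipient
    return "deliver"


for s in (1.0, 7.5, 20.0, 99.0):
    print(s, disposition(s))
```

Under these assumptions, a rule scored at 99 jumps past both levels at once, so a false positive is blocked outright. A rule scored a little above the lower level leaves room for the message to be delivered with a spam marking.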