SA Bugzilla – Bug 4078
ok_locales not working on windows-* charsets
Last modified: 2015-04-13 21:30:36 UTC
The following is rendered as Hebrew by KMail, yet doesn't trigger the CHARSET_FARAWAY_HEADER rule:
> Subject: =?Windows-1255?B?+OX25CDs5On06Pgg7uTh8unl+iD57Oo/?=
Matthew, can you attach a full email we can use as a sample? It'll make it easier for the devs to develop a fix and validate that fix. Thanks.
In Locales.pm, all Windows-* character sets are considered "always OK (the net speaks mostly roman charsets)", even though Windows-1255 is not a roman charset. Should exceptions be tested for?
Triage: No response to my query. Setting 3.2 milestone, since my comment #2 should at least be discussed for that update.
Created attachment 2976 [details] Hebrew subject that isn't detected
Oops, sorry, I missed your first request. Yes, it looks like it's a Windows-1255 encoding: Subject: =?Windows-1255?B?6Oz05e8g4+ni6ejs6SDg7Ofl6Okg4SAxNzkg+SLnIOvl7Owg7vns5ecgICA=?=
The change might be as simple as something like:

Index: lib/Mail/SpamAssassin/Locales.pm
===================================================================
--- lib/Mail/SpamAssassin/Locales.pm (revision 209004)
+++ lib/Mail/SpamAssassin/Locales.pm (working copy)
@@ -88,7 +88,7 @@
   return 1 if ($cs =~ /^UTF/);
   return 1 if ($cs =~ /^UCS/);
   return 1 if ($cs =~ /^CP125/);
-  return 1 if ($cs =~ /^WINDOWS/); # argh, Windows
+  return 1 if ($cs =~ /^WINDOWS/) and ($cs !~ /^WINDOWS-1255/); # argh, Windows
   return 1 if ($cs eq 'IBM852');
   return 1 if ($cs =~ /^UNICODE11UTF[78]/); # wtf? never heard of it
   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit

though this didn't work for me on the sample email submitted. There are probably other non-Latin WINDOWS-* encodings that should be tested for...
*** Bug 5188 has been marked as a duplicate of this bug. ***
See the comments in bug 5188, which has been closed as a duplicate of this one.

The problem is that a number of character sets have the Roman alphabet as the 0x20 to 0x7e ASCII characters and some other language in the high-bit characters. Anyone with a Hebrew Windows machine, for example, is likely to send all mail, including English, in the Windows-1255 charset. That's why every charset that begins with "WINDOWS" is whitelisted.

As I said in bug 5188 comment 6: "I'm leaning towards having the charset-faraway test for bodies not give a free pass to the non-Latin Windows, ISO, and CP125 charsets, since there is already a test for the majority of characters in the body being high-bit which will allow through Roman alphabet emails in those charsets. Doing that would require a change to keep the free pass for the charset-faraway-header test."

In other words, keep the current test for charset-faraway-header, and use a different sub for the body's charset-faraway test, one that does not whitelist non-Latin WINDOWS, ISO, and CP125 charsets. The existing code will still let the charset be accepted as Roman as long as the body has a majority of 7-bit characters.
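The split proposed above could look roughly like the following. This is a minimal Python sketch, not SpamAssassin code: the function names, the 50% threshold, and the list of non-Latin charsets are assumptions made for illustration, and the list is not meant to be exhaustive.

```python
# Illustrative sketch only; names and the charset list are hypothetical,
# not taken from Locales.pm.

# Charsets whose high-bit half holds a non-Latin script
# (Cyrillic, Greek, Hebrew, Arabic). Deliberately incomplete.
NON_LATIN_CHARSETS = frozenset({
    "WINDOWS-1251", "WINDOWS-1253", "WINDOWS-1255", "WINDOWS-1256",
    "CP1251", "CP1253", "CP1255", "CP1256",
    "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8",
})


def header_charset_ok(charset: str) -> bool:
    """Header test: keep the free pass for every WINDOWS-* and CP125* charset."""
    cs = charset.upper()
    return cs.startswith(("WINDOWS", "CP125"))


def high_bit_majority(data: bytes) -> bool:
    """True when more than half of the bytes have the high bit set."""
    if not data:
        return False
    high = sum(1 for b in data if b >= 0x80)
    return high * 2 > len(data)


def body_is_faraway(charset: str, body: bytes) -> bool:
    """Body test: no free pass for non-Latin charsets, but only flag the
    message when the body is also mostly high-bit characters."""
    return charset.upper() in NON_LATIN_CHARSETS and high_bit_majority(body)
```

With this shape, an English body tagged Windows-1255 is not flagged because it is almost entirely 7-bit, while a mostly high-bit body in the same charset is.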
Sidney -- sounds good to me.
this is another bug that we either need to resolve before mass-checks, or defer until 3.3.0, btw.
Still don't see the point: when I've set "ok_locales en th it", why should I ever be interested in Hebrew? Secondly, nearly all the Hebrew mail I get like this is image-only spam, so relying on body rules to filter it out will not work.

(In reply to comment #10)
> this is another bug that we either need to resolve before mass-checks, or defer
> until 3.3.0, btw.
Robert, consider someone whose native language is Hebrew sending you an email in English from their Windows machine. Their mail client would likely be configured to use Windows-1255, but the mail itself would be in English, using the standard ASCII characters. They can do that because Windows-1255, like all Windows-* character sets, uses the standard ASCII characters in the lower 7-bit range. The problem is how to distinguish a Hebrew-language email from an English email that happens to use the Windows-1255 charset. We already have a test for more than half of the characters in the body being non-ASCII, which I think would be a good differentiator when combined with the test for the charset.
ok, mass-checks have started -- shifting this off to 3.3.0.
*** Bug 4794 has been marked as a duplicate of this bug. ***
I just noticed that my comment 12 doesn't really address comment 11, which really does need to be answered given that this bug was opened for the example attached in comment 4, in which only the From and Subject headers contain hibit characters from the Windows-1255 charset, with the body specifying Windows-1255 but only containing Roman alphabet characters.

It seems to me that we should not reject mail because the From header is in some foreign charset. You would get that from someone whose native language is, for example, Hebrew who is sending email in English.

To catch a Hebrew Subject, we would have to add a test to the check of locale in HeaderEval.pm for WINDOWS-1255 and a majority of the characters in the Subject header being hibit. I think the test for locale in HeaderEval.pm should not test the From header as it currently does.

There is also a test for locale in HTMLEval.pm. There, too, we aren't catching WINDOWS-* charsets such as WINDOWS-1255 used for Hebrew. But to check for hibit characters we would have to test against the text portion of the HTML.

I think that the changes I suggested in comment 8 are pretty safe, but they only help to catch non-Roman languages in plain text bodies. The changes for headers and HTML will, I think, have to be tested against corpora to see how they perform.
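As a rough illustration of the Subject test described above (charset is WINDOWS-1255 and a majority of the decoded Subject bytes are hibit), here is a Python sketch using the standard library's email.header.decode_header. It is not HeaderEval.pm code; the function names and the 50% threshold are assumptions.

```python
from email.header import decode_header


def subject_high_bit_ratio(raw_subject: str):
    """Return (set of declared charsets, fraction of high-bit bytes)
    for a raw, still RFC 2047-encoded Subject value."""
    total = 0
    high = 0
    charsets = set()
    for chunk, charset in decode_header(raw_subject):
        # decode_header yields str when the header has no encoded-words
        # at all, and bytes otherwise; normalise to bytes for counting.
        if isinstance(chunk, str):
            chunk = chunk.encode("utf-8", "replace")
        if charset:
            charsets.add(charset.lower())
        total += len(chunk)
        high += sum(1 for b in chunk if b >= 0x80)
    return charsets, (high / total if total else 0.0)


def subject_is_hebrew_1255(raw_subject: str) -> bool:
    """Hypothetical test: Windows-1255 declared and mostly high-bit bytes."""
    charsets, ratio = subject_high_bit_ratio(raw_subject)
    return "windows-1255" in charsets and ratio > 0.5
```

Applied to the Subject from the original report, subject_is_hebrew_1255 returns True, while an all-ASCII Subject that merely declares Windows-1255 (for example =?Windows-1255?Q?Hello_world?=) is left alone. How such a test performs on real mail would still need to be checked against corpora.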
(In reply to comment #12)
> Robert, consider someone whose native language is Hebrew sending you an email in
> English from their Windows machine. Their mail client would likely be configured
> to use Windows-1255, but the mail itself would be in English, using the standard
> ASCII characters. They can do that because Windows-1255, like all Windows-*
> character sets, uses the standard ASCII characters in the lower 7 bit range.

I would reject it anyway. The rules for the designation of the MIME charset say that the smallest inclusive charset is what the message should be designated as, regardless of the default locale. In this case, if the message was written in Windows-1255 but used only the ASCII code page, then it should be tagged as USASCII.

If you stuck to a promotion order of USASCII => ISO-8859-1 => UTF8, in fact, you wouldn't have this issue. And adding more work-arounds to support useless charsets that add no new value or functionality, and only create more problems, doesn't encourage the vendors to fix things.

The fault is at the sending side, not the receiver's. Rejecting a message marked Windows-1255 as the wrong language type when all you're expecting is English is the correct action.
> The rules for the designation of the MIME charset say

If only it were so easy. SpamAssassin has to separate spam from non-spam, which may or may not be the same as separating rule-conformant messages from those that are not conformant to some particular rule. That's why I said that we are getting into issues that have to be decided by looking at some corpora. Unfortunately that may not be all that easy. We would really need some statistics that include mail from people and companies who natively use charsets such as Hebrew and Cyrillic Windows* or CP125* sending mail in English to find out what typical mail clients really do with them. For our purposes rules are less important than common practices.
moving most remaining 3.3.0 bugs to 3.3.1 milestone
reassigning, too
moving all open 3.3.1 bugs to 3.3.2
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
*** Bug 6684 has been marked as a duplicate of this bug. ***
(I believe this is the summary Karsten intended to change. Was previously "Windows Hebrew encoding in subject not detected".)
Created attachment 4988 [details] Russian example
These rules catch the attached examples:

header RUSSIAN_SUBJECT Subject:raw =~ /=\?windows-1251\?/
header HEBREW_SUBJECT Subject:raw =~ /=\?windows-1255\?/

The problem, as explained in comment #8, is that it's possible to send legitimate emails with entirely English subjects using these encodings.

It would be nice if, in the absence of evidence that mail clients fail to comply with the requirement to use the smallest inclusive character set, we could assume they comply, and assume all emails with these encodings are not English.
header ARABIC_SUBJECT Subject:raw =~ /=\?windows-1256\?/
(In reply to comment #25)
> These rules catch the attached examples:
>
> header RUSSIAN_SUBJECT Subject:raw =~ /=\?windows-1251\?/
> header HEBREW_SUBJECT Subject:raw =~ /=\?windows-1255\?/
>
> The problem, as explained in comment #8, is that it's possible to send
> legitimate emails with entirely English subjects using these encodings.
>
> It would be nice if, in the absence of evidence that mail clients fail to
> comply with the requirement to use the smallest inclusive character set, we
> could assume they comply, and assume all emails with these encodings are not
> English.

I researched this in 2009 and used something similar in milter code:

if ($Subject =~ /=\?(koi8-r|Windows-1251)\?/i) {
  print "DEBUG: Cyrillic Test 1\n";
}

Unfortunately, I found that I had customers who do things like write in Greek occasionally and things like that get slammed.

Regards,
KAM
(In reply to comment #27)
> Unfortunately, I found that I had customers who do things like write in Greek
> occasionally and things like that get slammed.

Damn. Good info to have though.

How about creating something like rules that detect these character sets after decoding, enabled via ok_locales? Stuff like:

header RUSSIAN_SUBJECT Subject =~ /[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ]{2}/i

(The letters need to be in a character class; grouped in parentheses, the pattern would only match the whole alphabet written out twice in a row.)

Characters taken from http://en.wikipedia.org/wiki/Russian_alphabet

That Cyrillic "А" actually won't match an English "A".
Created attachment 5021 [details] Cyrillic message text using UTF-8 encoding
Any proposed solution must also be capable of determining the language from Unicode codepoints in body text or Subject headers: I've been getting Cyrillic text messages using UTF-8 encoding on both body and subject (example attached).
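One way to work from codepoints rather than declared charsets is sketched below in Python. It is only a heuristic and not a proposed SpamAssassin implementation: the standard library has no Unicode Script property, so the sketch takes the first word of each letter's Unicode character name (CYRILLIC, HEBREW, LATIN, ...) as a rough script tag. The function name is hypothetical.

```python
import unicodedata


def dominant_script(text: str):
    """Return the most common script tag among the letters of already
    decoded text, or None if the text contains no letters.

    The tag is the first word of the Unicode character name, e.g.
    "CYRILLIC CAPITAL LETTER A" gives "CYRILLIC". This is an
    approximation of the real Script property.
    """
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if not name:
            continue  # unnamed codepoint, nothing to classify
        script = name.split(" ", 1)[0]
        counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)
```

Because it looks at codepoints, the same check applies whether the text arrived as Windows-1251, KOI8-R or UTF-8, and it distinguishes Cyrillic "А" (U+0410) from Latin "A" (U+0041). The result could then be compared against the languages listed in ok_locales.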
Short of someone stepping forward with a patch, this is unfortunately likely to languish. Pushing to future and marking as an enhancement.