4078 – ok_locales not working on windows-* charsets

Summary: ok_locales not working on windows-* charsets

Status:	NEW

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries
Version:	SVN Trunk (Latest Devel Version)
Hardware:	All All

Importance:	P5 enhancement
Target Milestone:	Future
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Duplicates (3):	4794 5188 6684
Depends on:
Blocks:

Reported:	2005-01-14 21:29 UTC by Matthew Cline
Modified:	2015-04-13 21:30 UTC
CC List:	7 users

Attachment	Type	Actions	Submitter/CLA Status
Hebrew subject that isn't detected	text/plain	None	Matthew Cline
Russian example	text/plain	None	Darxus
Cyrillic message text using UTF-8 encoding	text/plain	None	Martin Gregorie

Description Matthew Cline 2005-01-14 21:29:33 UTC

The following is rendered as Hebrew by Kmail, yet doesn't trigger the
CHARSET_FARAWAY_HEADER rule:

> Subject: =?Windows-1255?B?+OX25CDs5On06Pgg7uTh8unl+iD57Oo/?=
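To make the report concrete, the Subject value can be unpacked with a short Python sketch (Python is used here only for illustration; SpamAssassin itself is Perl). It relies on the standard library's `email.header.decode_header`; the helper name `describe_encoded_subject` is hypothetical.

```python
from email.header import decode_header

# Subject value copied from the report above.
RAW_SUBJECT = "=?Windows-1255?B?+OX25CDs5On06Pgg7uTh8unl+iD57Oo/?="


def describe_encoded_subject(raw):
    """Return (charset, total_bytes, high_bit_bytes, decoded_text).

    Only the first encoded-word of the header is inspected, which is
    enough for the single-word Subject in this report.
    """
    payload, charset = decode_header(raw)[0]
    high = sum(1 for b in payload if b >= 0x80)  # bytes outside 7-bit ASCII
    return charset, len(payload), high, payload.decode(charset)


charset, total, high, text = describe_encoded_subject(RAW_SUBJECT)
print(charset, total, high)
```

For this Subject, 20 of the 24 decoded bytes are high-bit, and every decoded character other than the spaces and the final question mark is a Hebrew letter.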

Comment 1 Bob Menschel 2005-04-07 23:15:37 UTC

Matthew, can you attach a full email we can use as a sample? It'll make it
easier for the devs to develop a fix and validate that fix.  Thanks.

Comment 2 Bob Menschel 2005-05-24 22:50:48 UTC

In Locales.pm, all Windows-* character sets are considered "always OK (the net
speaks mostly roman charsets)", even though Windows-1255 is not a roman charset. 

Should exceptions be tested for?

Comment 3 Bob Menschel 2005-07-02 23:28:16 UTC

Triage: No response to my query. Setting 3.2 milestone, since my comment #2
should at least be discussed for that update.

Comment 4 Matthew Cline 2005-07-03 00:09:14 UTC

Created attachment 2976
Hebrew subject that isn't detected

Comment 5 Matthew Cline 2005-07-03 00:10:13 UTC

Oops, sorry, I missed your first request.  Yes, it looks like it's a
Windows-1255 encoding:

Subject:
=?Windows-1255?B?6Oz05e8g4+ni6ejs6SDg7Ofl6Okg4SAxNzkg+SLnIOvl7Owg7vns5ecgICA=?=

Comment 6 Bob Menschel 2005-07-03 19:37:26 UTC

The change might be as simple as something like

Index: lib/Mail/SpamAssassin/Locales.pm
===================================================================
--- lib/Mail/SpamAssassin/Locales.pm    (revision 209004)
+++ lib/Mail/SpamAssassin/Locales.pm    (working copy)
@@ -88,7 +88,7 @@
   return 1 if ($cs =~ /^UTF/);
   return 1 if ($cs =~ /^UCS/);
   return 1 if ($cs =~ /^CP125/);
-  return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
+  return 1 if ($cs =~ /^WINDOWS/) and ($cs !~ /^WINDOWS-1255/);      # argh, Windows
   return 1 if ($cs eq 'IBM852');
   return 1 if ($cs =~ /^UNICODE11UTF[78]/);    # wtf? never heard of it
   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit

though this didn't work for me on the sample email submitted. There are probably
other non-Latin WINDOWS-* encodings that should be tested for...
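A rough Python sketch of the same idea, extended to the other non-Latin Windows code pages (874 Thai, 1251 Cyrillic, 1253 Greek, 1255 Hebrew, 1256 Arabic). It only mirrors the lines visible in the diff and is not the real Locales.pm logic; the names `normalize_charset` and `charset_always_ok` are hypothetical. One assumption is worth flagging: the comparison against 'XUNKNOWN' in the diff suggests that punctuation has already been stripped from the charset name by the time these checks run. If that is the case, a pattern containing a hyphen such as /^WINDOWS-1255/ could never match, which might explain why the patch had no effect. The sketch therefore matches the hyphen-free form.

```python
import re

# Assumption: names are uppercased and stripped of punctuation before the
# checks, so "Windows-1255" is compared as "WINDOWS1255".
NON_LATIN_CODE_PAGE = re.compile(r"^(WINDOWS|CP)(874|1251|1253|1255|1256)$")


def normalize_charset(name):
    """Uppercase the name and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def charset_always_ok(name):
    """Mirror the whitelist lines from the diff, minus non-Latin code pages."""
    cs = normalize_charset(name)
    if NON_LATIN_CODE_PAGE.match(cs):
        return False  # no free pass: the high-bit half is not a Latin alphabet
    if cs.startswith(("UTF", "UCS", "CP125", "WINDOWS")):
        return True
    return cs in ("IBM852", "XUNKNOWN")
```

With this shape, "Windows-1252" and "UTF-8" keep their free pass while "Windows-1255" and "CP1251" fall through to whatever locale check follows.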

Comment 7 Sidney Markowitz 2006-12-10 23:10:41 UTC

*** Bug 5188 has been marked as a duplicate of this bug. ***

Comment 8 Sidney Markowitz 2006-12-10 23:23:41 UTC

See the comments in bug 5188 which has been closed as a duplicate of this one.

The problem is that there are a number of character sets that have the Roman
alphabet as the 0x20 to 0x7e ASCII characters and some other language in the
high-bit characters. Anyone with a Hebrew Windows machine, for example, is
likely to send all mail, including English, in the Windows-1255 charset. That's
why every charset that begins with "WINDOWS" is whitelisted. As I said in bug
5188 comment 6:

"I'm leaning towards having the charset-faraway test for bodies not give a free
pass to the non-Latin Windows, ISO, and CP125 charsets, since there is already a
test for the majority of characters in the body being high-bit which will allow
through Roman alphabet emails in those charsets. Doing that would require a
change to keep the free pass for the charset-faraway-header test."

In other words, keep the current test to be used for charset-faraway-header, and
use a different sub for the test used for charset-faraway for the body that does
not whitelist non-Latin WINDOWS, ISO, and CP125 charsets. The existing code will
still let the charset be accepted as Roman as long as the body has a majority of
7-bit characters.
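The split described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the SpamAssassin implementation: the function names, the list of non-Latin charsets, and the "more than half of the non-whitespace bytes" threshold are all chosen here for the example, and the ok_locales lookup itself is left out.

```python
# Assumed list of charsets whose high-bit half is a non-Latin alphabet.
NON_LATIN_CHARSETS = {
    "WINDOWS874", "WINDOWS1251", "WINDOWS1253", "WINDOWS1255", "WINDOWS1256",
    "CP874", "CP1251", "CP1253", "CP1255", "CP1256",
    "ISO88595", "ISO88596", "ISO88597", "ISO88598", "ISO885911",
}


def normalize(charset):
    """Uppercase the charset name and drop punctuation."""
    return "".join(ch for ch in charset.upper() if ch.isalnum())


def header_gets_free_pass(charset):
    """Header test: keep today's behaviour, every WINDOWS*/CP125* name passes."""
    return normalize(charset).startswith(("WINDOWS", "CP125"))


def high_bit_majority(body):
    """True when more than half of the non-whitespace bytes are >= 0x80."""
    significant = [b for b in body if b not in b" \t\r\n"]
    high = sum(1 for b in significant if b >= 0x80)
    return high * 2 > len(significant)


def body_is_faraway(charset, body):
    """Body test: a non-Latin charset counts only if the text is mostly high-bit."""
    return normalize(charset) in NON_LATIN_CHARSETS and high_bit_majority(body)
```

An English body labelled Windows-1255 stays acceptable because it fails the high-bit test, while a body made mostly of high-bit bytes in the same charset is reported.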

Comment 9 Justin Mason 2006-12-11 02:01:28 UTC

Sidney -- sounds good to me.

Comment 10 Justin Mason 2007-01-14 06:29:46 UTC

this is another bug that we either need to resolve before mass-checks, or defer
until 3.3.0, btw.

Comment 11 robert 2007-01-14 06:50:27 UTC

Still don't see the point when I've set

ok_locales en th it

why should I ever be interested in Hebrew?

Secondly, nearly all the Hebrew mail I get like this is image-only spam, so relying on body rules
to filter it out will not work.

(In reply to comment #10)
> this is another bug that we either need to resolve before mass-checks, or defer
> until 3.3.0, btw.

Comment 12 Sidney Markowitz 2007-01-14 07:54:25 UTC

Robert, consider someone whose native language is Hebrew sending you an email in
English from their Windows machine. Their mail client would likely be configured
to use Windows-1255, but the mail itself would be in English, using the standard
ASCII characters. They can do that because Windows-1255, like all Windows-*
character sets, uses the standard ASCII characters in the lower 7 bit range.

The problem is how to distinguish a Hebrew language email from an English email
that happens to use the Windows-1255 charset. We already have a test for more
than half of the characters in the body being non-ASCII, which I think would be
a good differentiator when combined with the test for the charset.

Comment 13 Justin Mason 2007-01-25 10:30:22 UTC

ok, mass-checks have started -- shifting this off to 3.3.0.

Comment 14 Sidney Markowitz 2007-02-17 02:45:17 UTC

*** Bug 4794 has been marked as a duplicate of this bug. ***

Comment 15 Sidney Markowitz 2007-02-17 04:03:23 UTC

I just noticed that my comment 12 doesn't really address comment 11, which does
need an answer. This bug was opened for the example attached in comment 4, in
which only the From and Subject headers contain hibit characters from the
Windows-1255 charset; the body specifies Windows-1255 but contains only Roman
alphabet characters.

It seems to me that we should not reject mail because the From header is in some
foreign charset. You would get that from someone whose native language is, for
example, Hebrew who is sending email in English.

To catch a Hebrew Subject, we would have to add a test to the check of locale in
HeaderEval.pm for WINDOWS-1255 and a majority of the characters in the Subject
header being hibit. I think the test for locale in HeaderEval.pm should not test
the From header as it currently does.
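As a sketch of what such a Subject-only test might do, the Python below decodes the encoded-words with the standard library's `email.header.decode_header` and counts high-bit bytes in the parts labelled with a charset of interest. The function name, the default charset tuple, and the simple "more than half" threshold are assumptions made for illustration; HeaderEval.pm is not consulted.

```python
from email.header import decode_header


def subject_is_mostly_high_bit(raw_subject, charsets=("windows-1255",)):
    """True when the encoded-words in the given charsets are mostly high-bit.

    Unencoded parts and parts in other charsets are ignored, and the From
    header is deliberately never looked at.
    """
    total = high = 0
    for payload, charset in decode_header(raw_subject):
        if charset is None or charset.lower() not in charsets:
            continue
        total += len(payload)
        high += sum(1 for b in payload if b >= 0x80)
    return total > 0 and high * 2 > total


# Subject from comment 5, and an English word encoded in the same charset.
HEBREW = ("=?Windows-1255?B?6Oz05e8g4+ni6ejs6SDg7Ofl6Okg4SAxNzkg"
          "+SLnIOvl7Owg7vns5ecgICA=?=")
ENGLISH = "=?Windows-1255?B?SGVsbG8=?="

print(subject_is_mostly_high_bit(HEBREW), subject_is_mostly_high_bit(ENGLISH))
```

The Subject from comment 5 is reported, while an all-ASCII Subject that merely carries the Windows-1255 label is not.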

There is also a test for locale in HTMLEval.pm. There also we aren't catching
WINDOWS-* charsets such as WINDOWS-1255 used for Hebrew. But to check for hibit
characters we would have to test against the text portion of the HTML.
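Testing against the text portion of the HTML means stripping the markup first, since tags and attributes are always ASCII and would dilute the count. A minimal Python sketch using the standard `html.parser` module follows; the class and function names are hypothetical, and counting non-ASCII characters after decoding is used here as a stand-in for counting high-bit bytes, which is equivalent for single-byte charsets such as Windows-1255.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_text_non_ascii_ratio(html_bytes, charset):
    """Fraction of non-whitespace text characters that are outside ASCII."""
    parser = _TextCollector()
    parser.feed(html_bytes.decode(charset, errors="replace"))
    parser.close()
    text = "".join("".join(parser.chunks).split())  # drop all whitespace
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 0x7F) / len(text)
```

A ratio above 0.5 would correspond to the "majority hibit" condition discussed for plain-text bodies.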

I think that the changes I suggested in comment 8 are pretty safe, but they only
help to catch non-Roman languages in plain text bodies. The changes for headers
and HTML I think will have to be tested against corpora to see how they perform.

Comment 16 Philip Prindeville 2007-02-17 11:19:36 UTC

(In reply to comment #12)
> Robert, consider someone whose native language is Hebrew sending you an email in
> English from their Windows machine. Their mail client would likely be configured
> to use Windows-1255, but the mail itself would be in English, using the standard
> ASCII characters. They can do that because Windows-1255, like all Windows-*
> character sets, uses the standard ASCII characters in the lower 7 bit range.

I would reject it anyway.

The rules for the designation of the MIME charset say that the smallest
inclusive charset is what the message should be designated as, regardless of the
default locale.  In this case, if the message was written in Windows-1255 but
used only the ASCII code page, then it should be tagged as US-ASCII.

If you stuck to a promotion order of US-ASCII => ISO-8859-1 => UTF-8, in fact, you
wouldn't have this issue.
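The promotion order can be sketched on the sending side as "try each charset in turn and label the message with the first one that can represent the text". The Python below is an illustration only; the function name is hypothetical and the candidate list is the three-step order from this comment.

```python
def smallest_inclusive_charset(text,
                               candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first candidate charset able to encode the whole text."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # text needs a larger charset; try the next one
        return name
    raise ValueError("no candidate charset can represent this text")


print(smallest_inclusive_charset("Hello"))           # pure ASCII
print(smallest_inclusive_charset("caf\u00e9"))       # needs Latin-1
print(smallest_inclusive_charset("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew letters
```

Under this scheme an English message would never be labelled Windows-1255, and Hebrew text would be labelled UTF-8.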

And adding more work-arounds to support useless charsets that add no new value
or functionality, and only create more problems, doesn't encourage the vendors to
fix things.

The fault is at the sending side, not the receiver's.

Rejecting a message marked Windows-1255 as the wrong language type when all
you're expecting is English is the correct action.

Comment 17 Sidney Markowitz 2007-02-17 12:01:48 UTC

> The rules for the designation of the MIME charset say

If only it were so easy. SpamAssassin has to separate spam from non-spam, which
may or may not be the same as separating rule-conformant messages from those
that are not conformant to some particular rule.

That's why I said that we are getting into issues that have to be decided by
looking at some corpora. Unfortunately that may not be all that easy. We would
really need some statistics that include mail from people and companies who
natively use charsets such as Hebrew and Cyrillic Windows* or CP125* sending
mail in English to find out what typical mail clients really do with them. For
our purposes rules are less important than common practices.

Comment 18 Justin Mason 2010-01-27 02:21:11 UTC

moving most remaining 3.3.0 bugs to 3.3.1 milestone

Comment 19 Justin Mason 2010-01-27 03:16:46 UTC

reassigning, too

Comment 20 Justin Mason 2010-03-23 16:34:01 UTC

moving all open 3.3.1 bugs to 3.3.2

Comment 21 Karsten Bräckelmann 2010-03-23 17:43:03 UTC

Moving back off of Security, which got changed by accident during the mass Target Milestone move.

Comment 22 Karsten Bräckelmann 2011-10-27 17:23:42 UTC

*** Bug 6684 has been marked as a duplicate of this bug. ***

Comment 23 Darxus 2011-10-27 17:30:21 UTC

(I believe this is the summary Karsten intended to change.  Was previously "Windows Hebrew encoding in subject not detected".)

Comment 24 Darxus 2011-10-27 17:30:56 UTC

Created attachment 4988
Russian example

Comment 25 Darxus 2011-10-27 18:48:08 UTC

These rules catch the attached examples:

header RUSSIAN_SUBJECT Subject:raw =~ /=\?windows-1251\?/
header HEBREW_SUBJECT Subject:raw =~ /=\?windows-1255\?/

The problem, as explained in comment #8, is that it's possible to send legitimate emails with entirely English subjects using these encodings.  

It would be nice if, in the absence of evidence that mail clients fail to comply with the requirement to use the smallest inclusive character set, we could assume they comply, and assume all emails with these encodings are not English.
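The trade-off can be seen in a small Python stand-in for the two rules above (the real rules are SpamAssassin header rules; the names below are hypothetical). Two assumptions are made: the match is case-insensitive, because RFC 2047 treats charset names in encoded-words as case-independent and the Subject in the original description spells it "Windows-1255", whereas the rules as written would only match the lowercase spelling; and the English test Subject is invented for the example.

```python
import re

# Matches the charset token of an encoded-word for the two code pages above.
ENCODED_WORD_CHARSET = re.compile(r"=\?(windows-125[15])\?", re.IGNORECASE)


def flagged_charset(raw_subject):
    """Return the matched charset name in lowercase, or None if no match."""
    match = ENCODED_WORD_CHARSET.search(raw_subject)
    return match.group(1).lower() if match else None


# Hebrew Subject from the description, and "Hello" encoded as windows-1251.
print(flagged_charset("=?Windows-1255?B?+OX25CDs5On06Pgg7uTh8unl+iD57Oo/?="))
print(flagged_charset("=?windows-1251?B?SGVsbG8=?="))
```

Both Subjects are flagged, although the second one decodes to plain English. A test on the charset label alone cannot tell the two cases apart.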

Comment 26 Darxus 2011-10-27 18:54:35 UTC

header ARABIC_SUBJECT Subject:raw =~ /=\?windows-1256\?/

Comment 27 Kevin A. McGrail 2011-10-27 18:56:10 UTC

(In reply to comment #25)
> These rules catch the attached examples:
> 
> header RUSSIAN_SUBJECT Subject:raw =~ /=\?windows-1251\?/
> header HEBREW_SUBJECT Subject:raw =~ /=\?windows-1255\?/
> 
> The problem, as explained in comment #8, is that it's possible to send
> legitimate emails with entirely English subjects using these encodings.  
> 
> It would be nice if, in the absence of evidence that mail clients fail to
> comply with the requirement to use the smallest inclusive character set, we
> could assume they comply, and assume all emails with these encodings are not
> English.

I researched this in 2009 and used something similar in milter code:

if ($Subject =~ /=\?(koi8-r|Windows-1251)\?/i) {
  print "DEBUG: Cyrillic Test 1\n";
}

Unfortunately, I found that I had customers who do things like write in Greek occasionally and things like that get slammed. 

Regards,
KAM

Comment 28 Darxus 2011-10-27 19:20:51 UTC

(In reply to comment #27)
> Unfortunately, I found that I had customers who do things like write in Greek
> occasionally and things like that get slammed. 

Damn.  Good info to have though.

How about creating something like rules that detect these character sets after decoding, enabled via ok_locales?

Stuff like:

header RUSSIAN_SUBJECT Subject =~ /[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ]{2}/i

Characters taken from http://en.wikipedia.org/wiki/Russian_alphabet
That Cyrillic "А" actually won't match an English "A".
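A Python sketch of the same check on an already decoded Subject, for illustration only. It assumes the text has been decoded to Unicode characters, and it uses a character class over the Russian alphabet (U+0410 to U+044F plus U+0401 and U+0451) so that any two consecutive Cyrillic letters match. How SpamAssassin presents decoded headers to rule regexes is not covered here.

```python
import re

# Two consecutive letters from the Russian alphabet, upper or lower case.
RUSSIAN_RUN = re.compile("[\u0410-\u044f\u0401\u0451]{2}")


def has_russian_run(decoded_subject):
    """True if the decoded Subject contains two adjacent Russian letters."""
    return RUSSIAN_RUN.search(decoded_subject) is not None


# Cyrillic capital A (U+0410) and Latin capital A are different characters.
print("\u0410" == "A")
print(has_russian_run("\u041f\u0440\u0438\u0432\u0435\u0442"))
print(has_russian_run("ABBA reunion"))
```

Requiring a run of two letters also means a single Cyrillic look-alike dropped into an otherwise Latin word does not match.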

Comment 29 Martin Gregorie 2011-12-15 19:11:41 UTC

Created attachment 5021
Cyrillic message text using UTF-8 encoding

Comment 30 Martin Gregorie 2011-12-15 19:13:10 UTC

Any proposed solution must also be capable of determining the language from Unicode codepoints in body text or Subject headers. I've been getting Cyrillic text messages using UTF-8 encoding on both body and subject (example attached).
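One charset-independent way to approach this is to decode first and then classify letters by Unicode block. The Python sketch below is only an illustration: the function name, the choice of blocks, and the 0.5 threshold are assumptions, and mapping a detected script onto ok_locales language codes is left open.

```python
from collections import Counter

# Basic Unicode blocks for a few non-Latin scripts.
SCRIPT_RANGES = {
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hebrew": (0x0590, 0x05FF),
    "Arabic": (0x0600, 0x06FF),
    "Thai": (0x0E00, 0x0E7F),
}


def dominant_script(text, threshold=0.5):
    """Return the script covering more than `threshold` of the letters, else None."""
    letters = [ch for ch in text if ch.isalpha()]
    counts = Counter()
    for ch in letters:
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= ord(ch) <= high:
                counts[name] += 1
                break
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    return name if hits / len(letters) > threshold else None


# The same Russian word arrives as different bytes but decodes identically.
WORD = "\u041f\u0440\u0438\u0432\u0435\u0442"
print(dominant_script(WORD.encode("utf-8").decode("utf-8")))
print(dominant_script(WORD.encode("cp1251").decode("cp1251")))
```

Because the classification runs on decoded characters, the UTF-8 and Windows-1251 versions of a message give the same answer.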

Comment 31 Kevin A. McGrail 2015-04-13 21:30:36 UTC

Short of someone stepping forward with a patch, this is likely to languish, unfortunately.  Pushing to future and marking as an enhancement.