SA Bugzilla – Bug 4636
Charset normalization plugin support
Last modified: 2007-10-24 09:12:58 UTC
Add support for a charset normalization plugin, per the thread "Preliminary design proposal for charset normalization support in SpamAssassin" on the dev list.
Created attachment 3192 [details]
Work in progress

Here is the code so far. The problem I'm running into is that the spamassassin script calls $spamtest->parse(), which in turn invokes call_plugins to normalize headers. This invocation of call_plugins happens before the config has been read.
Looks good IMO. I think we can assume that it is safe to read the configuration data at parse()-time. The configuration specification data is set when the Mail::SpamAssassin object is created; I don't see why it cannot then be read when $spamtest->parse() is called, instead of waiting for a later method such as ->check(). (BTW, I think Theo has something against the Message object holding a pointer to $main for some reason; or at least, he did in the past.)
I would appreciate anyone objecting to having the Node object keep a pointer to $main say what they suggest as an alternative. Node::header() is a public interface, used by plugins. It decodes MIME encoded-words, removing the charset labels in the process. The possible alternatives are:

1) Call the plugin from within Node::header(), using a pointer to $main kept by the Node object.

2) Call the plugin from within Node::header(), using a pointer to $main passed in by the caller. This would be an incompatible change to the public Node::header() interface.

3) Change the public Node::header() interface to return the raw form of the header, requiring all callers that want the decoded form to call a newly defined header-decoding method on some object that has a pointer to $main.

4) Require the charset normalization code to be hardwired into SpamAssassin, not in a plugin.
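For reference, the decoding step under discussion can be illustrated outside Perl. A minimal Python sketch (using the stdlib email.header module, purely for illustration) shows that an RFC 2047 encoded-word carries its charset label right up until the decode; it is this label that Node::header() discards:

```python
from email.header import decode_header

# An RFC 2047 encoded-word for the Japanese text "あい" in iso-2022-jp.
# decode_header() returns (bytes, charset) pairs, so the charset label
# is still available at this point; discarding it after decoding (as
# Node::header() does) is what makes later normalization awkward.
raw, charset = decode_header("=?iso-2022-jp?B?GyRCJCIkJBsoQg==?=")[0]
print(charset)                # -> iso-2022-jp
print(raw.decode(charset))    # -> あい
```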
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:11:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> I would appreciate anyone objecting to having the Node object keep a pointer to
> $main say what they suggest as an alternative.

Before we can discuss anything, it needs to be clear what you're trying to do. The appropriate discussion ought to happen in the ticket, and there are no details here so far.

> Node::header() is a public interface, used by plugins. It decodes MIME
> encoded-words, removing the charset labels in the process. The possible
> alternatives are:

Ok. What's the issue with that?

> 3) Change the public Node::header() interface to return the raw form of the
> header, requiring all callers that want the decoded form to call a newly defined
> header-decoding method off of some object that has a pointer to $main.

FWIW, it's possible to request the raw version already.

> 4) Require the charset normalization code be hardwired into SpamAssassin, not in
> a plugin.

If it's something we want to standardly do that isn't going to be swapped out or disabled (which I don't think it would be), this would be the way to go imo.
Pasting from my original message from the dev list. Since subsequent discussion indicated that rawbody rules should not charset-normalize the text, I've since found the body normalization best plugs into Node::rendered(). I suspect most Western installations would not want to pay the cost of charset normalization, so would want it disabled.

The following is a preliminary proposal for how to add support for normalization of charsets into Perl's Unicode support. The primary reason I want to do this work is to improve the ability of SpamAssassin to discriminate between Japanese ham and Japanese spam.

SpamAssassin currently ignores charset information, effectively assuming all mail is in iso-8859-1. This works for users whose ham is encoded in iso-8859-1, and mostly works for users whose ham is encoded in other single-byte charsets. For East Asian languages, this is insufficient for doing text analysis. Since a large number of SpamAssassin users are likely to be uninterested in East Asian ham and thus unlikely to want to pay the cost of charset normalization, the normalization support needs to be optional, defaulting to off.

Some messages contain unlabeled charsets; others use MIME charset labels. Some MIME charset labels are not useful (e.g. "unknown-8bit"). To handle such unlabeled data, it is necessary to run a charset detector over the text in order to determine what to convert it from. Encode::Guess effectively requires the caller to specify the language of the text, so I consider it too simplistic. Better would be Mozilla's universal charset detector, which I would have to wrap up as a CPAN module. It is common for Korean messages to have an incorrect MIME label of "iso-8859-1", so it may be necessary to run a charset detector even over MIME-labeled charsets.

After the charset has been determined, either from the MIME label or the charset detector, the data needs to be converted from that charset to Perl's internal utf8 form. Encode::decode() is the obvious choice for this, though I can see reasons why an installation might want to be able to replace the charset converters with some other implementation.

The following functions, immediately after they call Mail::SpamAssassin::Message::Node::decode, need to call a function that does charset normalization:

* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array

Furthermore:

* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header

also need to call a function to do charset normalization: _decode_header for unlabeled charset data, __decode_header for MIME encoded-words.

This new charset normalization function will take as arguments the text and any MIME charset label. The function calls the charset detector and converter as necessary and returns the normalized text in Perl's internal form. The returned text will only have the utf8 flag set if the input charset was not us-ascii or iso-8859-1. This new charset normalization function should most likely use a plugin callback to do all the work, though it only makes sense for one loaded plugin to implement the callback. If no plugin implements the callback, then it should simply return the input text, preserving the current behavior.
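The shape of the proposed normalization function can be sketched roughly as follows (in Python rather than Perl, purely for illustration; detect_charset is a hypothetical stand-in for the Mozilla universal charset detector mentioned above, and the fallback behaviour is an assumption, not the committed implementation):

```python
def detect_charset(raw):
    # Trivial stand-in for a real statistical detector such as
    # Mozilla's universal charset detector (Encode::Detect on CPAN).
    if raw.startswith(b"\x1b$B"):        # iso-2022-jp shift sequence
        return "iso-2022-jp"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

def normalize(raw, mime_charset):
    # Trust a usable MIME label; otherwise run the detector.
    charset = mime_charset
    if charset is None or charset.lower() == "unknown-8bit":
        charset = detect_charset(raw)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Preserve the pre-patch behaviour: assume iso-8859-1.
        return raw.decode("iso-8859-1")

print(normalize(b'\x1b$B$"$$\x1b(B', None))   # detected as iso-2022-jp
print(normalize(b"caf\xe9", "iso-8859-1"))    # labeled latin-1
```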
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:31:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> indicated that rawbody rules should not charset-normalize the text, I've since
> found the body normalization best plugs into Node::rendered(). I suspect most
> Western installations would not want to pay the cost of charset normalization,
> so would want it disabled.

Earlier in the ticket you were talking about header normalization. Body normalization is a different beast (but it's easier to deal with imo).

> The following is a preliminary proposal for how to add support for
> normalization of charsets into Perl's Unicode support. The primary
> reason I want to do this work is to improve the ability of
> SpamAssassin to discriminate between Japanese ham and Japanese spam.

It's worth noting that this is actually going to be a much larger issue than just having a plugin, btw. The main problem is that SpamAssassin very specifically disables unicode in every module via "use bytes" (according to the svn log it looks like it was added in at r3997 back in December 2002).

> Since a large number of SpamAssassin users are likely to be
> uninterested in East Asian ham and thus unlikely to want to pay the
> cost of charset normalization, the normalization support needs to be
> optional, defaulting to off.
>
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
>
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array

I was thinking that the plugin would be called by check_start, then get an array of parts via find_parts(), then do any manipulation of the data as required per-part (either dealing with the decoded or the rendered portions, or both). Since find_parts() returns references to the actual parts in the tree, you can just modify as necessary without jumping through a lot of hoops. Then later, when those other functions get called, everything would already be normalized.

Potentially, there'd be a new function in Message like "clear_rendered_cache" or something which would delete the cached forms of text_rendered, text_visible_rendered, text_invisible_rendered, and (if necessary, via a different function) text_decoded. That way you would be sure that the normalized data is what's used after the process occurs, even if those other functions were called by something else previously.

> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
>
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.

I can't think of an easy way to do this other than to do the work in Node itself, or to have a plugin do something similar for the headers as suggested above with the body and manipulate the internal data directly. It's not very clean from an OO perspective. Arguably we'd always want to make sure the message is in utf-8 format internally, and so the code could just be in Message::Node.
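The clear_rendered_cache idea can be sketched as follows (a hypothetical Python mock-up, not the actual SpamAssassin classes; the names mirror the Perl attributes discussed above but are otherwise assumptions):

```python
class Part:
    """A MIME part with a memoized rendered form."""
    def __init__(self, decoded):
        self.decoded = decoded
        self._rendered = None            # cached rendered text

    def rendered(self):
        if self._rendered is None:
            # Stand-in for real rendering (HTML stripping etc.).
            self._rendered = self.decoded.strip()
        return self._rendered

class Message:
    def __init__(self, parts):
        self.parts = parts

    def find_parts(self):
        return self.parts                # references to actual parts

    def clear_rendered_cache(self):
        # Drop memoized forms so the next rendered() call re-renders
        # the (possibly plugin-modified) decoded text.
        for p in self.parts:
            p._rendered = None

msg = Message([Part("  some text  ")])
part = msg.find_parts()[0]
print(part.rendered())                   # caches the rendered form
part.decoded = "  normalized text  "     # a plugin rewrites the part
print(part.rendered())                   # still the stale cached form
msg.clear_rendered_cache()
print(part.rendered())                   # re-rendered from new data
```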
(In reply to comment #6)

> Earlier in the ticket you were talking about header normalization. Body
> normalization is a different beast (but it's easier to deal with imo).

They are both components of this enhancement. The header argument is easier to make, though Node::rendered() has a similar argument, since charset normalization has to happen before it feeds the text into HTML::Parser.

> It's worth noting that this is actually going to be a much larger issue
> than just having a plugin, btw. The main problem is that SpamAssassin
> very specifically disables unicode in every module via "use bytes"
> (according to the svn log it looks like it was added in at r3997 back
> in December 2002).

I got rid of a bunch of those in r315047. I should audit the remaining ones.

> I was thinking that the plugin would be called by check_start, then
> get an array of parts via find_parts(), then do any manipulation of
> the data as required per-part (either dealing with the decoded or the
> rendered portions, or both).

Something like this would make sense if normalization/rendering were done outside Message::Node. It would be harder to do lazy normalization of headers that way.

> Potentially, there'd be a new function in Message like
> "clear_rendered_cache" or something which would delete the cached forms
> of text_rendered, text_visible_rendered, text_invisible_rendered, and
> (if necessary/different function) text_decoded.

If the cache got filled before the normalization plugin got invoked, that would indicate a bug in the order of execution.

> It's not very clean from an OO perspective. Arguably we'd always want
> to make sure the message is in utf-8 format internally, and so the code
> could just be in Message::Node.

There's no clean OO separation between data and view, but that was preexisting--Message::Node already knows almost everything about SpamAssassin's view of MIME entities.

There is the issue of whether charset normalization should be:

A) a plugin
B) hardcoded, but enabled/disabled by config
C) hardcoded, always on for sufficiently recent versions of Perl

Once I get a charset normalizer hooked up, I can get some numbers on how much it costs. I was operating under the assumption that it would be too expensive to enable for everybody. I was thinking that having a plugin allowed people more flexibility in tuning the normalization process, but perhaps that's not strictly necessary.
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 02:53:29PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> > Potentially, there'd be a new function in Message like
> > "clear_rendered_cache" or something which would delete the cached forms
> > of text_rendered, text_visible_rendered, text_invisible_rendered, and
> > (if necessary/different function) text_decoded.
>
> If the cache got filled before the normalization plugin got invoked, that would
> indicate a bug in the order of execution.

Not really. What if I have another plugin that gets called at the same time or earlier because it wants to do something, whereby it populates the cache?

> There's no clean OO separation between data and view, but that was
> preexisting--Message::Node already knows almost everything about SpamAssassin's
> view of MIME entities.

I'm not sure I follow here. Message and Message::Node are wholly separate from the rest of the code, with the exception of a handful of calls into Util. What does the SA code know about the message that doesn't come from Message or Message::Node, which then those hook into?

> I was thinking that having a plugin allowed people more flexibility in tuning
> the normalization process, but perhaps that's not strictly necessary.

I would imagine that there wouldn't be a lot of overhead if there's not a lot of normalization going on. If incoming messages aren't in any special charset, I would assume a negligible time penalty to processing. As I recall, we added the "use bytes" stuff everywhere because there were large performance issues with REs against Unicode data, but that was almost three years ago so the details escape me.
Subject: Re: Charset normalization plugin support

> As I recall, we added the "use bytes" stuff everywhere because there were
> large performance issues with RE against Unicode data, but that was almost
> three years ago so the details escape me.

There were performance issues, and in addition we had made the decision to treat the message data as 7-bit ASCII where possible, so it made sense at the time. There was also a good deal of cargo-culting, in that "use bytes" would hopefully just avoid any unicode-related issues entirely. ;) IMO it now makes sense to try to deal with Unicode more sensibly, as John's plan outlines.
(In reply to comment #8)

> Not really. What if I have another plugin that gets called at the same time
> or earlier because it wants to do something, whereby it populates the cache?

Since that other plugin should have gotten data normalized by the charset plugin but did not, that would definitely be a bug.

> I'm not sure I follow here. Message and Message::Node are wholly separate
> from the rest of the code, with the exception of a handful of calls into Util.
> What does the SA code know about the message that doesn't come from Message or
> Message::Node, which then those hook into?

It's the other way around. Message and Message::Node have detailed knowledge of how SpamAssassin wants to view/render messages for matching. To have SpamAssassin's rendering of messages be configurable and/or controllable by plugins, those objects necessarily need a reference to the configuration/plugin state.

> I would imagine that there wouldn't be a lot of overhead if there's not a lot
> of normalization going on. If incoming messages aren't in any special
> charset, I would assume a negligible time penalty to processing.

Charset detection is likely to have to be run on everything, regardless of whether or not the messages are in a special charset.
Created attachment 3213 [details]
Work in progress

Here's a version that normalizes whenever Perl is a sufficiently new version and the Encode::Detect module is installed. An initial run shows a performance cost of 270%. Optimization to follow.
Committed revision 366594. Performance tests on a US feed show a 10% performance penalty when doing charset normalization.
This commit seems to have broken make test. I have installed Encode::Detect 0.01 from CPAN.

t/uri_text..................NOK 8# Failed test 8 in t/uri_text.t at line 47 fail #8
failure: did not find /^http://foo23500.com$/
t/uri_text..................FAILED test 8
Failed 1/104 tests, 99.04% okay

The pattern that is now broken is:

  http://foo23500.com{ESC}(B/    ^http://foo23500.com$

Is this expected?
I just found this patch today and am testing it right now. It seems to work well for iso-2022-jp, shift-jis, and utf-8 text bodies.

> http://foo23500.com{ESC}(B/ ^http://foo23500.com$

This test stream is not good, because this pattern never occurs in iso-2022-jp. In iso-2022-jp, Japanese characters are ALWAYS enclosed by "{ESC}$B" and "{ESC}(B", as in "{ESC}$B$"$${ESC}(B" (where "{ESC}" is 0x1b). The test pattern contains only the trailing escape sequence. The _normalize function cannot handle this well, and it is not the responsibility of this function or of Encode.

In addition (though this is not essential to the _normalize function), the header should be changed. Typically, the Content-Type for a Japanese iso-2022-jp text body is

  Content-Type: text/plain; charset=iso-2022-jp

Actually, some spam ignores MIME headers, so we should test against miscellaneous wrong headers, including missing MIME headers.
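The escape-sequence structure described above can be confirmed with any iso-2022-jp codec; for instance, a quick Python check:

```python
# iso-2022-jp shifts into JIS X 0208 with ESC $ B and back to ASCII
# with ESC ( B, so Japanese text is always bracketed by both sequences.
# A bare trailing "{ESC}(B" with no opening shift (as in the broken
# test pattern) never occurs in well-formed iso-2022-jp.
encoded = "あい".encode("iso-2022-jp")
print(encoded)                            # b'\x1b$B$"$$\x1b(B'
assert encoded.startswith(b"\x1b$B")
assert encoded.endswith(b"\x1b(B")
```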
(In reply to comment #14)

> Actually, some spams ignore MIME header, so we should test against miscellaneous
> wrong header including no MIME headers.

A significant portion of Asian spam has incorrect MIME charset labels--that is why _normalize prefers the detected charset over the labeled charset for Asian charsets. That test failure doesn't surprise me--I suspect with Encode::Detect the URL decodes to http://foo23500.com/
There are other test failures, too; recreate.t is failing without the Encode::Detect module installed. See the buildbot output.
Some tests fail because a UTF-8 encoded character can contain the byte \xa0, which is treated as whitespace in Message.pm, Message/Node.pm, et al. For example (bytes in octal):

  "Joho" (in Japanese): \346 \203 \205 \345 \240 \261
  "Shirouto": \347 \264 \240 \344 \272 \272
  "Shouko": \350 \250 \274 \346 \213 \240

Any idea how to solve this problem?
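The byte values quoted above can be verified directly; a quick Python check shows that each of these words carries 0xa0 as a UTF-8 continuation byte, which byte-oriented code mistakes for a latin-1 non-breaking space:

```python
# 情報 (joho), 素人 (shirouto), 証拠 (shouko): the UTF-8 encoding of
# each contains the byte 0xa0 as a continuation byte, so any code that
# treats \xa0 as whitespace splits these words mid-character.
for word in ("情報", "素人", "証拠"):
    data = word.encode("utf-8")
    print(word, " ".join(oct(b) for b in data), "contains 0xa0:", 0xa0 in data)
```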
(In reply to comment #16) > there's other test failures, too; recreate.t is failing, without the > Encode::Detect module installed. see the buildbot output. Works for me. I can't see how my patch could possibly prevent sandbox-hstern.pm from loading.
(In reply to comment #13)

> t/uri_text..................NOK 8# Failed test 8 in t/uri_text.t at line 47 fail #8

Fixed in revision 367351.

(In reply to comment #17)

> Some test failes because UTF-8 character can contain \xa0 code but
> it is considered as whitespace at Message.pm, Message/Node.pm et al.

Could you be more specific as to the test that is failing? I'm not seeing any failures. Perhaps you have a version of HTML::Parser older than 3.46?
Require HTML::Parser 3.46 or higher to enable charset normalization. Committed revision 367361.
I've commented in the past that I'm opposed to the idea of character set normalization and that this functionality would need to be isolated with options and/or a plugin interface. My reasoning is as follows:

- there is a performance penalty associated with character set recoding
- spam patterns are generally encoded in a limited number of character sets
- therefore, catch rates do not increase with recoding (if anything, they are quite likely to decrease due to spam tricks causing us to pick the wrong character set)
- however, ham catch rates will INCREASE since the amount of ham matching the pattern is likely to increase (matches being accidental)
- so, S/O is likely to go down for multi-character-set rules (more often than not) and performance will go down as well

For these reasons, I am -1 (that is, vetoing) the current form of this code that has the performance loss and requires recoding. I would also be -1 on requiring any non-utf8 rules to be utf8.

Basically, SpamAssassin does need better understanding of character sets and the ability to support more character sets better, for rules, descriptions, rendering, and tokenization, but I see no benefit to recoding messages, especially since anti-spam patterns are written against a small subset of possible encodings.
Created attachment 3315 [details]
jismail.eml

This file contains an iso-2022-jp encoded e-mail for testing.

Created attachment 3316 [details]
utftest-subset.cf

This file contains a subset of my rules. This file is UTF-8 encoded.

Created attachment 3317 [details]
utftest-full.cf

This file contains the full set of Japanese spam words I currently have. UTF-8 encoded.
(In reply to comment #19)

> Could you be more specific as to the test that is failing? I'm not seeing any
> failures. Perhaps you have a version of HTML::Parser older than 3.46?

I am sorry that I did not describe the method I used. I tested the patch with my own ruleset and a sample e-mail file, which I have attached. utftest-full.cf is long, so please use utftest-subset.cf as the ruleset and test with jismail.eml. My HTML::Parser is 3.36, but "\xa0" is part of the UTF-8 encoded data itself, so I do not think the version of HTML::Parser is the problem here.
(In reply to comment #21)

> For these reasons, I am -1 (that is, vetoing) the current form of this code
> that has the performance loss and requires recoding.

If the Encode::Detect module is not installed, there is negligible performance loss and no change of behavior from before the patch (other than the bug fix to include text/plain content in the "visible rendered" form). What do you mean by "requires recoding"? I wasn't aware this change required anything. Is your point that there needs to be some way to disable charset normalization in the case where Encode::Detect is installed?

> I would also be -1 on requiring any non-utf8 rules to be utf8.

I don't understand what you mean by this.

> Basically, SpamAssassin does need better understanding of character set and
> ability to support more character sets better, for rules, descriptions,
> rendering, and tokenization, but I see no benefit to recoding messages,
> especially since anti-spam patterns are written against a small subset of
> possible encodings.

By that argument, there is no need for "body" or "rawbody" rules. There is a small subset of possible transfer encodings and the like. Motoharu-san has given clear, convincing evidence of the value of charset normalization. Are you saying that you have contradictory experience in the Japanese environment?
(In reply to comment #25)

> My HTML::Parser is 3.36, but "\xa0" is a part of UTF-8 encoded data itself,
> so I do not think the version of HTML::Parser is problem here.

HTML::Parser versions prior to 3.46 have a known bug which triggers when they are fed UTF-8 input containing \xa0 bytes. Please try with a recent version of HTML::Parser.
> If the Encode::Detect module is not installed, there is negligible performance
> loss and no change of behavior from before the patch

FWIW, it's not good to have that level of differing behaviour based on whether a dependency module is installed or not; I'd prefer to have another way of turning this functionality on or off. (Whether that's manually enabled in config, or inferred from the mail traffic or environment, is another question.)
(In reply to comment #28)

> FWIW, it's not good to have that level of differing behaviour, based on whether
> a dependency module is installed or not; I'd prefer to have another way of
> turning this functionality on or off. (whether that's manually enabled in
> config, or inferred from the mail traffic or environment is another question.)

It may sound odd, but an option is an option. That is, an option exists to reflect the user's preference, the user's choice. For those who live in Asian countries, multi-byte handling is not special. Charset normalization should be transparent to the user if the requirements are met, so I hope we do not create an option for this operation. I do think there should be an option if we support UTF-8 aware regexps, because there would be a significant performance loss; that would be a user's choice. Charset normalization is not a user's choice for Japanese and other Asian people. If $normalize_supported in Node.pm were evaluated very many times, it would be a problem, but if it is evaluated only once or a few times, I think it's OK.

(In reply to comment #21)

> - spam patterns are generally encoded in a limited number of character sets

Recently, Japanese spam has been increasing. The charsets frequently used are iso-2022-jp and shift-jis. There seem to be poorly written mass mailers: sometimes Content-Type/Content-Transfer-Encoding are missing; sometimes the header encoding is wrong--for example, the embedded charset name is iso-2022-jp whereas the actual charset is shift-jis.

> - therefore, catch rates do not increase with recoding (if anything, they are
> quite likely to decrease due to spam tricks causing us to pick the wrong
> character set)

I only have a few days of experience, but most Japanese hams now score BAYES_00 to BAYES_50 and most spams now score BAYES_99. The most significant example is that the Bayes score for one ham sample changed from BAYES_99 (unpatched) to BAYES_00 (patched)!! In general, Bayes scores for spam are unchanged, but scores for ham are decreased. Word matching becomes exact, so we will be able to maintain language-specific rules easily. In addition, the special note in user_prefs.template will no longer be necessary, although I have not tested this yet.
Created attachment 3330 [details]
Patch to enable normalization in config

(In reply to comment #28)

> FWIW, it's not good to have that level of differing behaviour, based on whether
> a dependency module is installed or not; I'd prefer to have another way of
> turning this functionality on or off. (whether that's manually enabled in
> config, or inferred from the mail traffic or environment is another question.)

I have no idea how to reasonably infer this from traffic or environment. Having it manually enabled from config is complicated by the fact that config is currently read *after* the Mail::SpamAssassin::Message object is created. The attached patch addresses that by initializing in Mail::SpamAssassin::parse, but since that is called in Mail::SpamAssassin::new in order to get the parser_dns_pms, that probably causes config files to be read too early.
*** Bug 3244 has been marked as a duplicate of this bug. ***
Committed revision 372567. This is now controlled by the "normalize_charset" config option. I believe this addresses Quinlan's veto.
Isn't this implemented? Can we close the bug?
Closing.
See also Bug 5691.
(In reply to comment #35)

> See also Bug 5691.

To summarise that issue: it appears the use of utf8::downgrade() in this patch is failing. Details over there.