SA Bugzilla – Bug 4636
Charset normalization plugin support
Last modified: 2007-10-24 09:12:58 UTC
Add support for a charset normalization plugin, per the thread "Preliminary design proposal for charset normalization support in SpamAssassin" on the dev list.
Created attachment 3192 [details]
Work in progress

Here is the code so far. The problem I'm running into is that the spamassassin script calls $spamtest->parse(), which in turn invokes call_plugins to normalize headers. This invocation of call_plugins happens before the config has been read.
Looks good IMO. I think we can assume that it is safe to read the configuration data at parse()-time. The configuration specification data is set when the Mail::SpamAssassin object is created; I don't see why it cannot then be read when $spamtest->parse() is called, instead of waiting for a later method such as ->check(). (BTW, I think Theo has something against the Message object holding a pointer to $main for some reason; or at least, he did in the past.)
I would appreciate anyone objecting to having the Node object keep a pointer to $main say what they suggest as an alternative. Node::header() is a public interface, used by plugins. It decodes MIME encoded-words, removing the charset labels in the process. The possible alternatives are:

1) Call the plugin from within Node::header(), using a pointer to $main kept by the Node object.

2) Call the plugin from within Node::header(), using a pointer to $main passed in by the caller. This would be an incompatible change to the public Node::header() interface.

3) Change the public Node::header() interface to return the raw form of the header, requiring all callers that want the decoded form to call a newly defined header-decoding method on some object that has a pointer to $main.

4) Require the charset normalization code to be hardwired into SpamAssassin, not in a plugin.
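For reference, the decoding step under discussion can be illustrated outside Perl. A minimal Python sketch (using the stdlib email.header module, purely for illustration) shows that an RFC 2047 encoded-word carries its charset label right up until the decode; it is this label that Node::header() discards:

```python
from email.header import decode_header

# An RFC 2047 encoded-word for the Japanese text "あい" in iso-2022-jp.
# decode_header() returns (bytes, charset) pairs, so the charset label
# is still available at this point; discarding it after decoding (as
# Node::header() does) is what makes later normalization awkward.
raw, charset = decode_header("=?iso-2022-jp?B?GyRCJCIkJBsoQg==?=")[0]
print(charset)                # -> iso-2022-jp
print(raw.decode(charset))    # -> あい
```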
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:11:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> I would appreciate anyone objecting to having the Node object keep a pointer to
> $main say what they suggest as an alternative.

Before we can discuss anything, it needs to be clear what you're trying to do. The appropriate discussion ought to happen in the ticket, and there are no details here so far.

> Node::header() is a public interface, used by plugins. It decodes MIME
> encoded-words, removing the charset labels in the process. The possible
> alternatives are:

Ok. What's the issue with that?

> 3) Change the public Node::header() interface to return the raw form of the
> header, requiring all callers that want the decoded form to call a newly defined
> header-decoding method off of some object that has a pointer to $main.

FWIW, it's possible to request the raw version already.

> 4) Require the charset normalization code be hardwired into SpamAssassin, not in
> a plugin.

If it's something we want to standardly do that isn't going to be swapped out or disabled (which I don't think it would be), this would be the way to go imo.
Pasting from my original message from the dev list. Since subsequent discussion indicated that rawbody rules should not charset-normalize the text, I've since found the body normalization best plugs into Node::rendered(). I suspect most Western installations would not want to pay the cost of charset normalization, so would want it disabled.

The following is a preliminary proposal for how to add support for normalization of charsets into Perl's Unicode support. The primary reason I want to do this work is to improve the ability of SpamAssassin to discriminate between Japanese ham and Japanese spam.

SpamAssassin currently ignores charset information, effectively assuming all mail is in iso-8859-1. This works for users whose ham is encoded in iso-8859-1, and mostly works for users whose ham is encoded in other single-byte charsets. For East Asian languages, this is insufficient for doing text analysis. Since a large number of SpamAssassin users are likely to be uninterested in East Asian ham and thus unlikely to want to pay the cost of charset normalization, the normalization support needs to be optional, defaulting to off.

Some messages contain unlabeled charsets; others use MIME charset labels. Some MIME charset labels are not useful (e.g. "unknown-8bit"). To handle such unlabeled data, it is necessary to run a charset detector over the text in order to determine what to convert it from. Encode::Guess effectively requires the caller to specify the language of the text, so I consider it too simplistic. Better would be Mozilla's universal charset detector, which I would have to wrap up as a CPAN module. It is common for Korean messages to have an incorrect MIME label of "iso-8859-1", so it may be necessary to run a charset detector even over MIME-labeled charsets.

After the charset has been determined, either from the MIME label or the charset detector, the data needs to be converted from that charset to Perl's internal utf8 form. Encode::decode() is the obvious choice for this, though I can see reasons why an installation might want to be able to replace the charset converters with some other implementation.

The following functions, immediately after they call Mail::SpamAssassin::Message::Node::decode, need to call a function that does charset normalization:

* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array

Furthermore:

* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header

also need to call a function to do charset normalization: _decode_header for unlabeled charset data, __decode_header for MIME encoded-words.

This new charset normalization function will take as arguments the text and any MIME charset label. The function calls the charset detector and converter as necessary and returns the normalized text in Perl's internal form. The returned text will only have the utf8 flag set if the input charset was not us-ascii or iso-8859-1. This new charset normalization function should most likely use a plugin callback to do all the work, though it only makes sense for one loaded plugin to implement the callback. If no plugin implements the callback, then it should simply return the input text, preserving the current behavior.
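The shape of the proposed normalization function can be sketched roughly as follows (in Python rather than Perl, purely for illustration; detect_charset is a hypothetical stand-in for the Mozilla universal charset detector mentioned above, and the fallback behaviour is an assumption, not the committed implementation):

```python
def detect_charset(raw):
    # Trivial stand-in for a real statistical detector such as
    # Mozilla's universal charset detector (Encode::Detect on CPAN).
    if raw.startswith(b"\x1b$B"):        # iso-2022-jp shift sequence
        return "iso-2022-jp"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

def normalize(raw, mime_charset):
    # Trust a usable MIME label; otherwise run the detector.
    charset = mime_charset
    if charset is None or charset.lower() == "unknown-8bit":
        charset = detect_charset(raw)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Preserve the pre-patch behaviour: assume iso-8859-1.
        return raw.decode("iso-8859-1")

print(normalize(b'\x1b$B$"$$\x1b(B', None))   # detected as iso-2022-jp
print(normalize(b"caf\xe9", "iso-8859-1"))    # labeled latin-1
```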
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:31:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> indicated that rawbody rules should not charset-normalize the text, I've since
> found the body normalization best plugs into Node::rendered(). I suspect most
> Western installations would not want to pay the cost of charset normalization,
> so would want it disabled.

Earlier in the ticket you were talking about header normalization. Body normalization is a different beast (but it's easier to deal with imo).

> The following is a preliminary proposal for how to add support for
> normalization of charsets into Perl's Unicode support. The primary
> reason I want to do this work is to improve the ability of
> SpamAssassin to discriminate between Japanese ham and Japanese spam.

It's worth noting that this is actually going to be a much larger issue than just having a plugin, btw. The main problem is that SpamAssassin very specifically disables unicode in every module via "use bytes" (according to the svn log it looks like it was added in at r3997 back in December 2002).

> Since a large number of SpamAssassin users are likely to be
> uninterested in East Asian ham and thus unlikely to want to pay the
> cost of charset normalization, the normalization support needs to be
> optional, defaulting to off.
>
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
>
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array

I was thinking that the plugin would be called by check_start, then get an array of parts via find_parts(), then do any manipulation of the data as required per-part (either dealing with the decoded or the rendered portions, or both). Since find_parts() returns references to the actual parts in the tree, you can just modify as necessary without jumping through a lot of hoops. Then later, when those other functions get called, everything would already be normalized.

Potentially, there'd be a new function in Message like "clear_rendered_cache" or something which would delete the cached forms of text_rendered, text_visible_rendered, text_invisible_rendered, and (if necessary, via a different function) text_decoded. That way you would be sure that the normalized data is what's used after the process occurs, even if those other functions were called by something else previously.

> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
>
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.

I can't think of an easy way to do this other than to do the work in Node itself, or to have a plugin do something similar for the headers as suggested above with the body and manipulate the internal data directly. It's not very clean from an OO perspective. Arguably we'd always want to make sure the message is in utf-8 format internally, and so the code could just be in Message::Node.
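The clear_rendered_cache idea can be sketched as follows (a hypothetical Python mock-up, not the actual SpamAssassin classes; the names mirror the Perl attributes discussed above but are otherwise assumptions):

```python
class Part:
    """A MIME part with a memoized rendered form."""
    def __init__(self, decoded):
        self.decoded = decoded
        self._rendered = None            # cached rendered text

    def rendered(self):
        if self._rendered is None:
            # Stand-in for real rendering (HTML stripping etc.).
            self._rendered = self.decoded.strip()
        return self._rendered

class Message:
    def __init__(self, parts):
        self.parts = parts

    def find_parts(self):
        return self.parts                # references to actual parts

    def clear_rendered_cache(self):
        # Drop memoized forms so the next rendered() call re-renders
        # the (possibly plugin-modified) decoded text.
        for p in self.parts:
            p._rendered = None

msg = Message([Part("  some text  ")])
part = msg.find_parts()[0]
print(part.rendered())                   # caches the rendered form
part.decoded = "  normalized text  "     # a plugin rewrites the part
print(part.rendered())                   # still the stale cached form
msg.clear_rendered_cache()
print(part.rendered())                   # re-rendered from new data
```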
(In reply to comment #6)

> Earlier in the ticket you were talking about header normalization. Body
> normalization is a different beast (but it's easier to deal with imo).

They are both components of this enhancement. The header argument is easier to make, though Node::rendered() has a similar argument, since charset normalization has to happen before it feeds the text into HTML::Parser.

> It's worth noting that this is actually going to be a much larger issue
> than just having a plugin, btw. The main problem is that SpamAssassin
> very specifically disables unicode in every module via "use bytes"
> (according to the svn log it looks like it was added in at r3997 back
> in December 2002).

I got rid of a bunch of those in r315047. I should audit the remaining ones.

> I was thinking that the plugin would be called by check_start, then
> get an array of parts via find_parts(), then do any manipulation of
> the data as required per-part (either dealing with the decoded or the
> rendered portions, or both).

Something like this would make sense if normalization/rendering were done outside Message::Node. It would be harder to do lazy normalization of headers that way.

> Potentially, there'd be a new function in Message like
> "clear_rendered_cache" or something which would delete the cached forms
> of text_rendered, text_visible_rendered, text_invisible_rendered, and
> (if necessary/different function) text_decoded.

If the cache got filled before the normalization plugin got invoked, that would indicate a bug in the order of execution.

> It's not very clean from an OO perspective. Arguably we'd always want
> to make sure the message is in utf-8 format internally, and so the code
> could just be in Message::Node.

There's no clean OO separation between data and view, but that was preexisting--Message::Node already knows almost everything about SpamAssassin's view of MIME entities.

There is the issue of whether charset normalization should be:

A) a plugin
B) hardcoded, but enabled/disabled by config
C) hardcoded, always on for sufficiently recent versions of Perl

Once I get a charset normalizer hooked up, I can get some numbers on how much it costs. I was operating under the assumption that it would be too expensive to enable for everybody. I was thinking that having a plugin allowed people more flexibility in tuning the normalization process, but perhaps that's not strictly necessary.
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 02:53:29PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> > Potentially, there'd be a new function in Message like
> > "clear_rendered_cache" or something which would delete the cached forms
> > of text_rendered, text_visible_rendered, text_invisible_rendered, and
> > (if necessary/different function) text_decoded.
>
> If the cache got filled before the normalization plugin got invoked, that would
> indicate a bug in the order of execution.

Not really. What if I have another plugin that gets called at the same time or earlier because it wants to do something, whereby it populates the cache?

> There's no clean OO separation between data and view, but that was
> preexisting--Message::Node already knows almost everything about SpamAssassin's
> view of MIME entities.

I'm not sure I follow here. Message and Message::Node are wholly separate from the rest of the code, with the exception of a handful of calls into Util. What does the SA code know about the message that doesn't come from Message or Message::Node, which then those hook into?

> I was thinking that having a plugin allowed people more flexibility in tuning
> the normalization process, but perhaps that's not strictly necessary.

I would imagine that there wouldn't be a lot of overhead if there's not a lot of normalization going on. If incoming messages aren't in any special charset, I would assume a negligible time penalty to processing. As I recall, we added the "use bytes" stuff everywhere because there were large performance issues with REs against Unicode data, but that was almost three years ago so the details escape me.
Subject: Re: Charset normalization plugin support

> As I recall, we added the "use bytes" stuff everywhere because there were
> large performance issues with RE against Unicode data, but that was almost
> three years ago so the details escape me.

There were performance issues, and in addition we had made the decision to treat the message data as 7-bit ASCII where possible, so it made sense at the time. There was also a good deal of cargo-culting, in that "use bytes" would hopefully just avoid any unicode-related issues entirely. ;) IMO it now makes sense to try to deal with Unicode more sensibly, as John's plan outlines.
(In reply to comment #8)

> Not really. What if I have another plugin that gets called at the same time
> or earlier because it wants to do something, whereby it populates the cache?

Since that other plugin should have gotten data normalized by the charset plugin but did not, that would definitely be a bug.

> I'm not sure I follow here. Message and Message::Node are wholly separate
> from the rest of the code, with the exception of a handful of calls into Util.
> What does the SA code know about the message that doesn't come from Message or
> Message::Node, which then those hook into?

It's the other way around. Message and Message::Node have detailed knowledge of how SpamAssassin wants to view/render messages for matching. To have SpamAssassin's rendering of messages be configurable and/or controllable by plugins, those objects necessarily need a reference to the configuration/plugin state.

> I would imagine that there wouldn't be a lot of overhead if there's not a lot
> of normalization going on. If incoming messages aren't in any special
> charset, I would assume a negligible time penalty to processing.

Charset detection is likely to have to be run on everything, regardless of whether or not the messages are in a special charset.
Created attachment 3213 [details]
Work in progress

Here's a version that normalizes whenever Perl is a sufficiently new version and the Encode::Detect module is installed. An initial run shows a performance cost of 270%. Optimization to follow.
Committed revision 366594. Performance tests on a US feed show a 10% performance penalty when doing charset normalization.
This commit seems to have broken make test. I have installed Encode::Detect 0.01 from CPAN.

t/uri_text..................NOK 8# Failed test 8 in t/uri_text.t at line 47 fail #8
failure: did not find /^http://foo23500.com$/
t/uri_text..................FAILED test 8
Failed 1/104 tests, 99.04% okay

The pattern that is now broken is:

  http://foo23500.com{ESC}(B/    ^http://foo23500.com$

Is this expected?
I just found this patch today and am testing it right now. It seems to work well for iso-2022-jp, shift-jis, and utf-8 text bodies.

> http://foo23500.com{ESC}(B/ ^http://foo23500.com$

This test stream is not good, because this pattern never occurs in iso-2022-jp. In iso-2022-jp, Japanese characters are ALWAYS enclosed by "{ESC}$B" and "{ESC}(B", as in "{ESC}$B$"$${ESC}(B" (where "{ESC}" is 0x1b). The test pattern contains only the trailing escape sequence. The _normalize function cannot handle this well, and it is not the responsibility of this function or of Encode.

In addition (though this is not essential to the _normalize function), the header should be changed. Typically, the Content-Type for a Japanese iso-2022-jp text body is

  Content-Type: text/plain; charset=iso-2022-jp

Actually, some spam ignores MIME headers, so we should test against miscellaneous wrong headers, including missing MIME headers.
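The escape-sequence structure described above can be confirmed with any iso-2022-jp codec; for instance, a quick Python check:

```python
# iso-2022-jp shifts into JIS X 0208 with ESC $ B and back to ASCII
# with ESC ( B, so Japanese text is always bracketed by both sequences.
# A bare trailing "{ESC}(B" with no opening shift (as in the broken
# test pattern) never occurs in well-formed iso-2022-jp.
encoded = "あい".encode("iso-2022-jp")
print(encoded)                            # b'\x1b$B$"$$\x1b(B'
assert encoded.startswith(b"\x1b$B")
assert encoded.endswith(b"\x1b(B")
```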
(In reply to comment #14)

> Actually, some spams ignore MIME header, so we should test against miscellaneous
> wrong header including no MIME headers.

A significant portion of Asian spam has incorrect MIME charset labels--that is why _normalize prefers the detected charset over the labeled charset for Asian charsets. That test failure doesn't surprise me--I suspect with Encode::Detect the URL decodes to http://foo23500.com/
There are other test failures, too; recreate.t is failing without the Encode::Detect module installed. See the buildbot output.
Some tests fail because a UTF-8 encoded character can contain the byte \xa0, which is treated as whitespace in Message.pm, Message/Node.pm, et al. For example (bytes in octal):

  "Joho" (in Japanese): \346 \203 \205 \345 \240 \261
  "Shirouto": \347 \264 \240 \344 \272 \272
  "Shouko": \350 \250 \274 \346 \213 \240

Any idea how to solve this problem?
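The byte values quoted above can be verified directly; a quick Python check shows that each of these words carries 0xa0 as a UTF-8 continuation byte, which byte-oriented code mistakes for a latin-1 non-breaking space:

```python
# 情報 (joho), 素人 (shirouto), 証拠 (shouko): the UTF-8 encoding of
# each contains the byte 0xa0 as a continuation byte, so any code that
# treats \xa0 as whitespace splits these words mid-character.
for word in ("情報", "素人", "証拠"):
    data = word.encode("utf-8")
    print(word, " ".join(oct(b) for b in data), "contains 0xa0:", 0xa0 in data)
```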
(In reply to comment #16) > there's other test failures, too; recreate.t is failing, without the > Encode::Detect module installed. see the buildbot output. Works for me. I can't see how my patch could possibly prevent sandbox-hstern.pm from loading.
(In reply to comment #13)

> t/uri_text..................NOK 8# Failed test 8 in t/uri_text.t at line 47 fail #8

Fixed in revision 367351.

(In reply to comment #17)

> Some test failes because UTF-8 character can contain \xa0 code but
> it is considered as whitespace at Message.pm, Message/Node.pm et al.

Could you be more specific as to the test that is failing? I'm not seeing any failures. Perhaps you have a version of HTML::Parser older than 3.46?
Require HTML::Parser 3.46 or higher to enable charset normalization. Committed revision 367361.
I've commented in the past that I'm opposed to the idea of character set normalization and that this functionality would need to be isolated with options and/or a plugin interface. My reasoning is as follows:

- there is a performance penalty associated with character set recoding
- spam patterns are generally encoded in a limited number of character sets
- therefore, catch rates do not increase with recoding (if anything, they are quite likely to decrease due to spam tricks causing us to pick the wrong character set)
- however, ham catch rates will INCREASE since the amount of ham matching the pattern is likely to increase (matches being accidental)
- so, S/O is likely to go down for multi-character-set rules (more often than not) and performance will go down as well

For these reasons, I am -1 (that is, vetoing) the current form of this code that has the performance loss and requires recoding. I would also be -1 on requiring any non-utf8 rules to be utf8.

Basically, SpamAssassin does need better understanding of character sets and the ability to support more character sets better, for rules, descriptions, rendering, and tokenization, but I see no benefit to recoding messages, especially since anti-spam patterns are written against a small subset of possible encodings.
Created attachment 3315 [details]
jismail.eml

This file contains an iso-2022-jp encoded e-mail for testing.

Created attachment 3316 [details]
utftest-subset.cf

This file contains a subset of my rules. This file is UTF-8 encoded.

Created attachment 3317 [details]
utftest-full.cf

This file contains the full set of Japanese spam words I currently have. UTF-8 encoded.
(In reply to comment #19)

> Could you be more specific as to the test that is failing? I'm not seeing any
> failures. Perhaps you have a version of HTML::Parser older than 3.46?

I am sorry that I did not describe the method I used. I tested the patch with my own ruleset and a sample e-mail file, which I have attached. utftest-full.cf is long, so please use utftest-subset.cf as the ruleset and test with jismail.eml. My HTML::Parser is 3.36, but "\xa0" is part of the UTF-8 encoded data itself, so I do not think the version of HTML::Parser is the problem here.
(In reply to comment #21)

> For these reasons, I am -1 (that is, vetoing) the current form of this code
> that has the performance loss and requires recoding.

If the Encode::Detect module is not installed, there is negligible performance loss and no change of behavior from before the patch (other than the bug fix to include text/plain content in the "visible rendered" form). What do you mean by "requires recoding"? I wasn't aware this change required anything. Is your point that there needs to be some way to disable charset normalization in the case where Encode::Detect is installed?

> I would also be -1 on requiring any non-utf8 rules to be utf8.

I don't understand what you mean by this.

> Basically, SpamAssassin does need better understanding of character set and
> ability to support more character sets better, for rules, descriptions,
> rendering, and tokenization, but I see no benefit to recoding messages,
> especially since anti-spam patterns are written against a small subset of
> possible encodings.

By that argument, there is no need for "body" or "rawbody" rules. There is a small subset of possible transfer encodings and the like. Motoharu-san has given clear, convincing evidence of the value of charset normalization. Are you saying that you have contradictory experience in the Japanese environment?
(In reply to comment #25)

> My HTML::Parser is 3.36, but "\xa0" is a part of UTF-8 encoded data itself,
> so I do not think the version of HTML::Parser is problem here.

HTML::Parser versions prior to 3.46 have a known bug which triggers when they are fed UTF-8 input containing \xa0 bytes. Please try with a recent version of HTML::Parser.
> If the Encode::Detect module is not installed, there is negligible performance
> loss and no change of behavior from before the patch

FWIW, it's not good to have that level of differing behaviour based on whether a dependency module is installed or not; I'd prefer to have another way of turning this functionality on or off. (Whether that's manually enabled in config, or inferred from the mail traffic or environment, is another question.)
(In reply to comment #28)

> FWIW, it's not good to have that level of differing behaviour, based on whether
> a dependency module is installed or not; I'd prefer to have another way of
> turning this functionality on or off. (whether that's manually enabled in
> config, or inferred from the mail traffic or environment is another question.)

It may sound odd, but an option is an option. That is, an option exists to reflect the user's preference, the user's choice. For those who live in Asian countries, multi-byte handling is not special. Charset normalization should be transparent to the user if the requirements are met, so I hope we do not create an option for this operation. I do think there should be an option if we support UTF-8 aware regexps, because there would be a significant performance loss; that would be a user's choice. Charset normalization is not a user's choice for Japanese and other Asian people. If $normalize_supported in Node.pm were evaluated very many times, it would be a problem, but if it is evaluated only once or a few times, I think it's OK.

(In reply to comment #21)

> - spam patterns are generally encoded in a limited number of character sets

Recently, Japanese spam has been increasing. The charsets frequently used are iso-2022-jp and shift-jis. There seem to be poorly written mass mailers: sometimes Content-Type/Content-Transfer-Encoding are missing; sometimes the header encoding is wrong--for example, the embedded charset name is iso-2022-jp whereas the actual charset is shift-jis.

> - therefore, catch rates do not increase with recoding (if anything, they are
> quite likely to decrease due to spam tricks causing us to pick the wrong
> character set)

I only have a few days of experience, but most Japanese hams now score BAYES_00 to BAYES_50 and most spams now score BAYES_99. The most significant example is that the Bayes score for one ham sample changed from BAYES_99 (unpatched) to BAYES_00 (patched)!! In general, Bayes scores for spam are unchanged, but scores for ham are decreased. Word matching becomes exact, so we will be able to maintain language-specific rules easily. In addition, the special note in user_prefs.template will no longer be necessary, although I have not tested this yet.
Created attachment 3330 [details]
Patch to enable normalization in config

(In reply to comment #28)

> FWIW, it's not good to have that level of differing behaviour, based on whether
> a dependency module is installed or not; I'd prefer to have another way of
> turning this functionality on or off. (whether that's manually enabled in
> config, or inferred from the mail traffic or environment is another question.)

I have no idea how to reasonably infer this from traffic or environment. Having it manually enabled from config is complicated by the fact that config is currently read *after* the Mail::SpamAssassin::Message object is created. The attached patch addresses that by initializing in Mail::SpamAssassin::parse, but since that is called in Mail::SpamAssassin::new in order to get the parser_dns_pms, that probably causes config files to be read too early.
*** Bug 3244 has been marked as a duplicate of this bug. ***
Committed revision 372567. This is now controlled by the "normalize_charset" config option. I believe this addresses Quinlan's veto.
Isn't this implemented? Can we close the bug?
Closing.
See also Bug 5691.
(In reply to comment #35)

> See also Bug 5691.

To summarise that issue: it appears the use of utf8::downgrade() in this patch is failing. Details over there.