SA Bugzilla – Bug 3787
HTML email 'malformed UTF-8 character' warnings from SA
Last modified: 2007-07-10 09:36:37 UTC
First, I apologise for not testing this on 3.0; I don't have that installed to test with. Second, I believe I have the languages set up correctly, so this shouldn't be related to the UTF-8 problem in some Linux versions.
The attached spam, when run through 'spamassassin -t' (and, I presume, at other times), produces about 40 messages of the following sort. This might still be a problem in 3.0.
Malformed UTF-8 character (unexpected non-continuation byte 0x54, immediately after start byte 0xed) in transliteration (tr///) at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding start byte) in transliteration (tr///) at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding start byte) in transliteration (tr///) at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding start byte) in transliteration (tr///) at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
The particular character value it complains about in each line varies.
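For anyone unfamiliar with the two warning variants quoted above: one means a byte that looks like a UTF-8 start byte was not followed by a continuation byte (10xxxxxx), the other means a continuation byte appeared with no start byte before it. The Python sketch below illustrates those two conditions only. It is not Perl's actual decoder, the message wording is my own, and `classify_byte` and `find_malformed` are hypothetical names.

```python
def classify_byte(b):
    """Classify a single byte by the role it can play in UTF-8."""
    if b < 0x80:
        return "ascii"
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b < 0xE0:
        return "start2"        # 110xxxxx, expects 1 continuation byte
    if b < 0xF0:
        return "start3"        # 1110xxxx, expects 2 continuation bytes
    if b < 0xF8:
        return "start4"        # 11110xxx, expects 3 continuation bytes
    return "invalid"


def find_malformed(data):
    """Return a list of human-readable problems found in a byte string."""
    needed = {"start2": 1, "start3": 2, "start4": 3}
    problems = []
    i = 0
    while i < len(data):
        kind = classify_byte(data[i])
        if kind == "ascii":
            i += 1
        elif kind == "continuation":
            problems.append(
                "continuation byte 0x%02x with no preceding start byte" % data[i]
            )
            i += 1
        elif kind in needed:
            start = data[i]
            i += 1
            for _ in range(needed[kind]):
                if i < len(data) and classify_byte(data[i]) == "continuation":
                    i += 1
                    continue
                if i < len(data):
                    problems.append(
                        "non-continuation byte 0x%02x after start byte 0x%02x"
                        % (data[i], start)
                    )
                else:
                    problems.append("input ends after start byte 0x%02x" % start)
                break  # the offending byte is re-examined on the next pass
        else:
            problems.append("byte 0x%02x can never appear in UTF-8" % data[i])
            i += 1
    return problems


if __name__ == "__main__":
    # Byte pairs taken from the warnings quoted in this report.
    for sample in (b"\xedT", b"\x86", "\u00e9".encode("utf-8")):
        print(sample, find_malformed(sample))
```

Feeding it the byte values from a warning shows which of the two conditions that warning describes; well-formed UTF-8 yields an empty list.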
Created attachment 2352 [details] Spam that causes 'malformed utf-8 character' complaints
Loren, I cannot reproduce this with my svn copy of SA. Can you reproduce on a current version, or should we close this bug entry?
Unable to reproduce under most recent svn. Appears to have been fixed by Bug 4046 *** This bug has been marked as a duplicate of 4046 ***
Not a duplicate.
See attachment 3280 [details] for additional test case.
let me narrow this msg.txt.gz sample down for you....
----------------------------------------------
Content-Type: text/html; charset=us-ascii <html><body> TUMS® Smoothies™ </body></html>
----------------------------------------------
will produce many of these utf-8 warns. my guess is that HTML::Parser decodes ® and ™, and then the match against SARE rules that contain things like [\*ýýýý] and/or [\x96-\x97] causes this?
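To make the guess above concrete, here is a small Python sketch of what decoding those two symbols produces. It assumes the original message carried them as the HTML entities `&reg;` and `&trade;` (the comment only says they get "decoded"), and it uses Python's `html.unescape` purely as a stand-in for HTML::Parser, so treat it as an illustration rather than a reproduction. `decode_and_inspect` is a hypothetical helper name.

```python
import html


def decode_and_inspect(fragment):
    """Decode HTML entities and report every resulting non-ASCII character.

    Returns the decoded text plus a list of
    (character, code point, UTF-8 bytes in hex) tuples.
    """
    decoded = html.unescape(fragment)
    report = []
    for ch in decoded:
        if ord(ch) > 0x7F:
            utf8_hex = " ".join("%02x" % b for b in ch.encode("utf-8"))
            report.append((ch, "U+%04X" % ord(ch), utf8_hex))
    return decoded, report


if __name__ == "__main__":
    # Assumption: the sample body used entities rather than raw bytes.
    text, report = decode_and_inspect("TUMS&reg; Smoothies&trade;")
    print(text)
    for row in report:
        print(row)
```

The trademark sign decodes to U+2122, which does not fit in a single byte, so once it has been decoded the text can no longer be treated as a plain one-byte-per-character string.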
Do any of the SARE rules use \C ? That would definitely do it. Reducing to a single SARE rule would be helpful.
from the 70_sare_obfu.cf ruleset (because that's the only one i was testing), the following rules result in ~22k warns (~2300 per rule).
# grep UTF spamd.debug.txt | cut -d' ' -f 22 | sed -e 's/\,//g' | sort | uniq
__SARE_OBFU_CIALIS2
SARE_OBFU_GUARANTEE
__SARE_OBFU_MEDS2
SARE_OBFU_PRESCRIP
SARE_OBFU_PRESCR_SPL1
__SARE_OBFU_PRICE1
__SARE_OBFU_SOFT2
SARE_OBFU_VICODIN
__SARE_OBFU_VISIT1
SARE_OBFU_XANAX
# grep -c SARE_OBFU_CIALIS2 spamd.debug.txt
2345
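The pipeline above picks the rule name out by field position, which breaks if the log prefix changes. A sketch of the same tally in Python, keyed on the "rule NAME," text that appears in the warnings in this report, might look like the following. `count_warnings_by_rule` is a hypothetical helper, and the log file name in the usage note is only an example.

```python
import re
from collections import Counter

# Matches the "..., rule NAME, line N" part of a Malformed UTF-8 warning.
RULE_RE = re.compile(r"Malformed UTF-8 character.*?\brule (\S+?),")


def count_warnings_by_rule(lines):
    """Count Malformed UTF-8 warnings per rule name."""
    counts = Counter()
    for line in lines:
        match = RULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[19478] warn: Malformed UTF-8 character (unexpected non-continuation "
        "byte 0x00, immediately after start byte 0xcf) in pattern match (m//) "
        "at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_PRICE1, "
        "line 1, <GEN5> line 10.",
        "[19478] dbg: uri: running uri tests; score so far=0",
    ]
    for rule, n in sorted(count_warnings_by_rule(sample).items()):
        print(n, rule)
```

To run it over a real debug log, pass an open file instead of the sample list, e.g. `count_warnings_by_rule(open("spamd.debug.txt", errors="replace"))`.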
FWIW, i dont get any utf-8 warns on Loren's sample....
(i need a quicker box ... and some better skills with grep!) with my full RDJ ruleset, i get:
% grep "Malformed UTF-8" spamd.debug.txt | cut -d' ' -f 22 | sed -e 's/\,//g' | sort | uniq
SARE_OBFUAUCTION
SARE_OBFUFCK1
SARE_OBFUGIRLS
SARE_OBFUGNGBNG
SARE_OBFUHARDCORE
SARE_OBFUMONEY1
SARE_OBFUPENIS
SARE_OBFUPORNO
SARE_OBFUPUSS
SARE_OBFUTEENS
SARE_OBFUVRGN
SARE_OBFU_GUARANTEE
SARE_OBFU_PRESCRIP
SARE_OBFU_PRESCR_SPL1
SARE_OBFU_VICODIN
SARE_OBFU_XANAX
SARE_SPEC_REPL_OBFU1
SARE_SPEC_REPL_OBFU2
SARE_SPEC_REPL_OBFU3
SARE_SPEC_REPL_OBFU4
SARE_SPEC_REPL_OBFU5
SARE_SPEC_REPL_OBFU6
__SARE_OBFU_CIALIS2
__SARE_OBFU_MEDS2
__SARE_OBFU_PRICE1
__SARE_OBFU_SOFT2
__SARE_OBFU_VISIT1
apparently, only *OBFU* rules ...
confirm no hits on loren's sample: % grep -c "Malformed UTF-8" loren.log.txt 0
I'm convinced Loren's problem is different than the recent SARE issue. As Loren's issue was WORKSFORME, let's just repurpose this bug for the SARE issue.
also: fwiw, for my sit'n, Hardware: Macintosh OS: Mac OS X (10.3.9 & 10.4.3)
so as I asked in bug 4691, the wrong bug ;), people seeing this issue, please post the following: - the exact warning messages (including logged byte values) - perl version, from perl -V - whether or not the patch from bug 4691 is in use (just to make sure) the perl version data in particular is useful.
bug 4046 had my sample message and debug output that contained the necessary bytes. as far as i can see, its always "non-continuation byte 0x00". attachment 3279 [details] has an abbreviated debug of it. if you want a full debug of attachment 3280 [details] (5.2MB 23k lines), i'd have to send that some other way since bz wont take it. i'm doing this on stock SVN, no patches, Fedora core 3, perl 5.8.5, HTML::Parser 3.46.
> - the exact warning messages (including logged byte values) http://www.mail-archive.com/dev@spamassassin.apache.org/msg11995.html > - perl version, from perl -V as per http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4046, % perl -e 'use HTML::Parser; print HTML::Parser->VERSION'; 3.46 % spamassassin -V SpamAssassin version 3.2.0-r322462 running on Perl version 5.8.6 > the perl version data in particular is useful. more here: http://paste.lisp.org/display/14093 i see the same with 5.8.7, as well > - whether or not the patch from bug 4691 is in use (just to make sure) not here
'if you want a full debug of attachment 3280 [details] (5.2MB 23k lines), i'd have to send that some other way since bz wont take it. ' thanks Dallas, I don't think so. just a selection of the 'non-continuation byte' warnings, to see if it's always 0x00 or not. (so far it seems not)
Unable to reproduce with current SVN plus 70_sare_obfu.cf. ./spamassassin -V SpamAssassin version 3.2.0-r322462 running on Perl version 5.8.6 Tried with perl 5.8.6, 5.8.7, and 5.8.4 HTML::Parser version 3.45
john, r322462? i can reproduce at will with (at least) r349275 and later ... seems Dallas may be able to as well, but I'm sure he can chime in for himself :-) can you, perhaps, verify your "unable to reproduce" with a more recent version? richard
It's r351501.
agh, my mistake. sorry ... i *still* get occasionally confused by the co'd revision, and that which spamassassin -v reports :-/
tested r351835 on rh73 perl 5.6.1 and dont have any UTF-8 warns there... but didnt expect to since UTF-8 stuff only started showing up in rh8 if i remember correctly.
tested r351835 on fc3 perl 5.8.5-18 (5.8.5-17 had a UTF-8 fix mentioned in the CHANGELOG and I thought that was it but wasnt)... UTF-8 warns all over.
just now tested on fc4 perl 5.8.6-4, sa r351835, and i get UTF-8 warns here as well with 70_sare_obfu.cf running. i'm sure anyone running fc4 can reproduce this... as this was a brand new install of fc4 here.
1) checkout svn
2) perl Makefile.PL, make install
3) copy 70_sare_obfu.cf to /etc/mail/spamassassin
4) in one shell, start spamd like so... $ spamd -D -L
5) in another shell, create the offending file as shown above in comment #6 and save it.
6) $ cat file | spamc
7) watch other window for debug of the scan
Dallas, does the problem reproduce using the spamassassin script directly, or is it necessary to use spamd?
on fc3
# cat test | spamassassin -D 2>&1 | grep -c "Malformed UTF"
351
on fc4
# cat test | spamassassin -D 2>&1 | grep -c "Malformed UTF"
858
the file 'test' is identical on both test systems... so go figure :)
dallas: same version of perl for both, or: fc3 perl 5.8.5-18 fc4 perl 5.8.6-4 as above?
fc3 is now perl-5.8.5-20 fc4 is now perl-5.8.6-18 utf warns still present after those updates.
fyi, HTML::Parser has been upgraded ... HTML::Parser 3.48 G/GA/GAAS/HTML-Parser-3.48.tar.gz cref: http://search.cpan.org/src/GAAS/HTML-Parser-3.48/Changes
ya, i tested that friday.. but it was just a change reverted from 3.47 according to the changelog if i remember right.
I've been able to reproduce errors with HTML::Parser 3.45 and earlier. The bug is fixed in HTML::Parser 3.46: https://rt.cpan.org/Ticket/Display.html?id=15068 I believe we should bump the minimum HTML::Parser version to 3.46.
as far as i see, 3.46 doesnt fix this. nor do 3.47 or 3.48. this was just now tested on a FC4 box with perl 5.8.6
# spamd -D -L > spamd.debug 2>&1
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.46
# cat /root/test | spamc
# grep -c Malformed spamd.debug
858
# spamd -D -L > spamd.debug 2>&1
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION . "\n"';
3.48
# cat /root/test | spamc
# grep -c Malformed spamd.debug
858
i've been searching thru some of the perl lists re: "malformed UTF-8" error. scads of hits, actually ... very few cogent explanations/resolutions that i've found so far, tho :-/ on the exim list, however, i just noted the following from Philip Hazel, (author of Exim & PCRE): http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20051212/msg00097.html which keeps nagging at me in *this* context. just (pseudo)random thought ...
The citation in comment 31 is not useful. I suspect the problem might be specific to Fedora Core, since that appears to be common to all reporters. I suggest trying to reproduce with a stock perl.
correct me if i'm wrong, but i believe richard has the same results on all versions of OSX he's tested.
dallas, that's correct. all combos of: OSX 10.3.9/10.4.3 Perl 5.8.5/5.8.6/5.8.7 HTML::Parser 3.45/3.46/3.48 (didn't test 3.47) reproduce the error(s) for me. fwiw, some of the tests were on different boxes, as well ... richard
given that Dallas has noted that the min-version bump checked into trunk didn't fix it, can we revert that change? I'm asking because Ubuntu, at least, is not yet distributing 3.46; I had to hit CPAN for it. in my opinion that implies that it's too "bleeding-edge" a requirement.
Also, I note that that H:P bug does not seem to relate at all to the messages posted here as test cases -- I see no "A0" bytes in either.
PS: more env details requests. Could everyone post the output of
echo $LANG
set | grep LC_
I wonder if it's something to do with a UTF-8 locale setting. It'd be good to discount that possibility.
Also, Dallas and OpenMacNews -- can you *attach* *corresponding* test messages and some "Malformed UTF-8" lines produced by those messages? Right now the test messages are on various pastebots, at dead URLs, scattered between bugs etc. and it's impossible to reliably map one to the other. (Attaching them to bugs is very important so that the URLs won't "rot" over time.)
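For anyone who would rather script the locale check requested above, a rough Python equivalent of `echo $LANG; set | grep LC_` follows. It only inspects environment variables and flags the ones whose value names a UTF-8 locale; `locale_settings` and `utf8_locale_vars` are hypothetical helper names, and the "utf-8"/"utf8" match is an assumption about how such locales are usually spelled.

```python
import os
import re


def locale_settings(env=None):
    """Return the LANG and LC_* variables from an environment mapping."""
    env = os.environ if env is None else env
    return {k: v for k, v in env.items() if k == "LANG" or k.startswith("LC_")}


def utf8_locale_vars(env=None):
    """Return the sorted names of locale variables that select a UTF-8 locale."""
    return sorted(
        name
        for name, value in locale_settings(env).items()
        if re.search(r"utf-?8", value, re.IGNORECASE)
    )


if __name__ == "__main__":
    for name, value in sorted(locale_settings().items()):
        print("%s=%s" % (name, value))
    print("UTF-8 locale variables:", utf8_locale_vars() or "none")
```

An empty result corresponds to the `en_US` with no `LC_*` variables that reporters have posted so far.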
The min-version bump got reverted last night since it broke the buildbots. There is definitely a reproducible bug with U+00A0 characters which is fixed by 3.46, though one probably needs to have not-yet-committed charset normalization changes in order to trigger it. By the time SpamAssassin 3.2 is released, HTML::Parser 3.46 or later will have had time to propagate.
[jgmyers@pong spamweights]$ echo $LANG
en_US
[jgmyers@pong spamweights]$ set | grep LC_
[jgmyers@pong spamweights]$
'The min-version bump got reverted last night since it broke the buildbots.' yep, the rest of my email eventually came through and I realised that ;) thanks.
per request & for completeness ...
% echo $LANG
en_US
% set | grep LC_
%
% uname -a
Darwin devbox 8.3.0 Darwin Kernel Version 8.3.0: Mon Oct 3 20:04:04 PDT 2005; root:xnu-792.6.22.obj~2/RELEASE_PPC Power Macintosh powerpc
% perl -V
Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
Platform: osname=darwin, osvers=8.2.0, archname=darwin-thread-multi-2level
uname='darwin devbox 8.2.0 darwin kernel version 8.2.0: fri jun 24 17:46:54 pdt 2005; root:xnu-792.2.4.obj~3release_ppc power macintosh powerpc '
% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.48
with: /var/MailServer/Conf/SA/Dist populated by the SA distro, and, /var/MailServer/Conf/SA/Local populated by RDJ, with:
================================================================
${EDITOR} .../rules_du_jour.conf
...
TRUSTED_RULESETS="SARE_REDIRECT_POST300 SARE_EVILNUMBERS0 SARE_EVILNUMBERS1 SARE_BAYES_POISON_NXM SARE_HEADER SARE_HEADER_ENG SARE_FRAUD SARE_SPOOF SARE_RANDOM SARE_SPAMCOP_TOP200 SARE_OEM SARE_UNSUB SARE_URI_ENG BOGUSVIRUS SARE_SPECIFIC SARE_OEM SARE_HTML SARE_OBFU SARE_GENLSUBJ SARE_GENLSUBJ_ENG SARE_ADULT SARE_BML TRIPWIRE"
SA_DIR="/var/MailServer/Conf/SA/Local"
...
================================================================
with *EITHER* the 'simple' test message as above:
XXXXXX == /tmp/test.txt
----------------------------------------------
<html><body> TUMS® Smoothies™ </body></html>
----------------------------------------------
or
XXXXXX == /tmp/"Attachment 3280 [details]" (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3280)
NOW,
% cat /tmp/XXXXXX | \
spamassassin -L \
--lint \
--debug \
--siteconfigpath=/var/MailServer/Conf/SA/Dist \
--configpath=/var/MailServer/Conf/SA/Local \
--prefs-file=local.cf \
--nocreate-prefs >& /tmp/openmacnews.spamd.debug
% grep -c "Malformed" /tmp/openmacnews.spamd.debug
0
verified on 3 different boxes, similarly config'd ... !!!??? ok, i'm confused ... have the SARE Rules been updated when no one's looking? richard
repeating with:
cat /tmp/msg.txt.attachment00 | \
spamassassin -L \
--lint \
--debug \
--siteconfigpath=/var/DarkMatter/MailServer/Conf/SA/Dist \
--configpath=/var/DarkMatter//MailServer/Conf/SA/Local \
--prefs-file=local.cf \
--nocreate-prefs >& /tmp/openmacnews.spamd.debug
for cases:
% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.48
% grep -c "Malformed" /tmp/openmacnews.spamd.debug
0
% perl -e 'use HTML::Parser ...
3.47
% grep -c "Malformed ...
0
% perl -e 'use HTML::Parser ...
3.46
% grep -c "Malformed ...
0
% perl -e 'use HTML::Parser ...
3.45
% grep -c "Malformed ...
0
i just looked through the revision logs on svn at SARE and see no recent changes to any obfu tests in any ruleset that appears to fire up these malformed utf-8's. 70_sare_adult.cf, 70_sare_obfu.cf, and 70_sare_specific.cf all have obfu that causes utf-8 warns. once i remove those dozen or so offending rules, i have no issues.
are you sure you have one or more of those files in.. /var/DarkMatter//MailServer/Conf/SA/Local
does spamd debug show it loading? like...
[19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf
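The "does the debug show it loading?" check can be scripted. The sketch below scans debug output for the `dbg: config: read file ...` entries shown above and reports whether a given rules file is among them. `loaded_config_files` and `ruleset_loaded` are hypothetical helper names, and the code assumes paths contain no spaces.

```python
import re

# Matches the "dbg: config: read file PATH" entries in SpamAssassin debug output.
READ_FILE_RE = re.compile(r"dbg: config: read file (\S+)")


def loaded_config_files(lines):
    """Return every config file path the debug output reports reading."""
    files = []
    for line in lines:
        # findall copes with several log entries flattened onto one line.
        files.extend(READ_FILE_RE.findall(line))
    return files


def ruleset_loaded(lines, basename):
    """True if a file with this base name was read, whatever its directory."""
    return any(
        path.rsplit("/", 1)[-1] == basename for path in loaded_config_files(lines)
    )


if __name__ == "__main__":
    log = [
        "[19475] dbg: config: read file /etc/mail/spamassassin/init.pre",
        "[19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf",
    ]
    print(ruleset_loaded(log, "70_sare_obfu.cf"))
```

Matching on the base name keeps the check independent of where a given install keeps its site rules.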
I trimmed out the duplicate utf-8 warns in the debug below to leave 1 warn for every unique rule in 70_sare_obfu.cf that triggers the warn. I wasn't running 70_sare_adult.cf or 70_sare_specific during this test, so the obfu rules in there that trigger are not present in this debug.
# echo $LANG
en_US
# set | grep LC_
#
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION . "\n"';
3.46
# svn info /tmp/spamassassin-trunk/
Path: /tmp/spamassassin-trunk
URL: http://svn.apache.org/repos/asf/spamassassin/trunk
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 356857
# spamassassin -V
SpamAssassin version 3.2.0-r356425 running on Perl version 5.8.6
# ls -la /etc/mail/spamassassin/
-rw-r--r-- 1 root root 158513 Oct 1 15:00 70_sare_obfu.cf
-rw-r--r-- 1 root root 890 Sep 15 13:23 init.pre
-rw-r--r-- 1 root root 1208 Sep 15 13:23 local.cf
-rw-r--r-- 1 root root 2397 Sep 15 13:23 v310.pre
# ls -la /usr/share/spamassassin/
-rw-r--r-- 1 root root 5495 Dec 14 14:02 10_default_prefs.cf
-rw-r--r-- 1 root root 14312 Dec 14 14:02 20_dnsbl_tests.cf
-rw-r--r-- 1 root root 17642 Dec 14 14:02 20_html_tests.cf
-rw-r--r-- 1 root root 2164 Dec 14 14:02 20_net_tests.cf
-rw-r--r-- 1 root root 2334 Dec 14 14:02 23_bayes.cf
-rw-r--r-- 1 root root 420 Dec 14 14:02 25_accessdb.cf
-rw-r--r-- 1 root root 1345 Dec 14 14:02 25_antivirus.cf
-rw-r--r-- 1 root root 190 Dec 14 14:02 25_dcc.cf
-rw-r--r-- 1 root root 1947 Dec 14 14:02 25_domainkeys.cf
-rw-r--r-- 1 root root 2738 Dec 14 14:02 25_hashcash.cf
-rw-r--r-- 1 root root 189 Dec 14 14:02 25_pyzor.cf
-rw-r--r-- 1 root root 2201 Dec 14 14:02 25_razor2.cf
-rw-r--r-- 1 root root 2873 Dec 14 14:02 25_spf.cf
-rw-r--r-- 1 root root 352 Dec 14 14:02 25_textcat.cf
-rw-r--r-- 1 root root 6544 Dec 14 14:02 25_uribl.cf
-rw-r--r-- 1 root root 1116 Dec 14 14:02 60_awl.cf
-rw-r--r-- 1 root root 4906 Dec 14 14:02 60_whitelist.cf
-rw-r--r-- 1 root root 1726 Dec 14 14:02 60_whitelist_subject.cf
-rw-r--r-- 1 root root 101479 Dec 14 14:02 languages
-rw-r--r-- 1 root root 18944 Dec 14 14:02 triplets.txt -rw-r--r-- 1 root root 1869 Dec 14 14:02 user_prefs.template # cat /root/test | spamc X-Spam-Checker-Version: SpamAssassin 3.2.0-r356425 (2005-12-12) on asset.nmgi.com X-Spam-Level: **** X-Spam-Status: No, score=4.0 required=5.0 tests=HTML_60_70,HTML_MESSAGE, HTML_MISSING_CTYPE,HTML_SHORT_LENGTH autolearn=no version=3.2.0-r356425 Content-Type: text/html; charset=us-ascii <html><body> TUMS® Smoothies™ </body></html> # spamd -D -L > spamd.out 2>&1 ^C # cat spamd.out [19475] dbg: logger: adding facilities: all [19475] dbg: logger: logging level is DBG [19475] dbg: logger: trying to connect to syslog/unix... [19475] dbg: logger: opening syslog with unix socket [19475] dbg: logger: successfully connected to syslog/unix [19475] dbg: logger: successfully added syslog method [19475] dbg: spamd: creating INET socket: [19475] dbg: spamd: Listen: 128 [19475] dbg: spamd: LocalAddr: 127.0.0.1 [19475] dbg: spamd: LocalPort: 783 [19475] dbg: spamd: Proto: 6 [19475] dbg: spamd: ReuseAddr: 1 [19475] dbg: spamd: Type: 1 [19475] dbg: logger: adding facilities: all [19475] dbg: logger: logging level is DBG [19475] dbg: generic: SpamAssassin version 3.2.0-r356425 [19475] dbg: config: score set 0 chosen. [19475] dbg: dns: no ipv6 [19475] dbg: dns: is Net::DNS::Resolver available? 
yes [19475] dbg: dns: Net::DNS version: 0.49 [19475] dbg: dns: name server: 172.17.1.10, LocalAddr: 0.0.0.0 [19475] dbg: spamd: Preloading modules with HOME=/tmp/spamd-19475-init [19475] dbg: ignore: test message to precompile patterns and load modules [19475] dbg: config: using "/etc/mail/spamassassin" for site rules pre files [19475] dbg: config: read file /etc/mail/spamassassin/init.pre [19475] dbg: config: read file /etc/mail/spamassassin/v310.pre [19475] dbg: config: using "/usr/share/spamassassin" for sys rules pre files [19475] dbg: config: using "/usr/share/spamassassin" for default rules dir [19475] dbg: config: read file /usr/share/spamassassin/10_default_prefs.cf [19475] dbg: config: read file /usr/share/spamassassin/20_dnsbl_tests.cf [19475] dbg: config: read file /usr/share/spamassassin/20_html_tests.cf [19475] dbg: config: read file /usr/share/spamassassin/20_net_tests.cf [19475] dbg: config: read file /usr/share/spamassassin/23_bayes.cf [19475] dbg: config: read file /usr/share/spamassassin/25_accessdb.cf [19475] dbg: config: read file /usr/share/spamassassin/25_antivirus.cf [19475] dbg: config: read file /usr/share/spamassassin/25_dcc.cf [19475] dbg: config: read file /usr/share/spamassassin/25_domainkeys.cf [19475] dbg: config: read file /usr/share/spamassassin/25_hashcash.cf [19475] dbg: config: read file /usr/share/spamassassin/25_pyzor.cf [19475] dbg: config: read file /usr/share/spamassassin/25_razor2.cf [19475] dbg: config: read file /usr/share/spamassassin/25_spf.cf [19475] dbg: config: read file /usr/share/spamassassin/25_textcat.cf [19475] dbg: config: read file /usr/share/spamassassin/25_uribl.cf [19475] dbg: config: read file /usr/share/spamassassin/60_awl.cf [19475] dbg: config: read file /usr/share/spamassassin/60_whitelist.cf [19475] dbg: config: read file /usr/share/spamassassin/60_whitelist_subject.cf [19475] dbg: config: using "/etc/mail/spamassassin" for site rules dir [19475] dbg: config: read file 
/etc/mail/spamassassin/70_sare_obfu.cf [19475] dbg: config: read file /etc/mail/spamassassin/local.cf [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::URIDNSBL from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::URIDNSBL=HASH (0x8d3a2c0) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::Hashcash from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::Hashcash=HASH (0x8d28048) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::SPF from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::SPF=HASH(0x8d69c30) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::Pyzor from @INC [19475] dbg: pyzor: local tests only, disabling Pyzor [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::Pyzor=HASH (0x8e5e4f4) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::SpamCop from @INC [19475] dbg: reporter: local tests only, disabling SpamCop [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::SpamCop=HASH (0x8ed6dd0) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::AWL from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::AWL=HASH(0x8efd740) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::AutoLearnThreshold from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH(0x8f08558) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::WhiteListSubject from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::MIMEHeader from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::MIMEHeader=HASH (0x8f1e3fc) [19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::ReplaceTags from @INC [19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::ReplaceTags=HASH (0x8f2e1b0) [19475] dbg: plugin: Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0x8f2e1b0) implements 'finish_parsing_end' [19475] dbg: replacetags: replacing tags 
[19475] dbg: replacetags: done replacing tags [19475] dbg: bayes: no dbs present, cannot tie DB R/O: /tmp/spamd-19475- init/.spamassassin/bayes_toks [19475] dbg: config: score set 0 chosen. [19475] dbg: message: ---- MIME PARSER START ---- [19475] dbg: message: main message type: text/plain [19475] dbg: message: parsing normal part [19475] dbg: message: added part, type: text/plain [19475] dbg: message: ---- MIME PARSER END ---- [19475] dbg: bayes: no dbs present, cannot tie DB R/O: /tmp/spamd-19475- init/.spamassassin/bayes_toks [19475] dbg: dns: is DNS available? 0 [19475] dbg: metadata: X-Spam-Relays-Trusted: [19475] dbg: metadata: X-Spam-Relays-Untrusted: [19475] dbg: message: no encoding detected [19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) implements 'parsed_metadata' [19475] dbg: rules: local tests only, ignoring RBL eval [19475] dbg: check: running tests for priority: 0 [19475] dbg: rules: running header regexp tests; score so far=0 [19475] dbg: plugin: registering glue method for check_hashcash_value (Mail::SpamAssassin::Plugin::Hashcash=HASH(0x8d28048)) [19475] dbg: plugin: registering glue method for check_hashcash_double_spend (Mail::SpamAssassin::Plugin::Hashcash=HASH(0x8d28048)) [19475] dbg: eval: all '*From' addrs: ignore@compiling.spamassassin.taint.org [19475] dbg: eval: all '*To' addrs: [19475] dbg: plugin: registering glue method for check_subject_in_blacklist (Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80)) [19475] dbg: plugin: registering glue method for check_subject_in_whitelist (Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80)) [19475] dbg: rules: running body-text per-line regexp tests; score so far=0 [19475] dbg: uri: running uri tests; score so far=0 [19475] dbg: rules: running raw-body-text per-line regexp tests; score so far=0 [19475] dbg: rules: running full-text regexp tests; score so far=0 [19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) implements 
'check_tick' [19475] dbg: check: running tests for priority: 500 [19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) implements 'check_post_dnsbl' [19475] dbg: rules: running meta tests; score so far=0 [19475] dbg: rules: running header regexp tests; score so far=0 [19475] dbg: rules: running body-text per-line regexp tests; score so far=0 [19475] dbg: uri: running uri tests; score so far=0 [19475] dbg: rules: running raw-body-text per-line regexp tests; score so far=0 [19475] dbg: rules: running full-text regexp tests; score so far=0 [19475] dbg: check: running tests for priority: 1000 [19475] dbg: rules: running meta tests; score so far=0 [19475] dbg: rules: running header regexp tests; score so far=0 [19475] dbg: plugin: registering glue method for check_from_in_auto_whitelist (Mail::SpamAssassin::Plugin::AWL=HASH(0x8efd740)) [19475] dbg: locker: safe_lock: created /tmp/spamd-19475- init/.spamassassin/auto-whitelist.lock.asset.nmgi.com.19475 [19475] dbg: locker: safe_lock: trying to get lock on /tmp/spamd-19475- init/.spamassassin/auto-whitelist with 0 retries [19475] dbg: locker: safe_lock: link to /tmp/spamd-19475- init/.spamassassin/auto-whitelist.lock: link ok [19475] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in /tmp/spamd-19475-init/.spamassassin/auto-whitelist [19475] dbg: auto-whitelist: db-based ignore@compiling.spamassassin.taint.org|ip=none scores 0/0 [19475] dbg: auto-whitelist: AWL active, pre-score: 0, autolearn score: 0, mean: undef, IP: undef [19475] dbg: auto-whitelist: DB addr list: untie-ing and unlocking [19475] dbg: auto-whitelist: DB addr list: file locked, breaking lock [19475] dbg: locker: safe_unlock: unlink /tmp/spamd-19475- init/.spamassassin/auto-whitelist.lock [19475] dbg: auto-whitelist: post auto-whitelist score: 0 [19475] dbg: rules: running body-text per-line regexp tests; score so far=0 [19475] dbg: uri: running uri tests; score so far=0 [19475] dbg: rules: running raw-body-text per-line 
regexp tests; score so far=0 [19475] dbg: rules: running full-text regexp tests; score so far=0 [19475] dbg: check: is spam? score=0 required=5 [19475] dbg: check: tests= [19475] dbg: check: subtests= [19475] dbg: config: copying current conf to backup [19475] info: spamd: server started on port 783/tcp (running version 3.2.0- r356425) [19475] info: spamd: server pid: 19475 [19475] info: spamd: server successfully spawned child process, pid 19478 [19475] dbg: prefork: child 19478: entering state 0 [19478] dbg: prefork: sysread(8) not ready, wait max 300 secs [19475] dbg: prefork: new lowest idle kid: none [19479] dbg: prefork: sysread(9) not ready, wait max 300 secs [19475] info: spamd: server successfully spawned child process, pid 19479 [19475] dbg: prefork: child 19479: entering state 0 [19475] dbg: prefork: new lowest idle kid: none [19475] dbg: prefork: child 19478: entering state 1 [19475] dbg: prefork: new lowest idle kid: 19478 [19475] dbg: prefork: child reports idle [19475] dbg: prefork: child 19479: entering state 1 [19475] dbg: prefork: new lowest idle kid: 19478 [19475] dbg: prefork: child reports idle [19475] info: prefork: child states: II [19475] dbg: prefork: ordered 19478 to accept [19475] dbg: prefork: sysread(7) not ready, wait max 300 secs [19478] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 34629 [19475] dbg: prefork: child 19478: entering state 2 [19475] dbg: prefork: new lowest idle kid: 19479 [19478] info: spamd: setuid to root succeeded [19478] dbg: info: user has changed [19478] dbg: bayes: no dbs present, cannot tie DB R/O: /root/.spamassassin/bayes_toks [19478] dbg: config: score set 0 chosen. [19478] warn: spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody at /usr/bin/spamd line 1152, <GEN5> line 4. 
[19478] info: spamd: processing message (unknown) for root:99
[19478] dbg: dns: name server: 172.17.1.10, LocalAddr: 0.0.0.0
[19478] dbg: bayes: no dbs present, cannot tie DB R/O: /root/.spamassassin/bayes_toks
[19478] dbg: metadata: X-Spam-Relays-Trusted:
[19478] dbg: metadata: X-Spam-Relays-Untrusted:
[19478] dbg: message: ---- MIME PARSER START ----
[19478] dbg: message: main message type: text/html
[19478] dbg: message: parsing normal part
[19478] dbg: message: added part, type: text/html
[19478] dbg: message: ---- MIME PARSER END ----
[19478] dbg: message: no encoding detected
[19478] dbg: rules: local tests only, ignoring RBL eval
[19478] dbg: check: running tests for priority: 0
[19478] dbg: rules: running header regexp tests; score so far=0
[19478] dbg: eval: all '*From' addrs:
[19478] dbg: eval: all '*To' addrs:
[19478] dbg: rules: running body-text per-line regexp tests; score so far=0
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xcf) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_PRICE1, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd1) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_PRESCR_SPL1, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd5) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_SOFT2, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xce) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_VICODIN, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd0) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_CIALIS2, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd1) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_PRESCRIP, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xce) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_VISIT1, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd2) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_XANAX, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc4) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_GUARANTEE, line 1, <GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xd0) in pattern match (m//) at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_MEDS2, line 1, <GEN5> line 10.
[19478] dbg: uri: running uri tests; score so far=0 [19478] dbg: rules: ran eval rule HTML_SHORT_LENGTH ======> got hit (1) [19478] dbg: rules: ran eval rule __HTML_LENGTH_512 ======> got hit (1) [19478] dbg: rules: ran eval rule HTML_60_70 ======> got hit (1) [19478] dbg: bayes: no dbs present, cannot tie DB R/O: /root/.spamassassin/bayes_toks [19478] dbg: bayes: not scoring message, returning undef [19478] dbg: bayes: opportunistic call attempt failed, DB not readable [19478] dbg: rules: ran eval rule HTML_MESSAGE ======> got hit (1) [19478] dbg: rules: ran eval rule __HTML_LENGTH_384 ======> got hit (1) [19478] dbg: rules: ran eval rule __HTML_LENGTH_0000_1024 ======> got hit (1) [19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=3 [19478] dbg: rules: running full-text regexp tests; score so far=3 [19478] dbg: check: running tests for priority: 500 [19478] dbg: rules: running meta tests; score so far=3 [19478] dbg: rules: running header regexp tests; score so far=4 [19478] dbg: rules: running body-text per-line regexp tests; score so far=4 [19478] dbg: uri: running uri tests; score so far=4 [19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=4 [19478] dbg: rules: running full-text regexp tests; score so far=4 [19478] dbg: check: running tests for priority: 1000 [19478] dbg: rules: running meta tests; score so far=4 [19478] dbg: rules: running header regexp tests; score so far=4 [19478] dbg: rules: running body-text per-line regexp tests; score so far=4 [19478] dbg: uri: running uri tests; score so far=4 [19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=4 [19478] dbg: rules: running full-text regexp tests; score so far=4 [19478] dbg: plugin: Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH (0x8f08558) implements 'autolearn_discriminator' [19478] dbg: learn: auto-learn: currently using scoreset 0 [19478] dbg: learn: auto-learn: message score: 4, computed score for autolearn: 4 
[19478] dbg: learn: auto-learn? ham=0.1, spam=12, body-points=3, head-points=0, learned-points=0
[19478] dbg: learn: auto-learn? no: inside auto-learn thresholds, not considered ham or spam
[19478] dbg: check: is spam? score=4 required=5
[19478] dbg: check: tests=HTML_60_70,HTML_MESSAGE,HTML_MISSING_CTYPE,HTML_SHORT_LENGTH
[19478] dbg: check: subtests=__HTML_LENGTH_0000_1024,__HTML_LENGTH_384,__HTML_LENGTH_512
[19478] info: spamd: clean message (4.0/5.0) for root:99 in 0.0 seconds, 99 bytes.
[19478] info: spamd: result: . 4 - HTML_60_70,HTML_MESSAGE,HTML_MISSING_CTYPE,HTML_SHORT_LENGTH scantime=0.0,size=99,user=root,uid=99,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=34629,mid=(unknown),autolearn=no
[19478] dbg: config: copying current conf from backup
[19475] dbg: prefork: child 19478: entering state 1
[19475] dbg: prefork: new lowest idle kid: 19478
[19475] dbg: prefork: child reports idle
[19475] info: prefork: child states: II
[19478] dbg: prefork: sysread(8) not ready, wait max 300 secs
[19475] info: spamd: server killed by SIGINT, shutting down
just reproduced the "0" errors on another box ... <grumble>

> are you sure you have one or more of those files in.. /var/MailServer/Conf/SA/Local

yup.

% pwd
/var/MailServer/Conf/SA/Local
% ls *.cf
70_sare_adult.cf             70_sare_header.cf      70_sare_specific.cf     72_sare_redirect_post3.0.0.cf
70_sare_bayes_poison_nxm.cf  70_sare_header_eng.cf  70_sare_spoof.cf        99_sare_fraud_post25x.cf
70_sare_evilnum0.cf          70_sare_html.cf        70_sare_unsub.cf        bogus-virus-warnings.cf
70_sare_evilnum1.cf          70_sare_obfu.cf        70_sare_uri_eng.cf      local.cf
70_sare_genlsubj.cf          70_sare_oem.cf         70_sc_top200.cf         tripwire.cf
70_sare_genlsubj_eng.cf      70_sare_random.cf      72_sare_bml_post25x.cf

> does spamd debug show it loading? like...
> [19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf

yup again.

% grep _sare_ /tmp/openmacnews.spamd.debug
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_adult.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_bayes_poison_nxm.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_evilnum0.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_evilnum1.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_genlsubj.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_genlsubj_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_header.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_header_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_html.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_obfu.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_oem.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_random.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_specific.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_spoof.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_unsub.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_uri_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/72_sare_bml_post25x.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/72_sare_redirect_post3.0.0.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/99_sare_fraud_post25x.cf
dallas, squinting at differences, i note in your 'cat',

[19475] dbg: dns: Net::DNS version: 0.49

try:

% perl -e 'use Net::DNS; print Net::DNS->VERSION';
0.54

? other than that, not much ...
can you verify your spamassassin -V again? i didn't see it in comment 38 or comment 39.

# perl -e 'use Net::DNS; print Net::DNS->VERSION . "\n"';
0.55
# cat /root/test | spamassassin 2>&1 | grep -c Malformed
351
% which spamassassin
/usr/local/spamassassin/bin/spamassassin
% ls -al /usr/local/spamassassin/bin/spamassassin
-r-xr-xr-x 1 root wheel 24979 Dec 8 09:38 /usr/local/spamassassin/bin/spamassassin
% spamassassin -V
SpamAssassin version 3.2.0-r322462 running on Perl version 5.8.7
could be the HTML::Parser dependencies, esp HTML::Entities.

: jm 285...; perl -e 'use HTML::Entities; print $HTML::Entities::VERSION,"\n"'
1.32
: exit=0 Wed Dec 14 14:01:20 PST 2005; cd /home/jm/ftp/spamassassin
: jm 286...; perl -e 'use HTML::Parser; print $HTML::Parser::VERSION,"\n"'
3.48
: exit=0 Wed Dec 14 14:01:28 PST 2005; cd /home/jm/ftp/spamassassin
: jm 287...; perl -e 'use HTML::Tagset; print $HTML::Tagset::VERSION,"\n"'
3.04
# perl -e 'use HTML::Parser; print $HTML::Parser::VERSION,"\n"'
3.46
# perl -e 'use HTML::Tagset; print $HTML::Tagset::VERSION,"\n"'
3.04
# perl -e 'use HTML::Entities; print $HTML::Entities::VERSION,"\n"'
1.32
ok, got it reproduced, and it's either a perl or HTML::Parser bug. I've narrowed it down to a standalone perl script. More details at: https://rt.cpan.org/Ticket/Display.html?id=16495
wtf -- rt.cpan.org doesn't seem to be recording my comments on that bug! very annoying. here's the demo script. http://taint.org/xfer/2005/demo_utf8_bug.pl
here,

HTML::Entities -> 1.32
HTML::Parser -> 3.48
HTML::Tagset -> 3.10
yup. it's back. here's my output from exec'ing http://taint.org/xfer/2005/demo_utf8_bug.pl: http://paste.lisp.org/display/14647
Working on filing a bug against Perl.
Filed http://rt.perl.org/rt3/Ticket/Display.html?id=37950 (May need to enter userid guest password guest)
A workaround is to wrap (?-i: around the non-ASCII characters in the SARE patterns:

/(?-i:\xC4)|a/i;

instead of:

/\xC4|a/i;
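To make the scoping concrete, here is a minimal sketch of the two forms side by side. The sample inputs are made up for illustration and are plain byte strings, so on a fixed perl neither form warns; the point is only to show which alternatives stay case-insensitive once (?-i:...) is applied.

```perl
use strict;
use warnings;

# Original form: the /i flag covers the high-bit \xC4 alternative too.
my $plain = qr/\xC4|a/i;

# Workaround form: (?-i:...) switches case-insensitivity off for \xC4 only,
# while the ASCII alternative 'a' stays case-insensitive.
my $scoped = qr/(?-i:\xC4)|a/i;

# Hypothetical sample inputs (byte strings, not taken from the bug report).
my @samples = ("a", "A", "\xC4", "b");

for my $s (@samples) {
    my $p = $s =~ $plain  ? 1 : 0;
    my $q = $s =~ $scoped ? 1 : 0;
    printf "%-6s plain=%d scoped=%d\n", sprintf('0x%02X', ord $s), $p, $q;
}
```

The scoped form still matches "a", "A" and the literal \xC4 byte; what it gives up is any case folding of \xC4 itself.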
yup, lookin good here

# perl demo_utf8_bug.pl 2>&1 | grep -c Malformed
64
> A workaround is to wrap (?-i: around the non-ASCII characters in the SARE patterns:

is there any reason NOT to? if it works, it's independent of a perl fix ... yes?
(In reply to comment #56)
> is there any reason NOT to?

Depends on whether or not each rule really wants case independence for those characters.

> if it works, it's independent of a perl fix ... yes?

Yes, it would be independent of a perl fix. If the rule really didn't want case independence, it could even prevent theoretical FPs.
Here is another workaround that seems to work in the demo script at http://taint.org/xfer/2005/demo_utf8_bug.pl and doesn't require changing all the rule patterns. Insert

use encoding 'utf8';

at the beginning of the block that defines the patterns that contain the \xC4 and similar bytes, i.e.,

sub run_regexp {
use encoding 'utf8';

I haven't tried adding that to whatever processes the rule regexps. Can someone who knows where that would go put it in and see if it fixes the problem?

Alternatively,

utf8::encode($text);

on the output from HTML::Parser solves the problem from the other direction. That is, the first changes the pattern and the second changes the parsed text string; either one gets them to match each other. I'm just more wary about changing the parsed text string because that seems to me more likely to have side effects.

I'm open to comments from someone who better understands the internals of Perl strings and UTF-8 and other unicode encodings.
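For anyone unsure what utf8::encode actually does to the parsed text, here is a small sketch. The string is a hypothetical stand-in for HTML::Parser output (the two decoded entities from the narrowed-down sample), not the real parser result, and the variable names are mine.

```perl
use strict;
use warnings;

# Hypothetical decoded text, standing in for HTML::Parser output:
# U+00AE (registered sign) and U+2122 (trade mark sign) as characters.
my $text = "TUMS\x{AE} Smoothies\x{2122}";

my $chars    = length $text;                  # counted in characters
my $was_utf8 = utf8::is_utf8($text) ? 1 : 0;  # internal UTF-8 flag is set

# Convert the character string in place to its UTF-8 byte sequence.
utf8::encode($text);

my $bytes   = length $text;                   # now counted in bytes
my $is_utf8 = utf8::is_utf8($text) ? 1 : 0;   # flag is cleared

print "chars=$chars bytes=$bytes flag_before=$was_utf8 flag_after=$is_utf8\n";
```

After the call the registered sign is the two bytes \xC2\xAE rather than a single \xAE character, so a rule written against the single Latin-1 byte would no longer see it the same way. That is presumably the kind of side effect to be wary of.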
the pattern doesn't need to be marked as UTF-8, since it's in itself composed of pure US-ASCII character set; this really is a perl bug.
Created attachment 3304 [details] Here is a short perl script demonstrating the problem without HTML::Parser If this is really a perl bug, then is this script, which reproduces the problem without requiring HTML::Parser, supposed to work? I'm a bit confused about how perl deals with setting the utf-8 flag or not in these strings. If this script does demonstrate the bug, it might be a better example for you to submit than the one that relies on HTML::Parser.
and this script, w/o HTML::Parser, shows the error, here: http://paste.lisp.org/display/14669
It was just pointed out to me that the perl bug report that John submitted, linked to in comment #53, contains a three line demonstration of the perl bug that doesn't involve HTML::Parser, a lot more clear than the script I attached here.
http://rt.perl.org/rt3/Ticket/Display.html?id=37950

From: Gisle Aas <gisle@ActiveState.com>

"Already fixed in blead and perl-5.8.8-tobe. http://public.activestate.com/cgi-bin/perlbrowse?patch=25095"
i just rebuilt the srpm for FC4 with http://public.activestate.com/cgi-bin/perlbrowse?patch=25095

# rpm -Uvh perl-5.8.6-22.i386.rpm
Preparing... ########################################### [100%]
1:perl ########################################### [100%]
[root@asset /]# perl demo_utf8_bug.pl
Wide character in print at demo_utf8_bug.pl line 37.
[ â ¢]
# cat /root/test | spamassassin 2>&1 | grep -c Malformed
0

no more Malformed UTF-8 warns... :)
at least for me, that patch doesn't apply against a fresh CO of perl 'stable' (v587) ... this attach openmacnews.patch.SA_PERL.3787.txt does. now cooking a new perl build to see if it behaves ...
Created attachment 3305 [details] simple tweak of "http://rt.perl.org/rt3/Ticket/Display.html?id=37950" to apply/try to perl v587
and since i was applying against 5.8.5 (fc3) and 5.8.6 (fc4), this is what i used.... if anyone feels like rebuilding their srpms

# cat perl-5.8.6-utf8.patch
--- utf8.c.orig 2005-12-15 08:06:59.000000000 -0600
+++ utf8.c 2005-12-15 08:06:32.000000000 -0600
@@ -1976,7 +1976,7 @@
     if (u1)
         to_utf8_fold(p1, foldbuf1, &foldlen1);
     else {
-        natbuf[0] = *p1;
+        uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
         to_utf8_fold(natbuf, foldbuf1, &foldlen1);
     }
     q1 = foldbuf1;
@@ -1986,7 +1986,7 @@
     if (u2)
         to_utf8_fold(p2, foldbuf2, &foldlen2);
     else {
-        natbuf[0] = *p2;
+        uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
         to_utf8_fold(natbuf, foldbuf2, &foldlen2);
     }
     q2 = foldbuf2;
looks good! with a fresh-from-src perl v587, patched with attachment id=3305 from comment #66 above, none of the test cases trigger the "Malformed ..." errors anymore. loading up all previously-suspect SARE rules on my production system, and now to watch for awhile ... richard
The bug does not appear to trigger when the Latin-1 character is surrounded by the [ ] regex operator. It is triggered when the Latin-1 character is followed by the | regex operator and case insensitivity is in effect. This will affect KOREAN_UCE_SUBJECT when charset normalization of headers is implemented.
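As a rough aid for spotting candidate rules, something like the following could be run over rule regexps. This is only a heuristic sketch: needs_workaround is a made-up helper, not part of SpamAssassin, and it flags any high-bit byte or \x80-\xFF escape outside a [...] class under /i, which is broader than the "followed by |" case described above. It will also flag patterns that already use (?-i:...).

```perl
use strict;
use warnings;

# Heuristic only: returns 1 when a rule regexp (written as /.../flags)
# is case-insensitive and has a high-bit byte, or a \x80-\xFF escape,
# outside of any [...] character class.
sub needs_workaround {
    my ($rule) = @_;
    my ($body, $flags) = $rule =~ m{^/(.*)/([a-z]*)\z}s or return 0;
    return 0 unless $flags =~ /i/;
    # Drop character classes, honouring backslash escapes inside them.
    $body =~ s/\[(?:\\.|[^\]\\])*\]//gs;
    return $body =~ /\\x[89A-Fa-f][0-9A-Fa-f]|[\x80-\xFF]/ ? 1 : 0;
}

# Hypothetical rule patterns for illustration, not real SARE rules.
for my $rule ('/\xC4|a/i', '/[\xC4a]/i', '/\xC4|a/', '/v\x69sit/i') {
    printf "%-14s %s\n", $rule, needs_workaround($rule) ? 'check' : 'ok';
}
```

Anything it reports still needs a human look, since whether a rule really wants case independence for those characters is a per-rule decision.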
'The bug does not appear to trigger when the Latin-1 character is surrounded by the [ ] regex operator. It is triggered when the Latin-1 character is followed by the | regex operator and case insensitivity is in effect.' turning off case-i seems to be an easier fix, btw.
The point of my previous comment is to help identify which rules need workarounds.
personally, after the patch to perl, i haven't a trace of this prob anymore. others? this BUG still have legs, or status change in order?
We can hardly require Perl 5.8.8 as a minimum version. I'm not sure what else we can do besides maintain the disable-case-insensitivity workarounds in the rules.
I don't think it warrants making perl 5.8.8 a requirement for SA. if someone runs into this warning, they can:

1. upgrade perl (last resort imo)
2. fix the rules to work around it (more likely)

I'm marking this WORKSFORME, since it's not an SA bug per se, and those fixes/workarounds are applicable. if we run into rules that produce this warning in the distributed rulesets, we can fix those as they arise, now that we know how to.
can anyone yet verify that this *has* been resolved in perl-5.8.8-release? per notes below it was "already fixed" in "5.8.8 to be", but i have not, personally, checked as the 587 patch *is* working ...
*** Bug 4665 has been marked as a duplicate of this bug. ***
for those that give a hoot, i'll answer my own question ... at least on OSX 10.4.4, looks like with perl588-release, all's well with SA head (r376926), no patch required ...
*** Bug 5440 has been marked as a duplicate of this bug. ***
*** Bug 5437 has been marked as a duplicate of this bug. ***
ok, it's been a long time since this bug was opened, so I'll give a quick summary for the new arrivals coming from bugs 5440 and 5437.

There's a perl bug in dealing with matching ISO-8859-1 patterns against UTF-8 strings: http://rt.perl.org/rt3/Public/Bug/Display.html?id=37950 . Here are the options:

- Apparently this bug is fixed in perl 5.8.8, so you could upgrade to that.
- Alternatively you could rebuild your current perl from source using the patch here, or at that rt.perl.org bug.
- Alternatively, you could fix the rules to avoid the bug: see comments 54, 69 and 70. (The easiest way is to remove the /i at the end of the rule and fix them to use /[iI]/ instead of just /i/ inside the patterns.)

SARE guys -- any chance the rules from comment 10 could be fixed in the distributed SARE rulesets to include the workaround? This is going to be a major FAQ once 3.2.0 is released, since perl 5.8.8 is still not that common.

I'll attach a demo of one SARE rule fixed:

spamassassin -Lt -p rule.cf < badmsg 2>&1 | grep 'Malformed UTF-8 character' | wc -l
872
spamassassin -Lt -p rule_fixed.cf < badmsg 2>&1 | grep 'Malformed UTF-8 character' | wc -l
0
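For the third option, the rewrite looks roughly like this. The pattern below is a hypothetical stand-in, not the real __SARE_OBFU_VISIT1 rule (see the attachments for that); it only illustrates dropping the trailing /i and spelling out both cases for the ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical rule body, not the real __SARE_OBFU_VISIT1 pattern.
# Before: /i applies to the high-bit alternative \xEC as well.
my $before = qr/v(?:\xEC|i)sit/i;

# After: no /i flag; each ASCII letter gets an explicit two-case class,
# and the high-bit byte \xEC is matched literally.
my $after = qr/[vV](?:\xEC|[iI])[sS][iI][tT]/;

# Made-up sample inputs (byte strings).
my @samples = ("visit", "VISIT", "V\xECSiT", "vi sit");

for my $s (@samples) {
    # Show non-printable bytes as \xNN so the output is readable.
    (my $shown = $s) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    printf "%-10s before=%d after=%d\n",
        $shown, ($s =~ $before ? 1 : 0), ($s =~ $after ? 1 : 0);
}
```

The rewritten form no longer treats the high-bit byte case-insensitively, so each rule should be checked for whether that matters before converting it.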
Created attachment 3926 [details] badmsg (sample message to trigger the bug)
Created attachment 3927 [details] __SARE_OBFU_VISIT1 as it's currently distributed
Created attachment 3928 [details] a fixed __SARE_OBFU_VISIT1
Upgrading Perl to v5.8.8 worked for me (SpamAssassin v3.2.0-rc3 on Solaris 9). I was trying to avoid that because Sunfreeware doesn't have a prebuilt 5.8.8, so I had to build it from source.
I'll see what I can do about fixing the obfu rules and getting Doc to update the file on the site. :-) Justin, do you know if the (?i:words) syntax (or whatever it is exactly) is also broken so that I have to use character classes in all cases?
(In reply to comment #85)
> Justin, do you know if the (?i:words) syntax (or whatever it is exactly) is
> also broken so that I have to use character classes in all cases?

Yep, any case where a part of the pattern is case-insensitive, and matching ISO-8859-1 characters, is broken. Case-sensitive bits are fine though.
*** Bug 5459 has been marked as a duplicate of this bug. ***
I have just committed fixed rule sets for SARE that cause this bug. Please let me know if any other UTF-8 issues are found that deal with SARE rules.
(In reply to comment #74) A hint for fellow Windows users trying to make SA work under Windows: Don't upgrade to perl 5.8.8 if you ever intend to use SA under Windows NT (broken command line parse). Later Windows versions seem to be OK, though.