Bug 3787

Summary: HTML email 'malformed UTF-8 character' warnings from SA
Product: Spamassassin Reporter: Loren Wilton <lwilton>
Component: spamassassin Assignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED WORKSFORME    
Severity: normal CC: jan_janousek, mailme, mchun.li, rmiller, rosenbaumlm
Priority: P3    
Version: SVN Trunk (Latest Devel Version)   
Target Milestone: 3.2.0   
Hardware: PC   
OS: Linux   
Whiteboard:
Attachments: Spam that causes 'malformed utf-8 character' complaints
Here is a short perl script demonstrating the problem without HTML::Parser
simple tweak of "http://rt.perl.org/rt3/Ticket/Display.html?id=37950" to apply/try to perl v587
badmsg (sample message to trigger the bug)
__SARE_OBFU_VISIT1 as it's currently distributed
a fixed __SARE_OBFU_VISIT1

Description Loren Wilton 2004-09-18 08:51:49 UTC
First, I apologise for not testing it on 3.0, I don't have that installed to 
test with.  Second, I believe I have the languages set up correctly, so this 
shouldn't be related to the UTF-8 problem in some Linux versions.

The attached spam, when run through 'spamassassin -t', and I presume other 
times, produces about 40 of the following sort of messages.  This might still 
be a problem in 3.0.

Malformed UTF-8 character (unexpected non-continuation byte 0x54, immediately 
after start byte 0xed) in transliteration (tr///) 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding 
start byte) in transliteration (tr///) 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding 
start byte) in transliteration (tr///) 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0x86, with no preceding 
start byte) in transliteration (tr///) 
at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/PerMsgStatus.pm line 1293.

The particular character value it complains about in each line varies.
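
For readers decoding the warnings above: the byte values in them can be classified by their two high bits. The following is a minimal sketch only, not Perl's actual validation logic; `classify_byte` is a hypothetical helper name, and the check is simplified (it does not reject bytes that can never occur in valid UTF-8).

```shell
#!/bin/sh
# Classify one byte value (for example 0x86) by its two high bits.
# Simplified illustration, not Perl's own UTF-8 validation.
classify_byte() {
    case $(( $1 & 0xC0 )) in
        128) echo continuation ;;   # bit pattern 10xxxxxx
        192) echo start ;;          # bit pattern 11xxxxxx
        *)   echo ascii ;;          # bit pattern 0xxxxxxx
    esac
}
```

With this, `classify_byte 0xed` reports a start byte and `classify_byte 0x54` reports a plain ASCII byte, which is why 0x54 directly after 0xed is flagged as an "unexpected non-continuation byte". Likewise 0x86 is a continuation byte, which is only valid after a start byte.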
Comment 1 Loren Wilton 2004-09-18 08:54:23 UTC
Created attachment 2352 [details]
Spam that causes 'malformed utf-8 character' complaints
Comment 2 Bob Menschel 2005-04-11 21:03:49 UTC
Loren, I cannot reproduce this with my svn copy of SA.  Can you reproduce on a
current version, or should we close this bug entry? 
Comment 3 Bob Menschel 2005-07-02 21:10:54 UTC
Unable to reproduce under most recent svn. Appears to have been fixed by Bug 4046

*** This bug has been marked as a duplicate of 4046 ***
Comment 4 John Gardiner Myers 2005-11-28 18:50:38 UTC
Not a duplicate.
Comment 5 John Gardiner Myers 2005-11-28 18:51:38 UTC
See attachment 3280 [details] for additional test case.
Comment 6 Dallas Engelken 2005-11-28 19:04:39 UTC
let me narrow this msg.txt.gz sample down for you....

----------------------------------------------
Content-Type: text/html; charset=us-ascii

<html><body>
TUMS&reg; Smoothies&trade;
</body></html>
----------------------------------------------

will produce many of these utf-8 warns.

my guess is the HTML::Parser decodes &reg; and &trade; and then the match
against SARE rules that contain things like  [\*ýýýý] and/or [\x96-\x97] causes
this?
Comment 7 John Gardiner Myers 2005-11-28 19:10:02 UTC
Do any of the SARE rules use \C ?  That would definitely do it.

Reducing to a single SARE rule would be helpful.
Comment 8 Dallas Engelken 2005-11-28 19:21:15 UTC
from the 70_sare_obfu.cf ruleset (because thats the only one i was testing),  
the following rules result in ~22k warns (~2300 per rule).

# grep UTF spamd.debug.txt  | cut -d\  -f 22 | sed -e 's/\,//g' | sort | uniq
__SARE_OBFU_CIALIS2
SARE_OBFU_GUARANTEE
__SARE_OBFU_MEDS2
SARE_OBFU_PRESCRIP
SARE_OBFU_PRESCR_SPL1
__SARE_OBFU_PRICE1
__SARE_OBFU_SOFT2
SARE_OBFU_VICODIN
__SARE_OBFU_VISIT1
SARE_OBFU_XANAX

# grep -c SARE_OBFU_CIALIS2 spamd.debug.txt
2345
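
The `cut -f 22` approach above depends on the rule name always being the 22nd space-separated field. A slightly less position-sensitive sketch is shown below; it assumes each warning sits on a single log line and contains the text ", rule NAME," (as in the spamd debug output). `count_utf8_warns_by_rule` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<rule name> <warning count>" for every rule named in a
# "Malformed UTF-8" warning in the given debug log.
# Assumes one warning per line, containing ", rule NAME,".
count_utf8_warns_by_rule() {
    grep 'Malformed UTF-8' "$1" \
        | sed -n 's/.*, rule \([A-Za-z0-9_]*\),.*/\1/p' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

Usage would be along the lines of `count_utf8_warns_by_rule spamd.debug.txt`, giving the per-rule totals in one pass instead of one `grep -c` per rule.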
Comment 9 Dallas Engelken 2005-11-28 19:22:04 UTC
FWIW, i dont get any utf-8 warns on Loren's sample....
Comment 10 openmacnews 2005-11-28 19:31:50 UTC
(i need a quicker box ... and some better skills with grep!)

with my full RDJ ruleset, i get:

% grep "Malformed UTF-8" spamd.debug.txt  | cut -d\  -f 22 | sed -e 's/\,//g' |
sort | uniq
SARE_OBFUAUCTION
SARE_OBFUFCK1
SARE_OBFUGIRLS
SARE_OBFUGNGBNG
SARE_OBFUHARDCORE
SARE_OBFUMONEY1
SARE_OBFUPENIS
SARE_OBFUPORNO
SARE_OBFUPUSS
SARE_OBFUTEENS
SARE_OBFUVRGN
SARE_OBFU_GUARANTEE
SARE_OBFU_PRESCRIP
SARE_OBFU_PRESCR_SPL1
SARE_OBFU_VICODIN
SARE_OBFU_XANAX
SARE_SPEC_REPL_OBFU1
SARE_SPEC_REPL_OBFU2
SARE_SPEC_REPL_OBFU3
SARE_SPEC_REPL_OBFU4
SARE_SPEC_REPL_OBFU5
SARE_SPEC_REPL_OBFU6
__SARE_OBFU_CIALIS2
__SARE_OBFU_MEDS2
__SARE_OBFU_PRICE1
__SARE_OBFU_SOFT2
__SARE_OBFU_VISIT1

apparently, only *OBFU* rules ...
Comment 11 openmacnews 2005-11-28 19:39:30 UTC
confirm no hits on loren's sample:

% grep -c "Malformed UTF-8" loren.log.txt 
  0
Comment 12 John Gardiner Myers 2005-11-28 19:48:06 UTC
I'm convinced Loren's problem is different than the recent SARE issue.  As
Loren's issue was WORKSFORME, let's just repurpose this bug for the SARE issue.
Comment 13 openmacnews 2005-11-28 19:54:49 UTC
also:

fwiw, for my sit'n,

    Hardware: Macintosh
    OS: Mac OS X (10.3.9 & 10.4.3)

Comment 14 Justin Mason 2005-11-28 22:12:45 UTC
so as I asked in bug 4691, the wrong bug ;), people seeing this issue, please
post the following:

 - the exact warning messages (including logged byte values)
 - perl version, from perl -V
 - whether or not the patch from bug 4691 is in use (just to make sure)

the perl version data in particular is useful.
Comment 15 Dallas Engelken 2005-11-28 22:30:29 UTC
bug 4046 had my sample message and debug output that contained the necessary 
bytes.  as far as i can see, its always "non-continuation byte 0x00".     
attachment 3279 [details] has an abbreviated debug of it.

if you want a full debug of attachment 3280 [details] (5.2MB 23k lines), i'd have to send 
that some other way since bz wont take it.  

i'm doing this on stock SVN, no patches,  Fedora core 3, perl 5.8.5, 
HTML::Parser 3.46.




Comment 16 openmacnews 2005-11-28 22:31:13 UTC
>  - the exact warning messages (including logged byte values)

http://www.mail-archive.com/dev@spamassassin.apache.org/msg11995.html

>  - perl version, from perl -V

as per http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4046,

% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
	3.46
% spamassassin -V
	SpamAssassin version 3.2.0-r322462
	  running on Perl version 5.8.6

> the perl version data in particular is useful.

more here: http://paste.lisp.org/display/14093

i see the same with 5.8.7, as well

>  - whether or not the patch from bug 4691 is in use (just to make sure)

not here
Comment 17 Justin Mason 2005-11-28 22:36:52 UTC
'if you want a full debug of attachment 3280 [details] (5.2MB 23k lines), i'd have
to send that some other way since bz wont take it. '

thanks Dallas, I don't think so.  just a selection of the 'non-continuation
byte' warnings, to see if it's always 0x00 or not. (so far it seems not)
Comment 18 John Gardiner Myers 2005-12-02 00:17:12 UTC
Unable to reproduce with current SVN plus 70_sare_obfu.cf.

./spamassassin -V
SpamAssassin version 3.2.0-r322462
  running on Perl version 5.8.6

Tried with perl 5.8.6, 5.8.7, and 5.8.4

HTML::Parser version 3.45
Comment 19 openmacnews 2005-12-02 00:24:29 UTC
john,

r322462?

i can reproduce at will with (at least) r349275 and later ...

seems Dallas may be able to as well, but I'm sure he can chime in for himself :-)

can you, perhaps, verify your "unable to reproduce" with a more recent version?

richard
Comment 20 John Gardiner Myers 2005-12-02 00:35:35 UTC
It's r351501.
Comment 21 openmacnews 2005-12-02 01:26:54 UTC
agh, my mistake. sorry ... i *still* get occasionally confused by the co'd
revision, and that which spamassassin -v reports :-/
Comment 22 Dallas Engelken 2005-12-02 22:36:06 UTC
tested r351835 on rh73 perl 5.6.1 and dont have any UTF-8 warns there... but 
didnt expect to since UTF-8 stuff only started showing up in rh8 if i remember 
correctly.

tested r351835 on fc3 perl 5.8.5-18 (5.8.5-17 had a UTF-8 fix mentioned in the 
CHANGELOG and I thought that was it but wasnt)...  UTF-8 warns all over.

just now tested on fc4 perl 5.8.6-4, sa r351835, and i get UTF-8 warns here as 
well with 70_sare_obfu.cf running.

i'm sure anyone running fc4 can reproduce this... as this was a brand new 
install of fc4 here.

1) checkout svn
2) perl Makefile.PL, make install
3) copy 70_sare_obfu.cf to /etc/mail/spamassassin
4) in one shell, start spamd like so...  $ spamd -D -L
5) in another shell, create the offending file as shown above in comment #6 and 
save it.
6) $ cat file | spamc
7) watch other window for debug of the scan

Comment 23 John Gardiner Myers 2005-12-03 04:17:36 UTC
Dallas, does the problem reproduce using the spamassassin script directly, or is
it necessary to use spamd?
Comment 24 Dallas Engelken 2005-12-04 01:02:50 UTC
on fc3

# cat test  | spamassassin -D 2>&1 | grep -c "Malformed UTF"
351

on fc4

# cat test  | spamassassin -D 2>&1 | grep -c "Malformed UTF"
858

the file 'test' is identical on both test systems... so go figure :)
Comment 25 openmacnews 2005-12-04 01:18:17 UTC
dallas:  same version of perl for both, or:

   fc3 perl 5.8.5-18
   fc4 perl 5.8.6-4

as above?
Comment 26 Dallas Engelken 2005-12-05 23:13:20 UTC
fc3 is now perl-5.8.5-20
fc4 is now perl-5.8.6-18

utf warns still present after those updates.


Comment 27 openmacnews 2005-12-05 23:22:28 UTC
fyi, HTML::Parser has been upgraded ...

  HTML::Parser  3.48  G/GA/GAAS/HTML-Parser-3.48.tar.gz
  cref: http://search.cpan.org/src/GAAS/HTML-Parser-3.48/Changes
Comment 28 Dallas Engelken 2005-12-05 23:33:24 UTC
ya, i tested that friday.. but it was just a change reverted from 3.47 
according to the changelog if i remember right.
Comment 29 John Gardiner Myers 2005-12-14 00:20:53 UTC
I've been able to reproduce errors with HTML::Parser 3.45 and earlier.  The bug
is fixed in HTML::Parser 3.46

https://rt.cpan.org/Ticket/Display.html?id=15068

I believe we should bump the minimum HTML::Parser version to 3.46
Comment 30 Dallas Engelken 2005-12-14 01:48:00 UTC
as far as i see, 3.46 doesnt fix this.  nor do 3.47 or 3.48.  this was just now 
tested on a FC4 box with perl 5.8.6

# spamd -D -L > spamd.debug 2>&1 
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.46
# cat /root/test  | spamc
# grep -c Malformed spamd.debug
858

# spamd -D -L > spamd.debug 2>&1
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION . "\n"';
3.48
# cat /root/test  | spamc
# grep -c Malformed spamd.debug
858

Comment 31 openmacnews 2005-12-14 10:58:02 UTC
i've been searching thru some of the perl lists re: "malformed UTF-8" error.

scads of hits, actually ... very few cogent explanations/resolutions that i've
found so far, tho :-/

on the exim list, however, i just noted the following from Philip Hazel, (author
of Exim & PCRE):

http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20051212/msg00097.html

which keeps nagging at me in *this* context.

just (pseudo)random thought ...
Comment 32 John Gardiner Myers 2005-12-14 17:14:41 UTC
The citation in comment 31 is not useful.

I suspect the problem might be specific to Fedora Core, since that appears to be
common to all reporters.  I suggest trying to reproduce with a stock perl.
Comment 33 Dallas Engelken 2005-12-14 17:30:21 UTC
correct me if i'm wrong, but i believe richard has the same results on all 
versions of OSX he's tested.
Comment 34 openmacnews 2005-12-14 17:40:10 UTC
dallas,

that's correct. all combos of:

  OSX 10.3.9/10.4.3
  Perl 5.8.5/5.8.6/5.8.7
  HTML::Parser 3.45/3.46/3.48 (didn't test 3.47)

reproduce the error(s) for me.

fwiw, some of the tests were on different boxes, as well ...

richard
Comment 35 Justin Mason 2005-12-14 19:29:59 UTC
given that Dallas has noted that the min-version bump checked into trunk didn't
fix it, can we revert that change?

I'm asking because Ubuntu, at least, is not yet distributing 3.46; I had to hit
CPAN for it.   in my opinion that implies that it's too "bleeding-edge" a
requirement.

Also, I note that that H:P bug does not seem to relate at all to the messages
posted here as test cases -- I see no "A0" bytes in either.

PS: more env details requests.  Could everyone post the output of

    echo $LANG
    set | grep LC_

I wonder if it's something to do with a UTF-8 locale setting.  It'd be
good to discount that possibility.
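
For anyone unsure how to read that output: the character-set locale comes from LC_ALL if set and non-empty, otherwise LC_CTYPE, otherwise LANG. The sketch below applies that precedence; `effective_ctype` and `is_utf8_locale` are hypothetical helper names, and falling back to "C" when nothing is set is an assumption.

```shell
#!/bin/sh
# Report the locale name that governs character handling,
# following the usual precedence LC_ALL > LC_CTYPE > LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Succeed if that locale name mentions UTF-8 (in any letter case).
is_utf8_locale() {
    case "$(effective_ctype)" in
        *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   if is_utf8_locale; then echo "UTF-8 locale in effect"; fi
```

A value such as `en_US` with no LC_ variables set would count as a non-UTF-8 locale under this check.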


Also, Dallas and OpenMacNews -- can you *attach* *corresponding* test messages
and some "Malformed UTF-8" lines produced by those messages?  Right now the
test messages are on various pastebots, at dead URLs, scattered between bugs
etc. and it's impossible to reliably map one to the other. (Attaching them to
bugs is very important so that the URLs won't "rot" over time.)

Comment 36 John Gardiner Myers 2005-12-14 19:46:01 UTC
The min-version bump got reverted last night since it broke the buildbots. 
There is definitely a reproducible bug with U+00A0 characters which is fixed by
3.46, though one probably needs to have not-yet-committed charset normalization
changes in order to trigger it.

By the time SpamAssassin 3.2 is released, HTML::Parser 3.46 or later will have
had time to propagate.

[jgmyers@pong spamweights]$ echo $LANG
en_US
[jgmyers@pong spamweights]$ set | grep LC_
[jgmyers@pong spamweights]$ 
Comment 37 Justin Mason 2005-12-14 20:26:46 UTC
'The min-version bump got reverted last night since it broke the buildbots.'

yep, the rest of my email eventually came through and I realised that ;)
thanks.
Comment 38 openmacnews 2005-12-14 20:33:02 UTC
per request & for completeness ...

% echo $LANG
  en_US
% set | grep LC_
%

% uname -a
	Darwin devbox 8.3.0 Darwin Kernel Version 8.3.0: Mon Oct  3 20:04:04 PDT 2005;
root:xnu-792.6.22.obj~2/RELEASE_PPC Power Macintosh powerpc

% perl -V
	Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=darwin, osvers=8.2.0, archname=darwin-thread-multi-2level
    uname='darwin devbox 8.2.0 darwin kernel version 8.2.0: fri jun 24 17:46:54
pdt 2005; root:xnu-792.2.4.obj~3release_ppc power macintosh powerpc '

% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
	3.48

with:

    /var/MailServer/Conf/SA/Dist populated by the SA distro,

and,

    /var/MailServer/Conf/SA/Local populated by RDJ,

with:

================================================================
${EDITOR} .../rules_du_jour.conf

...
TRUSTED_RULESETS="SARE_REDIRECT_POST300 SARE_EVILNUMBERS0 SARE_EVILNUMBERS1
SARE_BAYES_POISON_NXM SARE_HEADER SARE_HEADER_ENG SARE_FRAUD SARE_SPOOF
SARE_RANDOM SARE_SPAMCOP_TOP200 SARE_OEM SARE_UNSUB SARE_URI_ENG BOGUSVIRUS
SARE_SPECIFIC SARE_OEM SARE_HTML SARE_OBFU SARE_GENLSUBJ SARE_GENLSUBJ_ENG
SARE_ADULT SARE_BML TRIPWIRE"

SA_DIR="/var/MailServer/Conf/SA/Local"
...
================================================================

with *EITHER* the 'simple' test message as above:

XXXXXX == /tmp/test.txt
	----------------------------------------------
	<html><body>
	TUMS&reg; Smoothies&trade;
	</body></html>
	----------------------------------------------

or XXXXXX == /tmp/"Attachment 3280 [details]"
(http://issues.apache.org/SpamAssassin/attachment.cgi?id=3280)

NOW, 

% cat /tmp/XXXXXX | \
spamassassin -L \
--lint \
--debug \
--siteconfigpath=/var/MailServer/Conf/SA/Dist \
--configpath=/var/MailServer/Conf/SA/Local \
--prefs-file=local.cf \
--nocreate-prefs >& /tmp/openmacnews.spamd.debug

% grep -c "Malformed" /tmp/openmacnews.spamd.debug
	0

verified on 3 different boxes, similarly config'd ...

!!!???

ok, i'm confused ... have the SARE Rules been updated when no one's looking?


richard
Comment 39 openmacnews 2005-12-14 21:15:49 UTC
repeating with:

cat /tmp/msg.txt.attachment00 | \
spamassassin -L \
--lint \
--debug \
--siteconfigpath=/var/DarkMatter/MailServer/Conf/SA/Dist \
--configpath=/var/DarkMatter//MailServer/Conf/SA/Local \
--prefs-file=local.cf \
--nocreate-prefs >& /tmp/openmacnews.spamd.debug

for cases:


% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
	3.48
% grep -c "Malformed" /tmp/openmacnews.spamd.debug
	0

% perl -e 'use HTML::Parser ...
	3.47
% grep -c "Malformed ...
	0

% perl -e 'use HTML::Parser ...
	3.46
% grep -c "Malformed ...
	0

% perl -e 'use HTML::Parser ...
	3.45
% grep -c "Malformed ...
	0
Comment 40 Dallas Engelken 2005-12-14 21:51:48 UTC
i just looked through the revision logs on svn at SARE and see no recent 
changes to any obfu tests in any ruleset that appears to fire up these 
malformed utf-8's.  70_sare_adult.cf, 70_sare_obfu.cf, and 70_sare_specific.cf 
all have obfu that causes utf-8 warns.  once i remove those dozen or so 
offending rules, i have no issues.

are you sure you have one or more of those files in..
/var/DarkMatter//MailServer/Conf/SA/Local

does spamd debug show it loading? like...
[19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf
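
That check can be done mechanically against a saved debug log. A minimal sketch, assuming the log contains "config: read file" lines like the one above; `ruleset_loaded` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the debug log ($1) shows the ruleset file ($2) being read.
ruleset_loaded() {
    grep -F 'dbg: config: read file ' "$1" | grep -qF "/$2"
}

# Example:
#   ruleset_loaded /tmp/openmacnews.spamd.debug 70_sare_obfu.cf \
#       && echo loaded || echo "not loaded"
```

If this reports "not loaded", a zero count of "Malformed" lines says nothing about the bug, only that the triggering rules never ran.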
Comment 41 Dallas Engelken 2005-12-14 21:59:31 UTC
I trimmed out the duplicate utf-8 warns in the debug below to leave 1 warn for 
every unique rule in 70_sare_obfu.cf that triggers the warn.  I wasnt running 
70_sare_adult.cf or 70_sare_specific during this test, so those obfu rules in 
there that trigger are not present in this debug.
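
Trimming of this kind can be reproduced with a short filter. This is a sketch, not necessarily how the log below was produced; it assumes one warning per log line containing ", rule NAME,", and `first_warn_per_rule` is a hypothetical helper name.

```shell
#!/bin/sh
# Copy a debug log to stdout, keeping only the first
# "Malformed UTF-8" warning for each rule name.
# All other lines pass through unchanged.
first_warn_per_rule() {
    awk '
        /Malformed UTF-8/ {
            rule = $0
            if (sub(/.*, rule /, "", rule)) {   # strip text up to the rule name
                sub(/,.*/, "", rule)            # strip text after the rule name
                if (rule in seen) next          # repeat for this rule: drop it
                seen[rule] = 1
            }
        }
        { print }
    ' "$1"
}

# Example:
#   first_warn_per_rule spamd.out > spamd.trimmed
```

Warnings that name no rule are left alone, so nothing is lost if the log also holds other kinds of UTF-8 complaints.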


# echo $LANG
en_US

# set | grep LC_
#

# perl -e 'use HTML::Parser; print HTML::Parser->VERSION . "\n"';
3.46

# svn info /tmp/spamassassin-trunk/
Path: /tmp/spamassassin-trunk
URL: http://svn.apache.org/repos/asf/spamassassin/trunk
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 356857

# spamassassin -V
SpamAssassin version 3.2.0-r356425
  running on Perl version 5.8.6

# ls -la /etc/mail/spamassassin/
-rw-r--r--  1 root root 158513 Oct  1 15:00 70_sare_obfu.cf
-rw-r--r--  1 root root    890 Sep 15 13:23 init.pre
-rw-r--r--  1 root root   1208 Sep 15 13:23 local.cf
-rw-r--r--  1 root root   2397 Sep 15 13:23 v310.pre

# ls -la /usr/share/spamassassin/
-rw-r--r--    1 root root   5495 Dec 14 14:02 10_default_prefs.cf
-rw-r--r--    1 root root  14312 Dec 14 14:02 20_dnsbl_tests.cf
-rw-r--r--    1 root root  17642 Dec 14 14:02 20_html_tests.cf
-rw-r--r--    1 root root   2164 Dec 14 14:02 20_net_tests.cf
-rw-r--r--    1 root root   2334 Dec 14 14:02 23_bayes.cf
-rw-r--r--    1 root root    420 Dec 14 14:02 25_accessdb.cf
-rw-r--r--    1 root root   1345 Dec 14 14:02 25_antivirus.cf
-rw-r--r--    1 root root    190 Dec 14 14:02 25_dcc.cf
-rw-r--r--    1 root root   1947 Dec 14 14:02 25_domainkeys.cf
-rw-r--r--    1 root root   2738 Dec 14 14:02 25_hashcash.cf
-rw-r--r--    1 root root    189 Dec 14 14:02 25_pyzor.cf
-rw-r--r--    1 root root   2201 Dec 14 14:02 25_razor2.cf
-rw-r--r--    1 root root   2873 Dec 14 14:02 25_spf.cf
-rw-r--r--    1 root root    352 Dec 14 14:02 25_textcat.cf
-rw-r--r--    1 root root   6544 Dec 14 14:02 25_uribl.cf
-rw-r--r--    1 root root   1116 Dec 14 14:02 60_awl.cf
-rw-r--r--    1 root root   4906 Dec 14 14:02 60_whitelist.cf
-rw-r--r--    1 root root   1726 Dec 14 14:02 60_whitelist_subject.cf
-rw-r--r--    1 root root 101479 Dec 14 14:02 languages
-rw-r--r--    1 root root  18944 Dec 14 14:02 triplets.txt
-rw-r--r--    1 root root   1869 Dec 14 14:02 user_prefs.template

# cat /root/test | spamc
X-Spam-Checker-Version: SpamAssassin 3.2.0-r356425 (2005-12-12) on
        asset.nmgi.com
X-Spam-Level: ****
X-Spam-Status: No, score=4.0 required=5.0 tests=HTML_60_70,HTML_MESSAGE,
        HTML_MISSING_CTYPE,HTML_SHORT_LENGTH autolearn=no
        version=3.2.0-r356425
Content-Type: text/html; charset=us-ascii

<html><body>
TUMS&reg; Smoothies&trade;
</body></html>




# spamd -D -L > spamd.out 2>&1 ^C
# cat spamd.out

[19475] dbg: logger: adding facilities: all
[19475] dbg: logger: logging level is DBG
[19475] dbg: logger: trying to connect to syslog/unix...
[19475] dbg: logger: opening syslog with unix socket
[19475] dbg: logger: successfully connected to syslog/unix
[19475] dbg: logger: successfully added syslog method
[19475] dbg: spamd: creating INET socket:
[19475] dbg: spamd:  Listen: 128
[19475] dbg: spamd:  LocalAddr: 127.0.0.1
[19475] dbg: spamd:  LocalPort: 783
[19475] dbg: spamd:  Proto: 6
[19475] dbg: spamd:  ReuseAddr: 1
[19475] dbg: spamd:  Type: 1
[19475] dbg: logger: adding facilities: all
[19475] dbg: logger: logging level is DBG
[19475] dbg: generic: SpamAssassin version 3.2.0-r356425
[19475] dbg: config: score set 0 chosen.
[19475] dbg: dns: no ipv6
[19475] dbg: dns: is Net::DNS::Resolver available? yes
[19475] dbg: dns: Net::DNS version: 0.49
[19475] dbg: dns: name server: 172.17.1.10, LocalAddr: 0.0.0.0
[19475] dbg: spamd: Preloading modules with HOME=/tmp/spamd-19475-init
[19475] dbg: ignore: test message to precompile patterns and load modules
[19475] dbg: config: using "/etc/mail/spamassassin" for site rules pre files
[19475] dbg: config: read file /etc/mail/spamassassin/init.pre
[19475] dbg: config: read file /etc/mail/spamassassin/v310.pre
[19475] dbg: config: using "/usr/share/spamassassin" for sys rules pre files
[19475] dbg: config: using "/usr/share/spamassassin" for default rules dir
[19475] dbg: config: read file /usr/share/spamassassin/10_default_prefs.cf
[19475] dbg: config: read file /usr/share/spamassassin/20_dnsbl_tests.cf
[19475] dbg: config: read file /usr/share/spamassassin/20_html_tests.cf
[19475] dbg: config: read file /usr/share/spamassassin/20_net_tests.cf
[19475] dbg: config: read file /usr/share/spamassassin/23_bayes.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_accessdb.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_antivirus.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_dcc.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_domainkeys.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_hashcash.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_pyzor.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_razor2.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_spf.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_textcat.cf
[19475] dbg: config: read file /usr/share/spamassassin/25_uribl.cf
[19475] dbg: config: read file /usr/share/spamassassin/60_awl.cf
[19475] dbg: config: read file /usr/share/spamassassin/60_whitelist.cf
[19475] dbg: config: read file /usr/share/spamassassin/60_whitelist_subject.cf
[19475] dbg: config: using "/etc/mail/spamassassin" for site rules dir
[19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf
[19475] dbg: config: read file /etc/mail/spamassassin/local.cf
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::URIDNSBL from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::URIDNSBL=HASH
(0x8d3a2c0)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::Hashcash from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::Hashcash=HASH
(0x8d28048)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::SPF from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::SPF=HASH(0x8d69c30)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::Pyzor from @INC
[19475] dbg: pyzor: local tests only, disabling Pyzor
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::Pyzor=HASH
(0x8e5e4f4)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::SpamCop from @INC
[19475] dbg: reporter: local tests only, disabling SpamCop
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::SpamCop=HASH
(0x8ed6dd0)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::AWL from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::AWL=HASH(0x8efd740)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::AutoLearnThreshold 
from @INC
[19475] dbg: plugin: registered 
Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH(0x8f08558)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::WhiteListSubject from 
@INC
[19475] dbg: plugin: registered 
Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::MIMEHeader from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::MIMEHeader=HASH
(0x8f1e3fc)
[19475] dbg: plugin: loading Mail::SpamAssassin::Plugin::ReplaceTags from @INC
[19475] dbg: plugin: registered Mail::SpamAssassin::Plugin::ReplaceTags=HASH
(0x8f2e1b0)
[19475] dbg: plugin: Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0x8f2e1b0) 
implements 'finish_parsing_end'
[19475] dbg: replacetags: replacing tags
[19475] dbg: replacetags: done replacing tags
[19475] dbg: bayes: no dbs present, cannot tie DB R/O: /tmp/spamd-19475-
init/.spamassassin/bayes_toks
[19475] dbg: config: score set 0 chosen.
[19475] dbg: message: ---- MIME PARSER START ----
[19475] dbg: message: main message type: text/plain
[19475] dbg: message: parsing normal part
[19475] dbg: message: added part, type: text/plain
[19475] dbg: message: ---- MIME PARSER END ----
[19475] dbg: bayes: no dbs present, cannot tie DB R/O: /tmp/spamd-19475-
init/.spamassassin/bayes_toks
[19475] dbg: dns: is DNS available? 0
[19475] dbg: metadata: X-Spam-Relays-Trusted:
[19475] dbg: metadata: X-Spam-Relays-Untrusted:
[19475] dbg: message: no encoding detected
[19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) 
implements 'parsed_metadata'
[19475] dbg: rules: local tests only, ignoring RBL eval
[19475] dbg: check: running tests for priority: 0
[19475] dbg: rules: running header regexp tests; score so far=0
[19475] dbg: plugin: registering glue method for check_hashcash_value 
(Mail::SpamAssassin::Plugin::Hashcash=HASH(0x8d28048))
[19475] dbg: plugin: registering glue method for check_hashcash_double_spend 
(Mail::SpamAssassin::Plugin::Hashcash=HASH(0x8d28048))
[19475] dbg: eval: all '*From' addrs: ignore@compiling.spamassassin.taint.org
[19475] dbg: eval: all '*To' addrs:
[19475] dbg: plugin: registering glue method for check_subject_in_blacklist 
(Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80))
[19475] dbg: plugin: registering glue method for check_subject_in_whitelist 
(Mail::SpamAssassin::Plugin::WhiteListSubject=HASH(0x8f14a80))
[19475] dbg: rules: running body-text per-line regexp tests; score so far=0
[19475] dbg: uri: running uri tests; score so far=0
[19475] dbg: rules: running raw-body-text per-line regexp tests; score so far=0
[19475] dbg: rules: running full-text regexp tests; score so far=0
[19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) 
implements 'check_tick'
[19475] dbg: check: running tests for priority: 500
[19475] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x8d3a2c0) 
implements 'check_post_dnsbl'
[19475] dbg: rules: running meta tests; score so far=0
[19475] dbg: rules: running header regexp tests; score so far=0
[19475] dbg: rules: running body-text per-line regexp tests; score so far=0
[19475] dbg: uri: running uri tests; score so far=0
[19475] dbg: rules: running raw-body-text per-line regexp tests; score so far=0
[19475] dbg: rules: running full-text regexp tests; score so far=0
[19475] dbg: check: running tests for priority: 1000
[19475] dbg: rules: running meta tests; score so far=0
[19475] dbg: rules: running header regexp tests; score so far=0
[19475] dbg: plugin: registering glue method for check_from_in_auto_whitelist 
(Mail::SpamAssassin::Plugin::AWL=HASH(0x8efd740))
[19475] dbg: locker: safe_lock: created /tmp/spamd-19475-
init/.spamassassin/auto-whitelist.lock.asset.nmgi.com.19475
[19475] dbg: locker: safe_lock: trying to get lock on /tmp/spamd-19475-
init/.spamassassin/auto-whitelist with 0 retries
[19475] dbg: locker: safe_lock: link to /tmp/spamd-19475-
init/.spamassassin/auto-whitelist.lock: link ok
[19475] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W 
in /tmp/spamd-19475-init/.spamassassin/auto-whitelist
[19475] dbg: auto-whitelist: db-based 
ignore@compiling.spamassassin.taint.org|ip=none scores 0/0
[19475] dbg: auto-whitelist: AWL active, pre-score: 0, autolearn score: 0, 
mean: undef, IP: undef
[19475] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
[19475] dbg: auto-whitelist: DB addr list: file locked, breaking lock
[19475] dbg: locker: safe_unlock: unlink /tmp/spamd-19475-
init/.spamassassin/auto-whitelist.lock
[19475] dbg: auto-whitelist: post auto-whitelist score: 0
[19475] dbg: rules: running body-text per-line regexp tests; score so far=0
[19475] dbg: uri: running uri tests; score so far=0
[19475] dbg: rules: running raw-body-text per-line regexp tests; score so far=0
[19475] dbg: rules: running full-text regexp tests; score so far=0
[19475] dbg: check: is spam? score=0 required=5
[19475] dbg: check: tests=
[19475] dbg: check: subtests=
[19475] dbg: config: copying current conf to backup
[19475] info: spamd: server started on port 783/tcp (running version 3.2.0-
r356425)
[19475] info: spamd: server pid: 19475
[19475] info: spamd: server successfully spawned child process, pid 19478
[19475] dbg: prefork: child 19478: entering state 0
[19478] dbg: prefork: sysread(8) not ready, wait max 300 secs
[19475] dbg: prefork: new lowest idle kid: none
[19479] dbg: prefork: sysread(9) not ready, wait max 300 secs
[19475] info: spamd: server successfully spawned child process, pid 19479
[19475] dbg: prefork: child 19479: entering state 0
[19475] dbg: prefork: new lowest idle kid: none
[19475] dbg: prefork: child 19478: entering state 1
[19475] dbg: prefork: new lowest idle kid: 19478
[19475] dbg: prefork: child reports idle
[19475] dbg: prefork: child 19479: entering state 1
[19475] dbg: prefork: new lowest idle kid: 19478
[19475] dbg: prefork: child reports idle
[19475] info: prefork: child states: II
[19475] dbg: prefork: ordered 19478 to accept
[19475] dbg: prefork: sysread(7) not ready, wait max 300 secs
[19478] info: spamd: connection from localhost.localdomain [127.0.0.1] at port 
34629
[19475] dbg: prefork: child 19478: entering state 2
[19475] dbg: prefork: new lowest idle kid: 19479
[19478] info: spamd: setuid to root succeeded
[19478] dbg: info: user has changed
[19478] dbg: bayes: no dbs present, cannot tie DB 
R/O: /root/.spamassassin/bayes_toks
[19478] dbg: config: score set 0 chosen.
[19478] warn: spamd: still running as root: user not specified with -u, not 
found, or set to root, falling back to nobody at /usr/bin/spamd line 1152, 
<GEN5>
line 4.
[19478] info: spamd: processing message (unknown) for root:99
[19478] dbg: dns: name server: 172.17.1.10, LocalAddr: 0.0.0.0
[19478] dbg: bayes: no dbs present, cannot tie DB 
R/O: /root/.spamassassin/bayes_toks
[19478] dbg: metadata: X-Spam-Relays-Trusted:
[19478] dbg: metadata: X-Spam-Relays-Untrusted:
[19478] dbg: message: ---- MIME PARSER START ----
[19478] dbg: message: main message type: text/html
[19478] dbg: message: parsing normal part
[19478] dbg: message: added part, type: text/html
[19478] dbg: message: ---- MIME PARSER END ----
[19478] dbg: message: no encoding detected
[19478] dbg: rules: local tests only, ignoring RBL eval
[19478] dbg: check: running tests for priority: 0
[19478] dbg: rules: running header regexp tests; score so far=0
[19478] dbg: eval: all '*From' addrs:
[19478] dbg: eval: all '*To' addrs:
[19478] dbg: rules: running body-text per-line regexp tests; score so far=0
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xcf) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_PRICE1, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd1) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_PRESCR_SPL1, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd5) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_SOFT2, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xce) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_VICODIN, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd0) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_CIALIS2, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd1) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_PRESCRIP, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xce) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_VISIT1, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd2) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_XANAX, line 1, <GEN5> 
line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xc4) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule SARE_OBFU_GUARANTEE, line 1, 
<GEN5> line 10.
[19478] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xd0) in pattern match (m//) 
at /etc/mail/spamassassin/70_sare_obfu.cf, rule __SARE_OBFU_MEDS2, line 1, 
<GEN5> line 10.
[19478] dbg: uri: running uri tests; score so far=0
[19478] dbg: rules: ran eval rule HTML_SHORT_LENGTH ======> got hit (1)
[19478] dbg: rules: ran eval rule __HTML_LENGTH_512 ======> got hit (1)
[19478] dbg: rules: ran eval rule HTML_60_70 ======> got hit (1)
[19478] dbg: bayes: no dbs present, cannot tie DB 
R/O: /root/.spamassassin/bayes_toks
[19478] dbg: bayes: not scoring message, returning undef
[19478] dbg: bayes: opportunistic call attempt failed, DB not readable
[19478] dbg: rules: ran eval rule HTML_MESSAGE ======> got hit (1)
[19478] dbg: rules: ran eval rule __HTML_LENGTH_384 ======> got hit (1)
[19478] dbg: rules: ran eval rule __HTML_LENGTH_0000_1024 ======> got hit (1)
[19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=3
[19478] dbg: rules: running full-text regexp tests; score so far=3
[19478] dbg: check: running tests for priority: 500
[19478] dbg: rules: running meta tests; score so far=3
[19478] dbg: rules: running header regexp tests; score so far=4
[19478] dbg: rules: running body-text per-line regexp tests; score so far=4
[19478] dbg: uri: running uri tests; score so far=4
[19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=4
[19478] dbg: rules: running full-text regexp tests; score so far=4
[19478] dbg: check: running tests for priority: 1000
[19478] dbg: rules: running meta tests; score so far=4
[19478] dbg: rules: running header regexp tests; score so far=4
[19478] dbg: rules: running body-text per-line regexp tests; score so far=4
[19478] dbg: uri: running uri tests; score so far=4
[19478] dbg: rules: running raw-body-text per-line regexp tests; score so far=4
[19478] dbg: rules: running full-text regexp tests; score so far=4
[19478] dbg: plugin: Mail::SpamAssassin::Plugin::AutoLearnThreshold=HASH
(0x8f08558) implements 'autolearn_discriminator'
[19478] dbg: learn: auto-learn: currently using scoreset 0
[19478] dbg: learn: auto-learn: message score: 4, computed score for autolearn: 
4
[19478] dbg: learn: auto-learn? ham=0.1, spam=12, body-points=3, head-points=0, 
learned-points=0
[19478] dbg: learn: auto-learn? no: inside auto-learn thresholds, not 
considered ham or spam
[19478] dbg: check: is spam? score=4 required=5
[19478] dbg: check: 
tests=HTML_60_70,HTML_MESSAGE,HTML_MISSING_CTYPE,HTML_SHORT_LENGTH
[19478] dbg: check: 
subtests=__HTML_LENGTH_0000_1024,__HTML_LENGTH_384,__HTML_LENGTH_512
[19478] info: spamd: clean message (4.0/5.0) for root:99 in 0.0 seconds, 99 
bytes.
[19478] info: spamd: result: .  4 - 
HTML_60_70,HTML_MESSAGE,HTML_MISSING_CTYPE,HTML_SHORT_LENGTH 
scantime=0.0,size=99,user=root,uid=99,required_score=5.0,rhos
t=localhost.localdomain,raddr=127.0.0.1,rport=34629,mid=(unknown),autolearn=no
[19478] dbg: config: copying current conf from backup
[19475] dbg: prefork: child 19478: entering state 1
[19475] dbg: prefork: new lowest idle kid: 19478
[19475] dbg: prefork: child reports idle
[19475] info: prefork: child states: II
[19478] dbg: prefork: sysread(8) not ready, wait max 300 secs
[19475] info: spamd: server killed by SIGINT, shutting down
Comment 42 openmacnews 2005-12-14 22:02:16 UTC
just reproduced the "0" errors on another box ... <grumble>

> are you sure you have one or more of those files in..
> /var/MailServer/Conf/SA/Local

yup.

% pwd
	/var/MailServer/Conf/SA/Local
% ls *.cf
	70_sare_adult.cf             70_sare_header.cf      70_sare_specific.cf     72_sare_redirect_post3.0.0.cf
	70_sare_bayes_poison_nxm.cf  70_sare_header_eng.cf  70_sare_spoof.cf        99_sare_fraud_post25x.cf
	70_sare_evilnum0.cf          70_sare_html.cf        70_sare_unsub.cf        bogus-virus-warnings.cf
	70_sare_evilnum1.cf          70_sare_obfu.cf        70_sare_uri_eng.cf      local.cf
	70_sare_genlsubj.cf          70_sare_oem.cf         70_sc_top200.cf         tripwire.cf
	70_sare_genlsubj_eng.cf      70_sare_random.cf      72_sare_bml_post25x.cf

> does spamd debug show it loading? like...
> [19475] dbg: config: read file /etc/mail/spamassassin/70_sare_obfu.cf

yup again.

% grep _sare_ /tmp/openmacnews.spamd.debug 
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_adult.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_bayes_poison_nxm.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_evilnum0.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_evilnum1.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_genlsubj.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_genlsubj_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_header.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_header_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_html.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_obfu.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_oem.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_random.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_specific.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_spoof.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_unsub.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/70_sare_uri_eng.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/72_sare_bml_post25x.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/72_sare_redirect_post3.0.0.cf
[21273] dbg: config: read file /var/MailServer/Conf/SA/Local/99_sare_fraud_post25x.cf
Comment 43 openmacnews 2005-12-14 22:16:11 UTC
dallas,

squinting at differences, i note in your 'cat',

    [19475] dbg: dns: Net::DNS version: 0.49

try:

    % perl -e 'use Net::DNS; print Net::DNS->VERSION';
      0.54

?

other than that, not much ...
Comment 44 Dallas Engelken 2005-12-14 22:23:44 UTC
can you verify your spamassassin -V again.  i didn't see it in comment 38 or 
comment 39.   

# perl -e 'use Net::DNS; print Net::DNS->VERSION . "\n"';
0.55
# cat /root/test | spamassassin 2>&1  | grep -c Malformed
351
Comment 45 openmacnews 2005-12-14 22:31:35 UTC
% which spamassassin
   /usr/local/spamassassin/bin/spamassassin

% ls -al /usr/local/spamassassin/bin/spamassassin
   -r-xr-xr-x 1 root wheel 24979 Dec  8 09:38
/usr/local/spamassassin/bin/spamassassin

% spamassassin -V
  SpamAssassin version 3.2.0-r322462
    running on Perl version 5.8.7
Comment 46 Justin Mason 2005-12-14 23:03:33 UTC
could be the HTML::Parser dependencies, esp HTML::Entities.

: jm 285...; perl -e 'use HTML::Entities; print $HTML::Entities::VERSION,"\n"'
1.32
: exit=0 Wed Dec 14 14:01:20 PST 2005; cd /home/jm/ftp/spamassassin
: jm 286...; perl -e 'use HTML::Parser; print $HTML::Parser::VERSION,"\n"'
3.48
: exit=0 Wed Dec 14 14:01:28 PST 2005; cd /home/jm/ftp/spamassassin
: jm 287...; perl -e 'use HTML::Tagset; print $HTML::Tagset::VERSION,"\n"'
3.04
Comment 47 Dallas Engelken 2005-12-14 23:21:56 UTC
# perl -e 'use HTML::Parser; print $HTML::Parser::VERSION,"\n"'
3.46
# perl -e 'use HTML::Tagset; print $HTML::Tagset::VERSION,"\n"'
3.04
# perl -e 'use HTML::Entities; print $HTML::Entities::VERSION,"\n"'
1.32
Comment 48 Justin Mason 2005-12-15 00:14:08 UTC
ok, got it reproduced, and it's either a perl or HTML::Parser bug.  I've
narrowed it down to a standalone perl script.   More details at:

https://rt.cpan.org/Ticket/Display.html?id=16495
Comment 49 Justin Mason 2005-12-15 00:32:45 UTC
wtf -- rt.cpan.org doesn't seem to be recording my comments on that bug!  very
annoying.  here's the demo script.

http://taint.org/xfer/2005/demo_utf8_bug.pl
Comment 50 openmacnews 2005-12-15 00:47:34 UTC
here,

HTML::Entities -> 1.32
HTML::Parser   -> 3.48
HTML::Tagset   -> 3.10
Comment 51 openmacnews 2005-12-15 00:54:37 UTC
yup. it's back. here's my output from exec'ing
http://taint.org/xfer/2005/demo_utf8_bug.pl:

http://paste.lisp.org/display/14647
Comment 52 John Gardiner Myers 2005-12-15 01:16:36 UTC
Working on filing a bug against Perl.
Comment 53 John Gardiner Myers 2005-12-15 01:25:39 UTC
Filed http://rt.perl.org/rt3/Ticket/Display.html?id=37950
(May need to enter userid guest password guest)
Comment 54 John Gardiner Myers 2005-12-15 01:58:04 UTC
A workaround is to wrap the non-ASCII characters in the SARE patterns in (?-i:...):

/(?-i:\xC4)|a/i;

instead of:

/\xC4|a/i;
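
What the wrapper changes can be sketched in a few lines of Perl. This is an illustration only, assuming a perl that does not have the bug (5.8.8 or later); the sample strings are made up, and utf8::upgrade is used just to stand in for text that perl has marked as UTF-8. The point is that the wrapped form still matches "a" and "A", but no longer matches the other case of the high-bit character.

```perl
use strict;
use warnings;

# Original style: /i applies to the high-bit character as well.
my $plain   = qr/\xC4|a/i;
# Workaround style: case-insensitivity is switched off for \xC4 only.
my $wrapped = qr/(?-i:\xC4)|a/i;

# Sample text (made up), upgraded so perl treats it as a UTF-8 string.
my ($upper, $lower) = ("\xC4", "\xE4");
utf8::upgrade($upper);
utf8::upgrade($lower);

my %result = (
    plain_upper   => ($upper =~ $plain)   ? 1 : 0,
    plain_lower   => ($lower =~ $plain)   ? 1 : 0,
    wrapped_upper => ($upper =~ $wrapped) ? 1 : 0,
    wrapped_lower => ($lower =~ $wrapped) ? 1 : 0,   # case variant no longer matches
    wrapped_A     => ("A"    =~ $wrapped) ? 1 : 0,   # ASCII part is still /i
);
print "$_=$result{$_}\n" for sort keys %result;
```

Whether losing the match on the other case of the high-bit character matters depends on the individual rule.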
Comment 55 Dallas Engelken 2005-12-15 03:10:18 UTC
yup, lookin good here 

# perl demo_utf8_bug.pl 2>&1 | grep -c Malformed
64
Comment 56 openmacnews 2005-12-15 03:29:18 UTC
> A workaround is to wrap (?-i: around the non-ASCII characters in the SARE
> patterns:

is there any reason NOT to? if it works, it's independent of a perl fix ... yes?
Comment 57 John Gardiner Myers 2005-12-15 07:28:55 UTC
(In reply to comment #56)
> is there any reason NOT to?

Depends on whether or not each rule really wants case independence for those
characters.

> if it works, it's independent of a perl fix ... yes?

Yes, it would be independent of a perl fix.  If the rule really didn't want case
independence, it could even prevent theoretical FPs.
Comment 58 Sidney Markowitz 2005-12-15 07:35:33 UTC
Here is another workaround that seems to work in the demo script at
http://taint.org/xfer/2005/demo_utf8_bug.pl and doesn't require changing all the
rule patterns:

Insert  use encoding 'utf8';  at the beginning of the block that defines the
patterns that contain the \xC4 and so on bytes, i.e.,

  sub run_regexp {
    use encoding 'utf8';

I haven't tried adding that to whatever processes the rule regexps. Can someone
who knows where that would go put it in and see if it fixes the problem?

Alternatively,

  utf8::encode($text);

on the output from HTML::Parser solves the problem from the other direction.
That is, the first changes the pattern, the second changes the parsed text
string, either one working to get them to match each other. I'm just more wary
about changing the parsed text string because that seems to me more likely to
have side effects. I'm open to comments from someone who better understands the
internals of Perl strings and UTF-8 and other unicode encodings.
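
A small sketch of the kind of side effect in question, using only the built-in utf8:: functions. It assumes the parsed text is a character string marked as UTF-8 (utf8::upgrade stands in for that here) and uses a made-up one-character sample; it is not taken from the SpamAssassin code.

```perl
use strict;
use warnings;

# A one-character string, U+00C4, marked as UTF-8 internally
# (a stand-in for the parsed text; sample value is made up).
my $text = "\xC4";
utf8::upgrade($text);

my $before_flag = utf8::is_utf8($text) ? 1 : 0;   # character string
my $before_hit  = ($text =~ /\xC4/)    ? 1 : 0;   # pattern finds U+00C4

# Replace the characters with their UTF-8 encoded bytes, in place.
utf8::encode($text);

my $after_flag = utf8::is_utf8($text) ? 1 : 0;    # now a plain byte string
my $after_len  = length $text;                    # two bytes: C3 84
my $after_hit  = ($text =~ /\xC4/)    ? 1 : 0;    # byte C4 is not in the string

printf "flag %d -> %d, length -> %d, /\\xC4/ match %d -> %d\n",
    $before_flag, $after_flag, $after_len, $before_hit, $after_hit;
```

So after utf8::encode a rule written with \xC4 would be looking for a byte that is no longer there, which is one reason to be careful about changing the text rather than the patterns.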
Comment 59 Justin Mason 2005-12-15 07:56:40 UTC
the pattern doesn't need to be marked as UTF-8, since it's in itself composed
purely of US-ASCII characters; this really is a perl bug.
Comment 60 Sidney Markowitz 2005-12-15 08:27:07 UTC
Created attachment 3304 [details]
Here is a short perl script demonstrating the problem without HTML::Parser

If this is really a perl bug, then is this script, which reproduces the problem
without requiring HTML::Parser, supposed to work? I'm a bit confused about how
perl deals with setting the utf-8 flag or not in these strings.

If this script does demonstrate the bug, it might be a better example for you
to submit than the one that relies on HTML::Parser.
Comment 61 openmacnews 2005-12-15 09:06:54 UTC
and this script, w/o HTML::Parser, shows the error, here:

http://paste.lisp.org/display/14669
Comment 62 Sidney Markowitz 2005-12-15 09:43:53 UTC
It was just pointed out to me that the perl bug report that John submitted,
linked to in comment #53, contains a three line demonstration of the perl bug
that doesn't involve HTML::Parser, a lot more clear than the script I attached here.
Comment 63 Dallas Engelken 2005-12-15 15:18:19 UTC
http://rt.perl.org/rt3/Ticket/Display.html?id=37950
From: Gisle Aas <gisle@ActiveState.com>

"Already fixed in blead and perl-5.8.8-tobe. http://public.activestate.com/cgi-bin/perlbrowse?patch=25095"
Comment 64 Dallas Engelken 2005-12-15 15:52:52 UTC
i just rebuilt the srpm for FC4 with
http://public.activestate.com/cgi-bin/perlbrowse?patch=25095

# rpm -Uvh perl-5.8.6-22.i386.rpm
Preparing...                ########################################### [100%]
   1:perl                   ########################################### [100%]

[root@asset /]# perl demo_utf8_bug.pl
Wide character in print at demo_utf8_bug.pl line 37.
[
     â
      ¢]


# cat /root/test | spamassassin 2>&1  | grep -c Malformed
0


no more Malformed UTF-8 warns... :)
Comment 65 openmacnews 2005-12-15 20:41:28 UTC
at least for me, that patch doesn't apply against a fresh CO of perl 'stable'
(v587) ... this attach

       openmacnews.patch.SA_PERL.3787.txt

does.

now cooking a new perl build to see if it behaves ...
Comment 66 openmacnews 2005-12-15 20:42:58 UTC
Created attachment 3305 [details]
simple tweak of "http://rt.perl.org/rt3/Ticket/Display.html?id=37950" to apply/try to perl v587
Comment 67 Dallas Engelken 2005-12-15 20:52:58 UTC
and since i was applying against 5.8.5 (fc3) and 5.8.6 (fc4), this is what i 
used....  if anyone feels like rebuilding their srpms

# cat perl-5.8.6-utf8.patch
--- utf8.c.orig 2005-12-15 08:06:59.000000000 -0600
+++ utf8.c      2005-12-15 08:06:32.000000000 -0600
@@ -1976,7 +1976,7 @@
               if (u1)
                    to_utf8_fold(p1, foldbuf1, &foldlen1);
               else {
-                   natbuf[0] = *p1;
+                    uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
                    to_utf8_fold(natbuf, foldbuf1, &foldlen1);
               }
               q1 = foldbuf1;
@@ -1986,7 +1986,7 @@
               if (u2)
                    to_utf8_fold(p2, foldbuf2, &foldlen2);
               else {
-                   natbuf[0] = *p2;
+                    uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
                    to_utf8_fold(natbuf, foldbuf2, &foldlen2);
               }
               q2 = foldbuf2;
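
A possible reading of this patch (an interpretation, not something spelled out in the thread): the old code copied a raw native byte into natbuf and handed it to to_utf8_fold, which expects UTF-8, so a high-bit byte such as 0xC4 looked like a start byte with no valid continuation. The new code converts the byte to UTF-8 first. The sketch below shows the same distinction from the Perl side with the built-in utf8::decode and utf8::encode; the sample bytes are chosen to resemble the warnings and are not taken from perl's internals.

```perl
use strict;
use warnings;

# A high-bit byte followed by NUL, similar to the
# "start byte 0xc4 ... non-continuation byte 0x00" warning.
my $raw    = "\xC4\x00";
my $raw_ok = utf8::decode($raw) ? 1 : 0;       # 0: not well-formed UTF-8

# The same character converted to UTF-8 first.
my $encoded = "\xC4";
utf8::encode($encoded);                         # now the two bytes C3 84
my $hex    = unpack 'H*', $encoded;             # hex dump of those bytes
my $enc_ok = utf8::decode($encoded) ? 1 : 0;    # 1: well-formed UTF-8

print "raw bytes well-formed: $raw_ok\n";
print "encoded bytes ($hex) well-formed: $enc_ok\n";
```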
Comment 68 openmacnews 2005-12-16 03:12:02 UTC
looks good!  with a fresh-from-src perl v587, patched with attachment id=3305
from comment #66 above, none of the test cases trigger the "Malformed ..."
errors anymore.

loading up all previously-suspect SARE rules on my production system, and now to
watch for awhile ...

richard
Comment 69 John Gardiner Myers 2005-12-16 20:59:39 UTC
The bug does not appear to trigger when the Latin-1 character is surrounded by
the [ ] regex operator.  It is triggered when the Latin-1 character is followed
by the | regex operator and case insensitivity is in effect.

This will affect KOREAN_UCE_SUBJECT when charset normalization of headers is
implemented.
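
A rough way to look for affected rules, based on the description above. This is a heuristic sketch only: needs_workaround is a hypothetical helper, the sample rules are made up, only simple "TYPE NAME /regex/flags" lines with \xNN escapes are handled, and a (?-i:...) wrapper that is already in place is not detected.

```perl
use strict;
use warnings;

# Heuristic check: does a rule line look like it could hit the bug?
# Flags rules that use /i and have a high-bit \xNN escape directly before |.
sub needs_workaround {
    my ($line) = @_;
    return 0
        unless $line =~ m{^\s*(?:body|rawbody|full|uri)\s+\S+\s+(/.*/)([a-z]*)\s*\z};
    my ($regex, $flags) = ($1, $2);
    return 0 unless $flags =~ /i/;                           # /i must be in effect
    return $regex =~ /\\x[89A-Fa-f][0-9A-Fa-f]\|/ ? 1 : 0;   # \xNN (>= 0x80) then |
}

# Hypothetical sample rules, not taken from any real ruleset.
my @rules = (
    'body DEMO_ALT   /v(?:\xCE|i)sit/i',   # alternation plus /i: flagged
    'body DEMO_CLASS /v[\xCEi]sit/i',      # character class: not flagged
    'body DEMO_CASE  /v(?:\xCE|i)sit/',    # no /i: not flagged
);

my @flagged = grep { needs_workaround($_) } @rules;
print "$_\n" for @flagged;
```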
Comment 70 Justin Mason 2005-12-16 21:26:02 UTC
'The bug does not appear to trigger when the Latin-1 character is surrounded by
the [ ] regex operator.  It is triggered when the Latin-1 character is followed
by the | regex operator and case insensitivity is in effect.'

turning off case-i seems to be an easier fix, btw.
Comment 71 John Gardiner Myers 2005-12-16 21:47:38 UTC
The point of my previous comment is to help identify which rules need workarounds.  
Comment 72 openmacnews 2005-12-22 20:50:37 UTC
personally, after the patch to perl, i haven't a trace of this prob anymore.

others?

this BUG still have legs, or status change in order?
Comment 73 John Gardiner Myers 2005-12-22 21:05:53 UTC
We can hardly require Perl 5.8.8 as a minimum version.  I'm not sure what else
we can do besides maintain the disable-case-insensitivity workarounds in the rules.
Comment 74 Justin Mason 2005-12-22 21:16:33 UTC
I don't think it warrants making perl 5.8.8 a requirement for SA.
if someone runs into this warning, they can:

1. upgrade perl (last resort imo)
2. fix the rules to work around it (more likely)

I'm marking this WORKSFORME, since it's not an SA bug per se, and those
fixes/workarounds are applicable.

if we run into rules that produce this warning in the distributed rulesets, we
can fix those as they arise, now that we know how to.
Comment 75 openmacnews 2006-02-10 20:23:35 UTC
can anyone yet verify that this *has* been resolved in perl-5.8.8-release?

per notes above it was "already fixed" in "5.8.8 to be", but i have not,
personally, checked as the 587 patch *is* working ...
Comment 76 Justin Mason 2006-02-12 01:25:08 UTC
*** Bug 4665 has been marked as a duplicate of this bug. ***
Comment 77 openmacnews 2006-02-12 01:29:14 UTC
for those that give a hoot, i'll answer my own question ...

at least on OSX 10.4.4, looks like with perl588-release, all's well with SA head
(r376926), no patch required ...
Comment 78 Justin Mason 2007-04-27 02:47:28 UTC
*** Bug 5440 has been marked as a duplicate of this bug. ***
Comment 79 Justin Mason 2007-04-27 02:48:12 UTC
*** Bug 5437 has been marked as a duplicate of this bug. ***
Comment 80 Justin Mason 2007-04-27 02:59:29 UTC
ok, it's been a long time since this bug was opened, so I'll give a quick
summary for the new arrivals coming from bugs 5440 and 5437.

There's a perl bug in dealing with matching ISO-8859-1 patterns against UTF-8
strings: http://rt.perl.org/rt3/Public/Bug/Display.html?id=37950 .  Here are
the options:

- Apparently this bug is fixed in perl 5.8.8, so you could upgrade to that.

- Alternatively you could rebuild your current perl from source using the patch
  here, or at that rt.perl.org bug.

- Alternatively, you could fix the rules to avoid the bug: see comments 54, 69
  and 70.  (The easiest way is to remove the /i at the end of the rule and fix
  them to use /[iI]/ instead of just /i/ inside the patterns.)
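
As an illustration of the last option, here is a minimal sketch with a made-up pattern (not the actual SARE rule, and not necessarily what the attached fix looks like). It removes the trailing /i and spells out the ASCII letters as character classes, then checks that both forms agree on a few sample byte strings.

```perl
use strict;
use warnings;

# Hypothetical rule pattern in the original style: one trailing /i.
my $with_i    = qr/v(?:\xCE|i)sit/i;
# Same pattern with /i removed and the ASCII letters written as classes.
my $without_i = qr/[vV](?:\xCE|[iI])[sS][iI][tT]/;

# Made-up sample strings (plain byte strings).
my @samples = ("visit", "VISIT", "v\xCEsit", "V\xCESIT", "vxsit");

my $disagreements = 0;
for my $s (@samples) {
    my $old = ($s =~ $with_i)    ? 1 : 0;
    my $new = ($s =~ $without_i) ? 1 : 0;
    $disagreements++ if $old != $new;
    # Print the sample as hex so high-bit bytes stay readable.
    printf "%-12s with_i=%d without_i=%d\n", unpack('H*', $s), $old, $new;
}
print "disagreements: $disagreements\n";
```

The rewritten form treats the high-bit character literally, so it will not match a different-case variant of that character; for obfuscation rules that is usually what is wanted, but it is worth checking per rule.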

SARE guys -- any chance the rules from comment 10 could be fixed in the
distributed SARE rulesets to include the workaround?  This is going to be a
major FAQ once 3.2.0 is released, since perl 5.8.8 is still not that common.

I'll attach a demo of one SARE rule fixed:

spamassassin -Lt -p rule.cf < badmsg  2>&1 | grep 'Malformed UTF-8 character' | wc -l
872
spamassassin -Lt -p rule_fixed.cf < badmsg  2>&1 | grep 'Malformed UTF-8 character' | wc -l
0
Comment 81 Justin Mason 2007-04-27 03:00:03 UTC
Created attachment 3926 [details]
badmsg (sample message to trigger the bug)
Comment 82 Justin Mason 2007-04-27 03:00:47 UTC
Created attachment 3927 [details]
__SARE_OBFU_VISIT1 as it's currently distributed
Comment 83 Justin Mason 2007-04-27 03:01:18 UTC
Created attachment 3928 [details]
a fixed __SARE_OBFU_VISIT1
Comment 84 Larry Rosenbaum 2007-04-27 10:08:01 UTC
Upgrading Perl to v5.8.8 worked for me
(SpamAssassin v3.2.0-rc3 on Solaris 9)
I was trying to avoid that because Sunfreeware doesn't have a prebuilt 5.8.8 
so I had to build it from source.
Comment 85 Loren Wilton 2007-04-28 05:51:03 UTC
I'll see what I can do about fixing the obfu rules and getting Doc to update 
the file on the site.  :-)

Justin, do you know if the (?i:words) syntax (or whatever it is exactly) is 
also broken so that I have to use character classes in all cases?
Comment 86 Justin Mason 2007-05-01 06:41:42 UTC
(In reply to comment #85)
> Justin, do you know if the (?i:words) syntax (or whatever it is exactly) is 
> also broken so that I have to use character classes in all cases?

Yep, any case where a part of the pattern is case-insensitive, and matching
ISO-8859-1 characters, is broken.  Case-sensitive bits are fine though.
Comment 87 Justin Mason 2007-05-10 05:07:45 UTC
*** Bug 5459 has been marked as a duplicate of this bug. ***
Comment 88 Doc Schneider 2007-05-21 06:44:47 UTC
I have just committed fixed rule sets for SARE that cause this bug. Please let
me know if any other UTF-8 issues are found that deal with SARE rules.

Comment 89 Thomas Eisenbarth 2007-07-10 09:36:37 UTC
(In reply to comment #74)

A hint for fellow Windows users trying to make SA work under Windows:

Don't upgrade to perl 5.8.8 if you ever intend to use SA under Windows NT 
(broken command line parse). Later Windows versions seem to be OK, though.