Bug 6225

Summary: Invalid numerical HTML entity crashes perl
Product: Spamassassin Reporter: Mark Martinec <Mark.Martinec>
Component: LibrariesAssignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: jm
Priority: P5    
Version: SVN Trunk (Latest Devel Version)   
Target Milestone: 3.3.0   
Hardware: PC   
OS: FreeBSD   
Whiteboard:

Description Mark Martinec 2009-10-22 12:05:30 UTC
This was a heavy battle, but I finally managed to track down the reason
for perl crashes which started occurring about a week ago. The trigger was
a new type of a spam message, with obfuscated text in its HTML section
(using foreign (like cyrillic) characters with glyphs resembling ordinary
ascii characters).

The spamassasin (command line, or spamd, or amavisd) is crashing in
Mail::SpamAssassin::PerMsgStatus::_get_parsed_uri_list (called from
Plugin/URIDNSBL), trying to match a wicked decoded HTML line to a
heavyweight regexp $tbirdurire :

      while (/$tbirdurire/igo) {

Apart from a perl bug (which should not be crashing), the incident also
revealed a bug in HTML::Parser (3.62), which produced an illegal character
with a huge UTF-8 code by incorrectly parsing one numerical entity in HTML.

I reported the HTML::Parser bug to Gisle Aas, and I plan to report the
perl bug to the perl bug tracker.

Apart from waiting for the new versions of perl and new HTML::Parser,
I wonder what can be done on SpamAssassin to work around the trouble.
Reinventing HTML decoding by ourselves is not appealing, and checking
decoded utf-8 string validity is likely to be prohibitively slow.
Comment 1 Mark Martinec 2009-10-22 12:51:47 UTC
Reported to perlbug, assigned a tracker id: [perl #69973]
Comment 2 Mark Martinec 2009-10-22 15:49:05 UTC
> I reported the HTML::Parser bug to Gisle Aas

Awesome, Gisle Aas already provided the fix and uploaded an updated
version HTML-Parser-3.6 :

http://github.com/gisle/html-parser/
  commit/b9aae1e43eb2c8e989510187cff0ba3e996f9a4c

http://search.cpan.org/CPAN/authors/id/G/GA/GAAS/HTML-Parser-3.63.tar.gz
Comment 3 Warren Togami 2009-10-23 12:57:41 UTC
http://www.openwall.com/lists/oss-security/2009/10/23/8
CVE-2009-3626 perl-5.10.1
Effects only perl-5.10.1, but apparently not perl-5.8.0, perl-5.8.5, perl-5.8.8, perl-5.10.0.

http://www.openwall.com/lists/oss-security/2009/10/23/9
CVE-2009-3627 HTML-Parser-3.63
All versions prior to HTML::Parser 3.63 are effected.
Comment 4 Mark Martinec 2009-10-23 15:59:52 UTC
> Effects only perl-5.10.1, but apparently not perl-5.8.0, perl-5.8.5,
> perl-5.8.8, perl-5.10.0.

Thanks for investigating with other versions.
 
> http://www.openwall.com/lists/oss-security/2009/10/23/9
>   CVE-2009-3627 HTML-Parser-3.63
> All versions prior to HTML::Parser 3.63 are effected.
>
> Mark Martinec reported a denial of service flaw ((infinite loop),
> present in HTML-Parser in versions prior to 3.63,  while parsing
> HTML entity with invalid UTF-8 character.

Just to make it clear:
not to be confused, there are two independent problems here.

1. The crashing flaw is in a perl 5.10.1 regex evaluation, which is being
investigated. A problem in HTML::Parser facilitates triggering that perl bug, but that perl crash could occur even with fixed HTML::Parser (just needs
more malicious mail text), or even without that module;

2. Jan iankko Lieskovsky of the Red Hat Security Response Team discovered
that the HTML::Parser bug could itself cause an infinite loop, regardless
of the perl regexp bug.
Comment 5 Mark Martinec 2009-10-23 16:40:35 UTC
> 2. Jan iankko Lieskovsky of the Red Hat Security Response Team discovered
> that the HTML::Parser bug could itself cause an infinite loop, regardless
> of the perl regexp bug.

P.S.: This second item is still being investigated. It may not be independent,
could just be another manifestation of the same perl regexp utf8 bug.
Comment 6 Mark Martinec 2009-10-25 16:29:52 UTC
The perl bug has been resolved now, see:

  http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973

The problem and its resolution may have an impact on our
choice of combining utf8 with non-utf8 strings vs. regexp.
In particular, see the last paragraph:


> Resolved by:
>
> commit 0abd0d78a73da1c4d13b1c700526b7e5d03b32d4
> Author: Yves Orton <demerphq at gmail.com>
> Date: Sun Oct 25 20:37:08 2009 +0100
>
> disable non-unicode case insensitive trie matching
>
> Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely
> masked one symptom of the deeper problems.
>
> Also fixes RT #69973, which was a segfault which was exposed by
> 8902bb05, see the ticket for further details.
>
> http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
>
> At the core of this is the problem that in unicode matching a bunch
> of code points have case folding rules beyond just A-Z/a-z. Since
> the case folding rules are decided at runtime by the string, we can't
> use the same TRIE tables for both unicode/non-unicode matching.
>
> Until this is reconciled or some other solution is found case
> insensitive matching only gets the TRIE optimisation when the pattern
> is unicode.
Comment 7 Mark Martinec 2009-11-09 11:46:58 UTC
  Bug 6225: untaint the string in an attempt to work around
  a perl crash - a workaround for [perl #69973] bug:
    Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation
  A regexp and a string should both be utf8, or none of them;
  untainting string also seems to avoid the crash.
Sending lib/Mail/SpamAssassin/PerMsgStatus.pm
Committed revision 834191.
Comment 8 Mark Martinec 2009-11-11 10:14:42 UTC
Preventively local()-ize $1 in a couple of places in PerMsgStatus
in view of in view of [perl #67962]
Sending lib/Mail/SpamAssassin/PerMsgStatus.pm
Committed revision 834996.
Comment 9 Mark Martinec 2009-11-11 10:28:08 UTC
The workaround in Comment 7 seems to avoid the immediate threat, so
I'm closing this bug. The HTML::Parser issue is a bug alright, but is
a minor concern to SpamAssassin, apart from facilitating triggering
of the perl 5.10.1 bug (which we now avoid).

The concern of a disabled TRIE optimisation in perl when the pattern
is not unicode has potential impact on our rule matching, but this
is something hard to solve in the regexp-heavy SpamAssassin. If this
is to be discussed, another PR should be opened.
Comment 10 Mark Martinec 2009-11-18 12:51:52 UTC
See also Bug 6238.
Comment 11 Mark Martinec 2009-11-23 07:40:20 UTC
See also bug 6240, the perl crash there happens in the same spot
in _get_parsed_uri_list, but possibly for a different reason.
Comment 12 Tom Schulz 2009-11-23 11:51:39 UTC
Perhaps it would be better if Spamassassin would refuse to run with the bad
version of Perl. You probably can not find all the places where you could trigger
that bug.