Bug 6389 - FPs on DOS_HIGHBIT_HDRS_BODY
Summary: FPs on DOS_HIGHBIT_HDRS_BODY
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.3.1
Hardware: All All
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-26 07:33 UTC by John Wilcock
Modified: 2010-04-12 18:50 UTC



Attachments (name | type | status | submitter [CLA status]):
Sample FP message | text/plain | None | John Wilcock [NoCLA]
a very simple test email written in Chinese that triggers DOS_HIGHBIT_HDRS_BODY | message/rfc822 | None | lee_yiu_chung@yahoo.com [NoCLA]
Another FP | text/plain | None | John Wilcock [NoCLA]

Description John Wilcock 2010-03-26 07:33:21 UTC
Created attachment 4721
Sample FP message 

I've seen a few FPs on this rule from genuine ham sent by one of my colleagues using Thunderbird 3.0.4 - not all her mail, but specifically replies to certain messages with UTF-8 encoding. 

I'm seeing very few occurrences, but given the high default score of this rule (3.8) I've set the severity to normal rather than minor.
Comment 1 lee_yiu_chung 2010-04-01 15:15:13 UTC
Created attachment 4730
a very simple test email written in Chinese that triggers DOS_HIGHBIT_HDRS_BODY
Comment 2 lee_yiu_chung 2010-04-01 15:21:32 UTC
I would say this rule is unfriendly to non-English email. I have attached an email written in Chinese, whose From header contains "李耀宗" (my Chinese name) and whose Subject and body are both "這是測試" (meaning "This is a test").

I am considering disabling this rule completely on my mail server; it is totally unfriendly to Chinese. There are similar rules that are unfriendly to non-English mail, as in bug 5859 (which I also reported, and which has still not been fixed after 2 years).
Comment 3 lee_yiu_chung 2010-04-01 15:59:26 UTC
(In reply to comment #2)
> I would say, this rule is unfriendly to non-English email. Attached an email
> written in Chinese, which From header contains "李耀宗", which is my Chinese
> name), with both Subject and content are "這是測試" (meaning "This is a test").
> 
> I have to consider to disable this rule completely in my mail server. Totally
> unfriendly to Chinese. There are similar non-English unfriendly rules, as in
> bug 5859 (which is reported by me too, and still not fixed for 2 years.)

Further comment: this rule would be triggered by practically all emails written in Chinese, Japanese, or Korean (or any other multibyte charset), which would probably contain such characters in the From and Subject headers. I would urge that this rule be removed.
Comment 4 Adam Katz 2010-04-06 23:35:26 UTC
Regarding comment 0 and its sample FP attachment 4721, it looks like that message should have hit ALL_TRUSTED (see the documentation for internal_networks).  While this doesn't solve the bug, it would help alleviate the spammy-messages-from-colleagues problem.

Hm.  This header from attachment 4730 is quite interesting:

X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id o31FCcI16161

I believe this is reporting that ctimail's mail system converted the quoted-printable headers to 8bit, which triggered the rule.  Plugging that header into Google shows 19k hits, which is small but not negligible.  Even my own sendmail server has added it in the past.  Comparative data: X-Spam-Status (236k), X--MailScanner (10k), X-Spam-Flag (27k), X-Greylist (17k), X-X-Sender (9k), X-Sieve (7k), X-Received (16k) ... (searches performed in quotes with a second query being "Message-ID" to ensure we're looking at email headers).

I've placed a possible fix into our QA system (20_bug_6389.cf in my sandbox) to sanity-check it, containing the following code (the first rule is just a popularity test for that header):

header __HAS_XMIME_AUTOCONV     exists:X-MIME-Autoconverted
header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to 8bit/
meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64 && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

Sadly, this doesn't help the first sample. Appending "&& !__RCVD_VIA_APNIC_LE" would also fail to solve it, since that message is from France and not Asia. According to yesterday's numbers, that extra requirement would also reduce the spam hit rate by 34% and the ham hit rate by under 20%, taking spam from 1.1268% to 0.7423% and ham to somewhere between 0.0261% and the current 0.0326%.
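
As a rough sanity check on the percentages in the paragraph above, the relative reductions can be recomputed from the before and after hit rates. This is only an illustrative Python sketch: `relative_reduction` is a hypothetical helper, not part of the rule QA tooling, and the inputs are simply the figures quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Hit rates (percent of corpus) quoted in the comment above.
spam_drop = relative_reduction(1.1268, 0.7423)
# Best case for ham: the lower bound of the quoted range.
ham_drop_max = relative_reduction(0.0326, 0.0261)

print(f"spam hits drop by about {spam_drop:.1f}%")
print(f"ham hits drop by at most about {ham_drop_max:.1f}%")
```

With these inputs the spam reduction works out to roughly 34% and the ham reduction to just under 20%.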

I'm disheartened by the French FP as it was composed with the latest version of Thunderbird (3.0.4, WinXP, French build), but at least configuring internal_networks would solve it for that particular user's internal company mail.  For a full fix, I can think of nothing but removing this rule.  The question becomes:  how many FPs does this rule really create, i.e. is this an isolated incident?
Comment 5 lee_yiu_chung 2010-04-07 06:20:44 UTC
(In reply to comment #4)
> Regarding comment 0 and its sample FP attachment 4721, it looks like that
> should have been ALL_TRUSTED (see the documentation for internal_networks). 
> While this doesn't solve the bug, it would help alleviate the
> spammy-messages-from-colleagues problem.
> 
> Hm.  This header from attachment 4730 is quite interesting:
> 
> X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id
> o31FCcI16161
> 
> I believe this is reporting that ctimail's mail system converted the
> quoted-printable headers to 8bit, which triggered the rule.  Plugging that
> header into Google shows 19k hits, which is small but not negligible.  Even my
> own sendmail server has added it in the past.  Comparative data: X-Spam-Status
> (236k), X--MailScanner (10k), X-Spam-Flag (27k), X-Greylist (17k), X-X-Sender
> (9k), X-Sieve (7k), X-Received (16k) ... (searches performed in quotes with a
> second query being "Message-ID" to ensure we're looking at email headers).
> 
> I've placed a possible fix into our QA system (20_bug_6389.cf in my sandbox) to
> sanity-check it, containing the following code (the first rule is just a
> popularity test for that header):
> 
> header __HAS_XMIME_AUTOCONV     exists:X-MIME-Autoconverted
> header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to
> 8bit/
> meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64
> && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS &&
> !__MIME_QP_TO_8BIT
> 
> Sadly, this doesn't help the first sample. Appending "&& !__RCVD_VIA_APNIC_LE"
> would also fail to solve it since it is from France and not Asia. According to
> yesterday's numbers, that extra requirement would also reduce the spam hit by
> 34% and the ham by under 20%, reducing 1.1268% spam to 0.7423% and the ham to
> somewhere between 0.0261% and the current 0.0326%.
> 
> I'm disheartened by the French FP as it was composed with the latest version of
> Thunderbird (3.0.4, WinXP, French build), but at least configuring
> internal_networks would solve it for that particular user's internal company
> mail.  For a full fix, I can think of nothing but removing this rule.  The
> question becomes:  how many FPs does this rule really create, i.e. is this an
> isolated incident?


According to my email sample (attachment 4730), the email is scanned by SpamAssassin before the QP-to-8bit conversion (note the mail id o31FCcI16161):

Received: from smtp1o.ctimail.com (smtp1 [203.186.94.57])
	by popo.ctimail.com (8.11.1/8.11.1) with ESMTP id o31FCcI16161
	for <leeyc0@popo.ctimail.com>; Thu, 1 Apr 2010 23:12:38 +0800 (CST)
Received: from iguard1-206.hkbn.net (iguard1-206.hkbn.net [203.186.220.206])
	by smtp1o.ctimail.com (8.12.11/8.12.11) with ESMTP id o31FCalG014728
	for <leeyc0@hkbn.net>; Thu, 1 Apr 2010 23:12:38 +0800 (HKT)
Received: from violet.alumni.cuhk.net ([202.45.188.23])
  by iguard1.hkbn.net with ESMTP; 01 Apr 2010 23:12:37 +0800
Received: from asavgw1.alumni.cuhk.net (asavgw1.alumni.cuhk.net [202.45.188.44])
	by violet.alumni.cuhk.net (8.14.3/8.14.3) with ESMTP id o31FCUvr000701
	for <leeyc0@alumni.cuhk.net>; Thu, 1 Apr 2010 23:12:31 +0800
Received: from ieaa.ie.cuhk.edu.hk ([137.189.97.6])
  by asavgw1.alumni.cuhk.net with ESMTP; 01 Apr 2010 23:12:36 +0800
Received: from smtp.ctimail.com ([203.186.94.58] helo=smtpo.ctimail.com)
	by ieaa.ie.cuhk.edu.hk with esmtp (Exim 4.63)
	(envelope-from <leeyc0@alumni.cuhk.net>)
	id 1NxM4R-0006GD-8l
	for leeyc0@ieaa.org; Thu, 01 Apr 2010 23:12:36 +0800
Received: from [127.0.0.1] (119247234247.ctinets.com [119.247.234.247])
	by smtpo.ctimail.com (8.12.11/8.12.11) with ESMTP id o31FCROw020860
	for <leeyc0@ieaa.org>; Thu, 1 Apr 2010 23:12:27 +0800 (HKT)
X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id o31FCcI16161

I would say the real bug is in 20_html_tests.cf, which says:

body __HIGHBITS                     /(?:[\x80-\xff].?){4}/

I think it should be:
rawbody __HIGHBITS                     /(?:[\x80-\xff].?){4}/
Comment 6 lee_yiu_chung 2010-04-07 06:25:04 UTC
> I would say, the real bug should be in 20_html_tests.cf, which says
> 
> body __HIGHBITS                     /(?:[\x80-\xff].?){4}/
> 
> I think it should be 
> rawbody __HIGHBITS                     /(?:[\x80-\xff].?){4}/

Sorry, rawbody doesn't fix it either. According to

http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Conf.html

"The 'raw body' of a message is the raw data inside all textual parts. The text will be decoded from base64 or quoted-printable encoding..."

That is, almost all non-English mail would trigger this rule...
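
The effect described in the last few comments can be reproduced outside SpamAssassin. The sketch below is Python rather than a SpamAssassin rule, and it assumes that `body` and `rawbody` rules see the text after transfer decoding, as the documentation quoted above says. It applies the same __HIGHBITS pattern to the test phrase from attachment 4730 before and after quoted-printable decoding; the variable names are illustrative.

```python
import quopri
import re

# Same pattern as the __HIGHBITS rule, applied to bytes.
HIGHBITS = re.compile(rb"(?:[\x80-\xff].?){4}")

text = "這是測試"                             # "This is a test"
decoded = text.encode("utf-8")                # what a body/rawbody rule would see
on_the_wire = quopri.encodestring(decoded)    # 7-bit quoted-printable form

# Every UTF-8 byte of these CJK characters has the high bit set.
assert all(b >= 0x80 for b in decoded)

print("decoded body matches:   ", bool(HIGHBITS.search(decoded)))
print("QP-encoded text matches:", bool(HIGHBITS.search(on_the_wire)))
print("plain ASCII matches:    ", bool(HIGHBITS.search(b"This is a test")))
```

Only the decoded form matches, which is consistent with the observation that switching the rule from `body` to `rawbody` makes no difference for this message.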
Comment 7 John Wilcock 2010-04-07 07:39:55 UTC
Created attachment 4735
Another FP

My first FP sample was indeed "saved" by ALL_TRUSTED (and BAYES_00). 

Here's another one, an opt-in newsletter that was only saved by RCVD_IN_RP_CERTIFIED and RCVD_IN_RP_SAFE (it also had a valid DKIM signature, which I've no doubt invalidated by obfuscating the recipient's address). 

Any messages from people with highbit characters in their names, and with highbit characters in the subject and body will potentially hit the rule - and that is inevitably a fairly common scenario for non-English mail. 

What appears to be saving this rule from more FPs is the check for base64 encoding of the headers. Thunderbird, for instance, appears to use quoted-printable encoding for ISO-8859-1, its default charset, and only switches to base64 for UTF-8 and other multibyte charsets. 

Does the rule actually hit much spam that wouldn't be caught otherwise? On my two low-volume servers I have only one spam hit that would have scored under 10 points without this rule, and none that would have been FNs.
Comment 8 John Wilcock 2010-04-07 08:18:50 UTC
(In reply to comment #7)
> What appears to be saving this rule from more FPs is the check for base64
> encoding of the headers. Thunderbird, for instance, appears to use
> quoted-printable encoding for ISO-8859-1, its default charset, and only
> switches to base64 for UTF-8 and other multibyte charsets. 

Thinking a bit more about this, could this be the basis for reducing the FP rate of the rule? 

Something like the following:

header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i 
header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i 

meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT
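
To illustrate what the two header sub-rules above would and would not match, here is a small Python sketch that runs the same regular expression over hand-built RFC 2047 encoded-words. It is not SpamAssassin code: `encoded_word` is a hypothetical helper, and the sample names are made up for the example.

```python
import base64
import re

# Same pattern as __FROM_1BYTE_B64 / __SUBJ_1BYTE_B64, case-insensitive.
ONE_BYTE_B64 = re.compile(
    r"=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?", re.IGNORECASE
)


def encoded_word(text: str, charset: str) -> str:
    """Build a base64 RFC 2047 encoded-word (hypothetical helper)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


samples = {
    "iso-8859-1, base64": encoded_word("René Dupont", "iso-8859-1"),
    "windows-1252, base64": encoded_word("René Dupont", "windows-1252"),
    "utf-8, base64": encoded_word("李耀宗", "utf-8"),
    "iso-8859-1, quoted-printable": "=?iso-8859-1?Q?Ren=E9_Dupont?=",
    "koi8-r, base64": encoded_word("Иван", "koi8-r"),
}

for label, header in samples.items():
    print(f"{label:30} {bool(ONE_BYTE_B64.search(header))}")
```

Only the first two samples match. The last one is worth noting: the charset is normally written KOI8-R, which the `koi-8r?` alternative does not match as written.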
Comment 9 lee_yiu_chung 2010-04-07 11:20:22 UTC
(In reply to comment #8)
> Thinking a bit more about this, could this be the basis for reducing the FP
> rate of the rule? 
> 
> Something like the following:
> 
> header __FROM_1BYTE_B64 From:raw =~
> /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i 
> header __SUBJ_1BYTE_B64 Subject:raw =~
> /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i 
> 
> meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 &&
> __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

In my opinion, __HIGHBITS is fundamentally flawed and should never have existed.
Whether a "body" rule or a "rawbody" rule is used, the message body is always decoded from base64 or quoted-printable first. And __HIGHBITS matches runs of high-bit octets (to be exact, [one high-bit octet, optionally followed by any character] repeated 4 times), which is always the case for East Asian languages.

Is there any way to check the message body without decoding base64 or quoted-printable?
Comment 10 Justin Mason 2010-04-07 11:53:33 UTC
(In reply to comment #9)
> In my opinion, __HIGHBITS is fundamentally flawed that should never exist.
> No matter using "body" rule or "rawbody" rule the message body is always
> decoded from base64 or quoted-printable. And __HIGHBITS matches streams of high
> bits octets (to be exact, [one high bit octet + any character] repeated 4
> times), which is always the case for East Asian languagues.

hi -- as far as I know, the intention of __HIGHBITS is to *detect* such charsets as the East Asian ones, so that it can be used in meta rules to avoid false positives.
Comment 11 lee_yiu_chung 2010-04-07 12:37:36 UTC
> hi -- as far as I know, the intention of __HIGHBITS is to *detect* such
> charsets as the East Asian ones, so that it can be used in meta rules to avoid
> false positives.

Understood. Then it is DOS_HIGHBIT_HDRS_BODY that is flawed, not __HIGHBITS.
Comment 12 Daryl C. W. O'Shea 2010-04-09 03:43:59 UTC
I agree, the rule looks bad.  I've commented it out.  It should disappear from updates this weekend.

[dos@cyan dos]$ svn ci -m "bug 6389: comment out DOS_HIGHBIT_HDRS_BODY due to FPs"
Authentication realm: <https://svn.apache.org:443> ASF Committers
Password for 'dos':
Sending        dos/70_other.cf
Transmitting file data .
Committed revision 932235.
Comment 13 Daryl C. W. O'Shea 2010-04-09 03:44:29 UTC
Closing as fixed.
Comment 14 lee_yiu_chung 2010-04-11 11:03:50 UTC
What about DOS_HIGHBIT_HDRS_BODY_BUG6389? I found that this rule is being distributed through updates. As I mentioned before, it doesn't seem to fix this problem, so it should be removed too.
Comment 15 Adam Katz 2010-04-12 18:50:51 UTC
Just a follow-up because I had some investigations running when this was closed...

Rules
------------------
# From rulesrc/sandbox/khopesh/20_bug_6389.cf on trunk at r932438
# http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/khopesh/20_bug_6389.cf?revision=932438&view=markup

# just a raw numbers check:
header __HAS_XMIME_AUTOCONV exists:X-MIME-Autoconverted
tflags __HAS_XMIME_AUTOCONV nice

# possible fix to bug 6389
header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to 8bit/
tflags __MIME_QP_TO_8BIT nice

# John Wilcock's proposed substitutions for __..._ENCODED_B64 (comment 8)
header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i

meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64 && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# Daryl O'Shea (DOS) + Adam Katz (KHOP) + John Wilcock version
meta FROM_SUBJ_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# assuming recipients won't also be highbit'd ("highbitten?")
header __TO_1BYTE_B64 To:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
meta FROM_SUBJ_NOTO_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT && !__TO_1BYTE_B64


Results from 2010-04-11 (non-net run)
------------------

http://ruleqa.spamassassin.org/20100411-r932853-n/%2FDOS_HIGHB|MIME_QP_TO_|HAS_XMIME_|_1BYTE_B64|_ENCODED_B64|FROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1775   0.0359   0.970    0.82    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.0718   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0714   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.5069   0.2155   0.702    0.62   (n/a)  __SUBJ_1BYTE_B64
 0.0928   0.1333   0.410    0.53   (n/a)  __FROM_1BYTE_B64
 2.3337   2.3339   0.500    0.51   (n/a)  __SUBJECT_ENCODED_B64
 1.3552   1.7032   0.443    0.50   (n/a)  __FROM_ENCODED_B64
 0.0004   0.1519   0.003    0.31   (n/a)  __TO_1BYTE_B64
 6.2081   1.0613   0.854    0.24   (n/a)  __HAS_XMIME_AUTOCONV
 6.1458   0.9837   0.862    0.24   (n/a)  __MIME_QP_TO_8BIT

That rules out the suggestions from comment 8.  Because Daryl removed the original rule, it's not listed here, but my modification did little to nothing.
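
For readers unfamiliar with the table, the S/O column appears to be the spam share of all hits, that is SPAM% / (SPAM% + HAM%). That reading is an inference from the numbers above rather than something stated in this thread; the short Python check below recomputes it for a few rows, with `spam_over_overall` as an illustrative helper name.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Fraction of a rule's hits that are spam, from per-corpus hit rates."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit nothing")
    return spam_pct / total


# (SPAM%, HAM%, S/O as printed) copied from the non-net table above.
rows = {
    "T_DOS_HIGHBIT_HDRS_BODY_BUG6389": (1.1775, 0.0359, 0.970),
    "T_FROM_SUBJ_BODY_8BIT": (0.0718, 0.0021, 0.972),
    "__SUBJ_1BYTE_B64": (0.5069, 0.2155, 0.702),
    "__FROM_1BYTE_B64": (0.0928, 0.1333, 0.410),
}

for name, (spam, ham, printed) in rows.items():
    computed = spam_over_overall(spam, ham)
    print(f"{name:34} computed={computed:.3f} printed={printed:.3f}")
```

The computed values agree with the printed column to three decimal places for these rows.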

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham:  0  79.31%   69 *******************************
  scoremap  ham:  1   3.45%    3 *
  scoremap  ham:  2  16.09%   14 ******
  scoremap  ham:  3   1.15%    1
  scoremap spam:  0   2.85%  413 *
  scoremap spam:  1   0.15%   22
  scoremap spam:  2  18.89% 2734 *******
  scoremap spam:  3   3.70%  536 *
  scoremap spam:  4   4.40%  637 *
  scoremap spam:  5  12.40% 1794 ****
  scoremap spam:  6   5.51%  797 **
  scoremap spam:  7   7.81% 1130 ***
  scoremap spam:  8  10.22% 1479 ****
  scoremap spam:  9   5.66%  819 **
  scoremap spam: 10   7.17% 1037 **
  scoremap spam: 11   5.80%  839 **
  scoremap spam: 12   4.35%  629 *
  scoremap spam: 13   2.74%  396 *
  scoremap spam: 14   2.64%  382 *
  scoremap spam: 15   1.53%  221
  scoremap spam: 16   1.29%  187
  scoremap spam: 17   0.98%  142
  scoremap spam: 18   0.53%   76
  scoremap spam: 19   0.53%   76
  scoremap spam: 20   0.27%   39
  scoremap spam: 21   0.20%   29
  scoremap spam: 22   0.12%   17
  scoremap spam: 23   0.08%   12
  scoremap spam: 24   0.10%   15
  scoremap spam: 25   0.01%    2
  scoremap spam: 26   0.01%    2
  scoremap spam: 28   0.02%    3
  scoremap spam: 29   0.01%    2
  scoremap spam: 30   0.01%    1
  scoremap spam: 32   0.02%    3
  scoremap spam: 33   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 72%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 55%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%

Note that even though this was a non-net run, the only published non-net rule with more than 50% overlap is RDNS_NONE.  In a scan completely lacking network tests, the score-map would be even lower and the rule would appear more valuable.


Results from 2010-04-10 (net run)
------------------

http://ruleqa.spamassassin.org/20100410-r932679-n/%2FDOS_HIGHB%7CMIME_QP_TO_%7CHAS_XMIME_%7C_1BYTE_B64%7C_ENCODED_B64%7CFROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1755   0.0116   0.990    0.86    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.5164   0.0390   0.930    0.76   (n/a)  __SUBJ_1BYTE_B64
 0.0685        0   1.000    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0682        0   1.000    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.0854   0.0435   0.663    0.61   (n/a)  __FROM_1BYTE_B64
 2.3165   2.0477   0.531    0.52   (n/a)  __SUBJECT_ENCODED_B64
 1.3498   1.6534   0.449    0.51   (n/a)  __FROM_ENCODED_B64
 0.0004   0.0099   0.039    0.47   (n/a)  __TO_1BYTE_B64
 6.2616   1.1081   0.850    0.23   (n/a)  __HAS_XMIME_AUTOCONV
 6.1999   1.0350   0.857    0.23   (n/a)  __MIME_QP_TO_8BIT

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham: -2  65.38%   17 **************************
  scoremap  ham:  0  26.92%    7 **********
  scoremap  ham:  1   3.85%    1 *
  scoremap  ham:  4   3.85%    1 *
  scoremap spam:  0   0.05%    7
  scoremap spam:  1   0.20%   29
  scoremap spam:  2   0.78%  113
  scoremap spam:  3   0.55%   80
  scoremap spam:  4   1.11%  161
  scoremap spam:  5   1.56%  226
  scoremap spam:  6   2.57%  373 *
  scoremap spam:  7   3.76%  546 *
  scoremap spam:  8   4.86%  705 *
  scoremap spam:  9   6.58%  955 **
  scoremap spam: 10   7.68% 1114 ***
  scoremap spam: 11   8.85% 1284 ***
  scoremap spam: 12   8.48% 1230 ***
  scoremap spam: 13   8.19% 1188 ***
  scoremap spam: 14   8.07% 1171 ***
  scoremap spam: 15   6.81%  989 **
  scoremap spam: 16   6.02%  873 **
  scoremap spam: 17   5.29%  767 **
  scoremap spam: 18   4.36%  632 *
  scoremap spam: 19   3.41%  495 *
  scoremap spam: 20   2.56%  371 *
  scoremap spam: 21   2.06%  299
  scoremap spam: 22   1.45%  211
  scoremap spam: 23   1.13%  164
  scoremap spam: 24   0.87%  126
  scoremap spam: 25   0.74%  108
  scoremap spam: 26   0.59%   85
  scoremap spam: 27   0.28%   40
  scoremap spam: 28   0.19%   27
  scoremap spam: 29   0.10%   14
  scoremap spam: 30   0.24%   35
  scoremap spam: 31   0.10%   14
  scoremap spam: 32   0.11%   16
  scoremap spam: 33   0.10%   14
  scoremap spam: 34   0.05%    7
  scoremap spam: 35   0.08%   11
  scoremap spam: 36   0.08%   12
  scoremap spam: 37   0.03%    4
  scoremap spam: 38   0.03%    4
  scoremap spam: 39   0.01%    2
  scoremap spam: 40   0.03%    4
  scoremap spam: 41   0.01%    2
  scoremap spam: 42   0.01%    2
  scoremap spam: 43   0.01%    1
  scoremap spam: 47   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 95%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BRBL_LASTEXT      1%
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 73%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 61%  T_DOS_HIGHBIT_HDRS...6389  T_RCVD_IN_ANBREP_BL       1%
 56%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 54%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%
 50%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_E4_51_100 3%


Conclusion
------------------

This rule is not worthwhile in network-enabled checks.  Without network tests, this rule may be extremely valuable.  Assuming we're interested in developing offline-only tests, this is worth revisiting once we have more corpora from areas that use non-Latin character sets (specifically China), especially if we can pin it to not fire on network tests.
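
One way to read the two score maps in support of this conclusion is to ask what share of the spam hits scored below 5, SpamAssassin's default required_score, since those are the messages where an extra rule could change the verdict. The Python sketch below sums the relevant buckets. It assumes each scoremap bucket is the message's total score reduced to a whole number, which this thread does not spell out, and the names are illustrative.

```python
# Percentage of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 spam hits per score bucket,
# copied from the scoremaps above (only buckets 0 to 4 are needed here).
NON_NET_SPAM = {0: 2.85, 1: 0.15, 2: 18.89, 3: 3.70, 4: 4.40}
NET_SPAM = {0: 0.05, 1: 0.20, 2: 0.78, 3: 0.55, 4: 1.11}

THRESHOLD = 5  # SpamAssassin's default required_score


def share_below(buckets: dict, threshold: int) -> float:
    """Sum the percentages of all buckets strictly below `threshold`."""
    return sum(pct for score, pct in buckets.items() if score < threshold)


non_net = share_below(NON_NET_SPAM, THRESHOLD)
net = share_below(NET_SPAM, THRESHOLD)
print(f"non-net run: {non_net:.2f}% of spam hits below {THRESHOLD}")
print(f"net run:     {net:.2f}% of spam hits below {THRESHOLD}")
```

On these figures roughly 30% of the spam hits fall below the threshold in the non-net run, against under 3% in the net run.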


I have removed the tests from SVN (satisfying comment #14).  They will disappear from the ruleqa system in the next day or two.

$ svn delete --force 20_bug_6389.cf
D         20_bug_6389.cf
$ svn commit -m "Bug closed.  I posted my observations, including this file's contents and stats for ent and non-net runs, on bug 6389, comment 14" 20_bug_6389.cf
Deleting       20_bug_6389.cf

Committed revision 933340.
$