Bug 4255 - Suggestion for new rule: Anti-phishing rule.
Summary: Suggestion for new rule: Anti-phishing rule.
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.0.2
Hardware: All
OS: Linux
Importance: P5 enhancement
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 4372
Depends on:
Blocks:
 
Reported: 2005-04-10 17:22 UTC by Dianne Skoll
Modified: 2013-09-02 09:17 UTC
CC List: 2 users



Attachments
plugin implementation (text/plain), submitted by Theo Van Dinter [HasCLA]

Description Dianne Skoll 2005-04-10 17:22:21 UTC
Hi,

I've had good luck catching some phishing spams with this rule:

full HTTP_CLAIMS_HTTPS  /<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:/is

This catches phish that do something like this:

Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>
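Outside SpamAssassin, the proposed pattern can be sanity-checked against the sample above; a minimal Python sketch (same regex, with the Perl /is flags mapped to re.I | re.S):

```python
import re

# The proposed full-body rule, translated to Python syntax.
HTTP_CLAIMS_HTTPS = re.compile(
    r'<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:',
    re.IGNORECASE | re.DOTALL)

# Sample phish from the report: href is plain http, anchor text claims https.
phish = 'Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>'
print(bool(HTTP_CLAIMS_HTTPS.search(phish)))  # True
```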
Comment 1 John Gardiner Myers 2005-04-10 17:32:10 UTC
Shouldn't that be rawbody, not full?
Comment 2 Dianne Skoll 2005-04-10 17:36:44 UTC
No, it should be full, not rawbody.  Go figure out why, write back if you can't. :-)

Regards,

David.
Comment 3 Dianne Skoll 2005-04-10 19:52:00 UTC
Actually, you may need both "rawbody" and "full".  The problem is that rawbody
matches only a line at a time, so it misses things like:

<a
href="http://foo.com">
https://www.ebay.com
</a>

but "full" will miss base-64 encoded bodies.  I don't see a proper solution to
this until/unless SpamAssassin includes a real HTML parser.
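The rawbody-versus-full difference can be demonstrated directly; a minimal Python sketch applying the same pattern line by line (rawbody-style) versus to the whole text at once (full-style):

```python
import re

PAT = re.compile(r'<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:', re.I | re.S)

# The line-wrapped anchor from the comment above.
wrapped = '<a\nhref="http://foo.com">\nhttps://www.ebay.com\n</a>'

# rawbody-style: one line at a time -- no single line contains the whole anchor
per_line = any(PAT.search(line) for line in wrapped.splitlines())

# full-style: the whole text at once -- the match spans the line breaks
whole = bool(PAT.search(wrapped))

print(per_line, whole)  # False True
```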
Comment 4 Loren Wilton 2005-04-11 00:04:20 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

> Shouldn't that be rawbody, not full?

It really needs to be both.  :-(
Rawbody parses text one line at a time, so if the text wraps a line the rule
will fail.

Comment 5 Loren Wilton 2005-04-11 00:06:41 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

> I don't see a proper solution to
> this until/unless SpamAssassin includes a real HTML parser.

I submitted a bug on this back in 2.63 days that I suspect is still open.
The solution is to allow rawbody to process a body and not a line, just like
body and full (more or less) can.

Comment 6 Theo Van Dinter 2005-04-11 08:41:54 UTC
Subject: Re:   New: Suggestion for new rule: Anti-phishing rule.

On Sun, Apr 10, 2005 at 05:22:21PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I've had good luck catching some phishing spams with this rule:
> full HTTP_CLAIMS_HTTPS  /<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:/is
> This catches phish that do something like this:
> Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>

Seems like HTTPS_IP_MISMATCH ...  Already in 3.1. :)

Comment 7 Dianne Skoll 2005-04-11 09:01:46 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Seems like HTTPS_IP_MISMATCH ...  Already in 3.1. :)

Looks similar, but unless I misunderstand that rule, it only looks at
IP addresses, not domain names.

Regards,

David.

Comment 8 Theo Van Dinter 2005-04-11 09:16:16 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

On Mon, Apr 11, 2005 at 09:01:46AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Looks similar, but unless I misunderstand that rule, it only looks at
> IP addresses, not domain names.

As I recall, if it wasn't limited to IPs (which catches the sample you
provided BTW), the results weren't very good.  I have samples of legit lists
that for whatever reason do this type of thing.

Comment 9 Theo Van Dinter 2005-04-11 09:20:46 UTC
(In reply to comment #3)
> but "full" will miss base-64 encoded bodies.  I don't see a proper solution to
> this until/unless SpamAssassin includes a real HTML parser.

What's wrong with the current HTML parser?  The problem here is trying to
implement the rule in the wrong way.  It's pretty trivial in an eval, and trying
to do it any other way, as mentioned, isn't a very good solution.

Anyway, since this rule is already implemented, closing as WFM.


BTW: if anyone's interested in the HTTPS_IP_MISMATCH results, from a recent
nightly run:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 502218   379437   122781    0.756   0.00    0.00  (all messages)
100.000  75.5523  24.4477    0.756   0.00    0.00  (all messages as %)
  0.079   0.1044   0.0000    1.000   0.54    1.00  HTTPS_IP_MISMATCH
Comment 10 Daniel Quinlan 2005-04-11 11:44:43 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

Doing a rule that looks for:

  link: http://.*
  text: https://.*

Meaning the text is implying greater security than provided.

That is what the bug is suggesting and it might be useful, maybe try
both with and without different hostnames.

Comment 11 Dianne Skoll 2005-04-11 11:48:54 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Doing a rule that looks for:
> 
>   link: http://.*
>   text: https://.*
> 
> Meaning the text is implying greater security than provided.

Yes, exactly.  I can't see why this would lead to false-positives
unless someone has made a (very specific) typo.  I don't even
check the host name or IP address parts.

This won't catch a lot of phish, but judging from what I've seen,
it should get about 15-25% of them.

I'm not extremely familiar with the innards of SpamAssassin's
HTML parser; I'll have to look into that.  But this would clearly
be better done as an eval rule than a regexp rule.

Regards,

David.

Comment 12 Dianne Skoll 2005-04-11 15:48:18 UTC
I don't agree with the WORKSFORME closure, because this rule does something quite
different from the one in 3.1.
Comment 13 Theo Van Dinter 2005-04-11 16:40:35 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

On Mon, Apr 11, 2005 at 03:48:18PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I don't agree with the WORKSFORME closure, because this rule does something quite
> different from the one in 3.1.

Well, it's actually not "quite different".  It's almost exactly the same
thing really.  The only difference is that the current HTTPS_IP_MISMATCH
looks specifically for a URI which goes to an IP and can be for http
or https.  The rule in question is indiscriminate but limited to http.

Ok, so I did up an eval rule and did some tests...  If I make it simply
look for URIs which start "http:" which have anchor text starting with
"https:": (based on the last 14 days)

  0.080   0.0979   0.0000    1.000   1.00    0.01  T_HTTPS_URI_MISMATCH

If I open it up to have anchor text with /\bhttps:/ instead:

  0.123   0.1502   0.0000    1.000   1.00    0.01  T_HTTPS_URI_MISMATCH


So I could have sworn that this was tested and rejected for S/O reasons,
but either I'm on crack or there was something else which made it horrible
at the time (I do have a few FPs in the corpus for legit newsletters
back from earlier this year, FWIW...)  Anyway, I apologize for premature
ticket closure.  We really ought to document the stuff we try so that
in the future we can just look it up and see if this was tried before
or not.  <sigh>

Anyway, committed my eval version, r160982.  :)

Comment 14 Theo Van Dinter 2005-04-11 16:41:04 UTC
closing again. :)
Comment 15 Justin Mason 2005-04-11 17:23:55 UTC
btw I remember one FP I had -- a Paypal Australia URL iirc.  anyway, let's see
what the nightly results say.  that looks pretty good ;)
Comment 16 Theo Van Dinter 2005-04-15 12:23:21 UTC
Apparently my memory is better than I thought:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 505697   381380   124317    0.754   0.00    0.00  (all messages)
100.000  75.4167  24.5833    0.754   0.00    0.00  (all messages as %)
  0.120   0.1453   0.0418    0.776   0.49    0.01  T_HTTPS_URI_MISMATCH
  0.003   0.0039   0.0000    1.000   0.46    1.00  HTTPS_IP_MISMATCH

So it does hit a much better amount of spam (wow HTTPS_IP_MISMATCH hits almost
nothing...?), but the S/O ratio is quite nasty.  All of my FPs are legit:
American Express, Bed Bath & Beyond, Universal Studios, Microsoft, etc.  I
haven't checked with other folks, but I'd imagine it's similar there.
Comment 17 Tom Schulz 2005-04-15 12:35:10 UTC
Is your rule checking the direction of the mismatch?  If they claim to be
using https but are really using http, that would be bad.  If they claim to
be using http but are really using https, that would be ok (stupid, but ok).
Comment 18 Theo Van Dinter 2005-04-15 12:46:32 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

On Fri, Apr 15, 2005 at 12:35:10PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Is your rule checking the direction of the mismatch?  If they claim to be
> using https but are really using http, that would be bad.  If they claim to
> be using http but are really using https, that would be ok (stupid, but ok).

Of course.  The rule looks for:

<a href="http://www.example.com/">https://www.example.com/</a>

My guess is that the marketing people want to know who's following the links
from the email, so you go to their http click-track server, which redirects you
to the secure site.
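The direction of the test, as comment 17 raises, is the whole point; a hedged Python sketch of a one-way check (the function name and regex are illustrative, not the rule's actual code):

```python
import re

# Capture the scheme in the href and the scheme at the start of the anchor text.
ANCHOR = re.compile(r'<a\s[^>]*href="(https?):[^"]*"[^>]*>\s*(https?):', re.I | re.S)

def claims_more_security(html):
    """Hit only when the link is http but the visible text says https;
    the harmless opposite direction (https link, http label) never hits."""
    m = ANCHOR.search(html)
    return bool(m) and m.group(1).lower() == 'http' and m.group(2).lower() == 'https'

print(claims_more_security('<a href="http://x.example/">https://x.example/</a>'))  # True
print(claims_more_security('<a href="https://x.example/">http://x.example/</a>'))  # False
```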

Comment 19 Dianne Skoll 2005-04-15 12:53:38 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.


> ------- Additional Comments From schulz@adi.com  2005-04-15 12:35 -------
> Is your rule checking the direction of the mismatch?  If they claim to be
> using https but are really using http, that would be bad.  If they claim to
> be using http but are really using https, that would be ok (stupid, but ok).

Yes, I'm amazed there are any FP's with this rule.  Wow.

I haven't seen any, but then again, the rule has only hit twice for me.

Regards,

David.

Comment 20 Dianne Skoll 2005-04-15 13:26:13 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From felicity@kluge.net  2005-04-15 12:46 -------

> My guess is that the marketing people want to know who's following
> the links from the email, so you go to their http click-track server,
> which redirects you to the secure site.

Well, I'm not sure if it's the SpamAssassin team's mandate to try to
change behavior, but that is exactly the kind of boneheaded thing that
we should discourage anyone from doing because it hurts security for
everyone.

I'm probably going to add the rule to our product, but maybe not with
a score of 5 (as I use myself.) :-)

Regards,

David.

Comment 21 Loren Wilton 2005-04-16 09:18:06 UTC
Subject: Re: [SPAM]07.19  Suggestion for new rule: Anti-phishing rule.

Many of those spams (but not all, of course) also use a numeric ip.  Perhaps
the other test already does this, but if not, how about something like

rawbody  T__LW_PHISH_2
m'<a\s+[\s\w=\.]*href=\"https?://\d+[^>]+>https://[^\d]'is

Comment 22 Theo Van Dinter 2005-04-16 11:12:35 UTC
Subject: Re:  Suggestion for new rule: Anti-phishing rule.

On Sat, Apr 16, 2005 at 09:18:07AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Many of those spams (but not all, of course) also use a numeric ip.  Perhaps
> the other test already does this, but if not, how about something like

HTTPS_IP_MISMATCH already looks for the IP ones.

Comment 23 Theo Van Dinter 2006-03-07 04:59:37 UTC
*** Bug 4372 has been marked as a duplicate of this bug. ***
Comment 24 Theo Van Dinter 2006-03-16 06:03:01 UTC
Since we pretty frequently get this idea posted to the users@ list (twice yesterday!), I'll be putting up a
plugin to check for this in a minute.  The rule is even more horrible now than it was in the past.
Hopefully this will put this idea to bed, or allow people to figure out a way to deal with the false
positives (by all means, if you come up with a useful way to check for this sans FPs, let us know!)  As an
example of FPs:

<a href="http://www65.americanexpress.com/clicktrk/Tracking?mid=MESSAGEID&msrc=ENG-ALERTS&url=https://www.americanexpress.com/estatement/?12345">https://www.americanexpress.com/estatement/?12345</a>

<A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>


Anyway, here's the results...

looking for href of http: and anchor text of https with mismatching host sections:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.335   0.3586   0.1991    0.643   0.00    0.01  T_HTTPS_HTTP_MISMATCH

looking for href of https?: and anchor text of https with mismatching host sections (I'm not 100%
certain how this hit less ham, but it's still a bad hit rate):

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.329   0.3586   0.1593    0.692   0.00    0.01  T_HTTPS_HTTP_MISMATCH

looking for href of https?: and anchor text of https with mismatching domains from the host section:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.302   0.3340   0.1195    0.737   0.00    0.01  T_HTTPS_HTTP_MISMATCH

so unless there's a coding issue, this rule really doesn't work as a spam detection rule.
Comment 25 Theo Van Dinter 2006-03-16 06:08:53 UTC
Created attachment 3416 [details]
plugin implementation

something like:

loadplugin Mail::SpamAssassin::Plugin::TVD
body T_HTTPS_HTTP_MISMATCH eval:check_https_http_mismatch()
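The attachment's Perl isn't reproduced here, but based on comment 24's description (href of https?:, anchor text of https, mismatching host sections) the check has roughly this shape; a speculative Python sketch, not the plugin's actual code:

```python
import re
from urllib.parse import urlparse

# Pair each href with anchor text that itself looks like an https URL.
ANCHOR = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>\s*(https://[^\s<]+)', re.I | re.S)

def check_https_http_mismatch(html):
    # Hit when the visible text is an https URL whose host differs from
    # the host actually linked in the href.
    for href, text in ANCHOR.findall(html):
        if urlparse(href).hostname != urlparse(text).hostname:
            return True
    return False

print(check_https_http_mismatch(
    '<a href="http://1.2.3.4/scam">https://www.paypal.com</a>'))  # True
```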
Comment 26 Theo Van Dinter 2006-03-16 18:09:29 UTC
also tested, domain mismatch between https? href and https? anchor text:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    29213     5299    0.846   0.00    0.00  (all messages)
0.00000  84.6459  15.3541    0.846   0.00    0.00  (all messages as %)
  0.898   0.7086   1.9438    0.267   0.00    0.01  T_HTTPS_HTTP_MISMATCH
Comment 27 Philip Prindeville 2006-03-16 19:40:37 UTC
(In reply to comment #25)
> Created an attachment (id=3416) [edit]
> plugin implementation
> 
> something like:
> 
> loadplugin Mail::SpamAssassin::Plugin::TVD
> body T_HTTPS_HTTP_MISMATCH eval:check_https_http_mismatch()

Does this work with 3.1.0?

I tried to run it, and it complained that it couldn't find the Logger module,
which seems to be unnecessary anyway... so I commented that line out.

Running it again, I saw:

debug: registering glue method for check_https_http_mismatch
(Mail::SpamAssassin::Plugin::TVD=HASH(0x1940350))
Failed to run T_HTTPS_HTTP_MISMATCH SpamAssassin test, skipping:
        (Can't locate object method "get_uri_detail_list" via package
"Mail::SpamAssassin::PerMsgStatus" at
/usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Plugin/TVD.pm line 46.
)


What am I missing?


Comment 28 Theo Van Dinter 2006-03-16 20:10:17 UTC
(In reply to comment #27)
> Does this work with 3.1.0?

absolutely.  it won't work with earlier versions though.

> I tried to run it, and it complained that it couldn't find the Logger module,
> which seems to be unnecessary anyway... so I commented that line out.
> 
> Running it again, I saw:
> 
> debug: registering glue method for check_https_http_mismatch
> (Mail::SpamAssassin::Plugin::TVD=HASH(0x1940350))
> Failed to run T_HTTPS_HTTP_MISMATCH SpamAssassin test, skipping:
>         (Can't locate object method "get_uri_detail_list" via package
> "Mail::SpamAssassin::PerMsgStatus" at
> /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Plugin/TVD.pm line 46.
> )
> 
> 
> What am I missing?

you apparently don't have 3.1 installed.  both Logger.pm and get_uri_detail_list() were added in 3.1.0.
Comment 29 Theo Van Dinter 2006-03-17 04:10:08 UTC
A few things.

First, some more prodding around came up with a set of rules/plugin that gets decent results for me.  I 
added it to my sandbox for testing in the nightly mass-check.

Second, it turns out the plugin I put up caused me to find a really annoying bug in SpamAssassin, at 
least 3.1, possibly 3.2, I haven't done enough debugging yet.  In short, put the following above the first 
blank line in the function, and it'll kluge around the issue for now:

  keys %{$uris};

ie:

sub check_https_http_mismatch {
  my ($self, $permsgstatus) = @_;
  my $uris = $permsgstatus->get_uri_detail_list();
  keys %{$uris};

for more information, see bug 4829.
Comment 30 Philip Prindeville 2006-06-24 18:21:09 UTC
Hmm...  What happens if we allow the label to say "http:" but the link is really
to "https:" (but not vice versa)?

And what if all but the host-portion of the domain matches?  I.e. that
www.americanexpress.com and www65.americanexpress.com would be equivalent?

We'd ignore the file-path portion of the URL (sigh) for now...

Can anyone make these two changes and recompute the HAM/SPAM ratios on a test
set of data?
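The second suggestion (treating www.americanexpress.com and www65.americanexpress.com as equivalent) amounts to comparing only the last two labels of each host; a naive Python sketch (the function is hypothetical, and a real implementation would need the public-suffix list to handle co.uk-style domains):

```python
def same_registered_domain(host_a, host_b):
    # Compare only the SLD.TLD tail, ignoring any subdomain labels.
    # Naive: wrong for multi-label public suffixes like co.uk.
    return host_a.lower().split('.')[-2:] == host_b.lower().split('.')[-2:]

print(same_registered_domain('www.americanexpress.com',
                             'www65.americanexpress.com'))  # True
print(same_registered_domain('www.paypal.com', '1.2.3.4'))  # False
```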


Comment 31 Loren Wilton 2006-06-25 03:55:29 UTC
I think Theo did have stats on the first of those cases once.  But it might not 
hurt to do it again.  I seem to recall once long ago running that test and 
getting just an annoying number of FPs from it.  Checking http://\d vs https:// 
produces considerably better results as I recall.

I don't recall anyone ever trying the second case.  Would be interesting to 
know how it turns out.
Comment 32 Tom Schulz 2006-06-26 18:34:27 UTC
> Hmm...  What happens if we allow the label to say "http:" but the link is
> really to "https:" (but not vice versa)?

See comments 17, 18, and 19.
Comment 33 Lorens Kockum 2013-09-02 09:11:42 UTC
(In reply to Theo Van Dinter from comment #24) 
> Hopefully this will put this idea to bed, or allow people to figure out a
> way to deal with the false 
> positives (by all means, if you come up with a useful way to check for this
> sans FPs, let us know!)  As an 
> example of FPs:
> 
> <a
> href="http://www65.americanexpress.com/clicktrk/Tracking?mid=MESSAGEID&msrc=ENG-
> ALERTS&url=https://www.americanexpress.com/estatement/?12345">https://
> www.americanexpress.com/estatement/?12345</a>
> 
> <A
> HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">https://
> www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>


Can't do anything about the second one, but I am shouldering an annoying amount of phishing that misrepresents the phishing HREF URL with a legitimate URL in the text label.

After coming here to find out why this was not already a high-scoring SA rule, I wrote this, which only checks that domains are the same to the second level. I think you'd need a second rule if you want to hit on falsified co.uk and such.

full MISLEADING_URL_LABEL m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!([^"/]*\.)?\2["/])([^/]*)}
describe MISLEADING_URL_LABEL An A label seems to be a URL but its SLD.TLD differs from the actual URL in the HREF

I would be very interested in learning what effect this has on a big corpus!
Comment 34 Lorens Kockum 2013-09-02 09:17:11 UTC
(In reply to Lorens Kockum from comment #33)
> full MISLEADING_URL_LABEL
> m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!
> ([^"/]*\.)?\2["/])([^/]*)}

Except that it should be 

full MISLEADING_URL_LABEL m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!([^"/]*\.)?\2[</])}

Sigh. Mondays.
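The corrected pattern can be tested locally; a Python translation, with sample inputs adapted from the false-positive examples in comment 24 (the phish sample is hypothetical):

```python
import re

# Python translation of the corrected MISLEADING_URL_LABEL pattern above.
MISLEADING_URL_LABEL = re.compile(
    r'href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"'
    r'[^>]*>https?://(?!([^"/]*\.)?\2[</])')

phish = '<a href="http://evil.example/x">https://www.paypal.com/login</a>'
legit = ('<a href="http://www65.americanexpress.com/clicktrk/Tracking?mid=MESSAGEID'
         '&msrc=ENG-ALERTS&url=https://www.americanexpress.com/estatement/?12345">'
         'https://www.americanexpress.com/estatement/?12345</a>')

print(bool(MISLEADING_URL_LABEL.search(phish)))  # True: label SLD.TLD != href SLD.TLD
print(bool(MISLEADING_URL_LABEL.search(legit)))  # False: both americanexpress.com
```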