SA Bugzilla – Bug 4255
Suggestion for new rule: Anti-phishing rule.
Last modified: 2013-09-02 09:17:11 UTC
Hi, I've had good luck catching some phishing spams with this rule:

full HTTP_CLAIMS_HTTPS /<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:/is

This catches phish that do something like this:

Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>
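The proposed rule can be sanity-checked outside SpamAssassin. Below is a Python sketch (the rule itself is SpamAssassin config with a Perl regex; Python is used here only for illustration, with the /is flags mapped to re.IGNORECASE | re.DOTALL). The sample strings are the phish example from this report plus a hypothetical legitimate https link.

```python
import re

# Python port of the proposed HTTP_CLAIMS_HTTPS regex: an <a> tag whose
# attributes contain "http:" but whose visible anchor text claims "https:".
HTTP_CLAIMS_HTTPS = re.compile(
    r'<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:',
    re.IGNORECASE | re.DOTALL)

# The phish sample from this report, and a made-up legitimate https link.
phish = 'Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>'
legit = 'Log in: <a href="https://www.paypal.com/">https://www.paypal.com</a>'

print(bool(HTTP_CLAIMS_HTTPS.search(phish)))  # the http/https mismatch is caught
print(bool(HTTP_CLAIMS_HTTPS.search(legit)))  # a genuine https href is not flagged
```

Note that `http:` cannot match inside `https:` (the colon must follow `http` directly), which is why the legitimate link passes.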
Shouldn't that be rawbody, not full?
No, it should be full, not rawbody. Go figure out why, write back if you can't. :-) Regards, David.
Actually, you may need both "rawbody" and "full". The problem is that rawbody matches only a line at a time, so it misses things like:

<a href="http://foo.com">
https://www.ebay.com
</a>

but "full" will miss base-64 encoded bodies. I don't see a proper solution to this until/unless SpamAssassin includes a real HTML parser.
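The line-at-a-time limitation described above is easy to demonstrate. This Python sketch (an illustration, not SpamAssassin's actual rawbody/full machinery) applies the same pattern per line, as rawbody does, and then to the whole text, as full does:

```python
import re

# Same pattern as the proposed rule, ported to Python.
pattern = re.compile(r'<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:',
                     re.IGNORECASE | re.DOTALL)

# The wrapped-anchor example from the comment above.
body = '<a href="http://foo.com">\nhttps://www.ebay.com\n</a>\n'

# rawbody-style: each line is tested on its own.
per_line = any(pattern.search(line) for line in body.splitlines())
# full-style: the whole body is tested at once.
whole = bool(pattern.search(body))

print(per_line)  # False: no single line contains the whole <a>...https pattern
print(whole)     # True: matching across lines catches the wrapped anchor
```

This is exactly why the rule needs both variants, and why neither alone is sufficient.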
Subject: Re: Suggestion for new rule: Anti-phishing rule.

> Shouldn't that be rawbody, not full?

It really needs to be both. :-( Rawbody parses text one line at a time, so if the text wraps across a line boundary the rule will fail.
Subject: Re: Suggestion for new rule: Anti-phishing rule.

> I don't see a proper solution to
> this until/unless SpamAssassin includes a real HTML parser.

I submitted a bug on this back in 2.63 days that I suspect is still open. The solution is to allow rawbody to process a body and not a line, just like body and full (more or less) can.
Subject: Re: New: Suggestion for new rule: Anti-phishing rule.

On Sun, Apr 10, 2005 at 05:22:21PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I've had good luck catching some phishing spams with this rule:
> full HTTP_CLAIMS_HTTPS /<a[^>]{0,90}http:[^>]{0,90}>[^<]{0,90}https:/is
> This catches phish that do something like this:
> Click here: <a href="http:/1.2.3.4/cgi.bin/scam">https://www.paypal.com</a>

Seems like HTTPS_IP_MISMATCH ... Already in 3.1. :)
Subject: Re: Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Seems like HTTPS_IP_MISMATCH ... Already in 3.1. :)

Looks similar, but unless I misunderstand that rule, it only looks at IP addresses, not domain names. Regards, David.
Subject: Re: Suggestion for new rule: Anti-phishing rule.

On Mon, Apr 11, 2005 at 09:01:46AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Looks similar, but unless I misunderstand that rule, it only looks at
> IP addresses, not domain names.

As I recall, if it wasn't limited to IPs (which catches the sample you provided, BTW), the results weren't very good. I have samples of legit lists that for whatever reason do this type of thing.
(In reply to comment #3)
> but "full" will miss base-64 encoded bodies. I don't see a proper solution to
> this until/unless SpamAssassin includes a real HTML parser.

What's wrong with the current HTML parser? The problem here is trying to implement the rule in the wrong way. It's pretty trivial in an eval, and trying to do it any other way, as mentioned, isn't a very good solution. Anyway, since this rule is already implemented, closing as WFM.

BTW: if anyone's interested in the HTTPS_IP_MISMATCH results, from a recent nightly run:

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  502218   379437   122781    0.756   0.00    0.00  (all messages)
 100.000  75.5523  24.4477    0.756   0.00    0.00  (all messages as %)
   0.079   0.1044   0.0000    1.000   0.54    1.00  HTTPS_IP_MISMATCH
Subject: Re: Suggestion for new rule: Anti-phishing rule.

Doing a rule that looks for:

link: http://.*
text: https://.*

meaning the text is implying greater security than is actually provided. That is what this bug is suggesting, and it might be useful; maybe try it both with and without requiring different hostnames.
Subject: Re: Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Doing a rule that looks for:
>
> link: http://.*
> text: https://.*
>
> Meaning the text is implying greater security than provided.

Yes, exactly. I can't see why this would lead to false positives unless someone has made a (very specific) typo. I don't even check the host name or IP address parts. This won't catch a lot of phish, but judging from what I've seen, it should get about 15-25% of them. I'm not extremely familiar with the innards of SpamAssassin's HTML parser; I'll have to look into that. But this would clearly be better done as an eval rule than a regexp rule. Regards, David.
I don't agree with the WORKSFORME closure, because this rule does something quite different from the one in 3.1.
Subject: Re: Suggestion for new rule: Anti-phishing rule.

On Mon, Apr 11, 2005 at 03:48:18PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I dont agree with the WORKSFORME closure, because this rule does something quite
> different from the one in 3.1.

Well, it's actually not "quite different". It's almost exactly the same thing, really. The only difference is that the current HTTPS_IP_MISMATCH looks specifically for a URI which goes to an IP, and can be for http or https. The rule in question is indiscriminate but limited to http.

Ok, so I did up an eval rule and did some tests. If I make it simply look for URIs which start with "http:" and have anchor text starting with "https:" (based on the last 14 days):

0.080 0.0979 0.0000 1.000 1.00 0.01 T_HTTPS_URI_MISMATCH

If I open it up to have anchor text matching /\bhttps:/ instead:

0.123 0.1502 0.0000 1.000 1.00 0.01 T_HTTPS_URI_MISMATCH

So I could have sworn that this was tested and rejected for S/O reasons, but either I'm on crack or there was something else which made it horrible at the time (I do have a few FPs in the corpus for legit newsletters from earlier this year, FWIW...).

Anyway, I apologize for the premature ticket closure. We really ought to document the stuff we try so that in the future we can just look it up and see whether something was tried before or not. <sigh>

Anyway, committed my eval version, r160982. :)
closing again. :)
BTW, I remember one FP I had -- a PayPal Australia URL, IIRC. Anyway, let's see what the nightly results say. That looks pretty good ;)
Apparently my memory is better than I thought:

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  505697   381380   124317    0.754   0.00    0.00  (all messages)
 100.000  75.4167  24.5833    0.754   0.00    0.00  (all messages as %)
   0.120   0.1453   0.0418    0.776   0.49    0.01  T_HTTPS_URI_MISMATCH
   0.003   0.0039   0.0000    1.000   0.46    1.00  HTTPS_IP_MISMATCH

So it does hit considerably more spam (wow, HTTPS_IP_MISMATCH hits almost nothing...?), but the S/O ratio is quite nasty. All of my FPs are legit: American Express, Bed Bath & Beyond, Universal Studios, Microsoft, etc. I haven't checked with other folks, but I'd imagine it's similar there.
Is your rule checking the direction of the mismatch? If they claim to be using https but are really using http, that would be bad. If they claim to be using http but are really using https, that would be ok (stupid, but ok).
Subject: Re: Suggestion for new rule: Anti-phishing rule.

On Fri, Apr 15, 2005 at 12:35:10PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Is your rule checking the direction of the mismatch? If they claim to be
> using https but are really using http, that would be bad. If they claim to
> be using http but are really using https, that would be ok (stupid, but ok).

Of course. The rule looks for:

<a href="http://www.example.com/">https://www.example.com/</a>

My guess is that the marketing people want to know who's following the links from the email, so you go to their http click-track server, which redirects you to the secure site.
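The direction check discussed here (flag only anchors whose label claims https while the real href is plain http, never the reverse) can be sketched as follows. This is a hedged Python illustration, not SpamAssassin's actual eval code; the class and method names are invented for the example.

```python
from html.parser import HTMLParser

class AnchorScanner(HTMLParser):
    """Count anchors whose visible text claims https but whose href is http."""
    def __init__(self):
        super().__init__()
        self.href = None
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.href = dict(attrs).get('href', '')

    def handle_data(self, data):
        if self.href is not None:
            # Bad direction only: label says https:, link really goes to http:.
            # (Note "https:".startswith("http:") is False, so the reverse
            # direction -- http label, https link -- is never flagged.)
            if self.href.startswith('http:') and data.strip().startswith('https:'):
                self.hits += 1

    def handle_endtag(self, tag):
        if tag == 'a':
            self.href = None

scanner = AnchorScanner()
scanner.feed('<a href="http:/1.2.3.4/scam">https://www.paypal.com</a> '
             '<a href="https://example.com">http://example.com</a>')
print(scanner.hits)  # only the first, https-claiming anchor is counted
```

Only the phishy first anchor registers; the second (https link with an http label) is merely stupid, not dangerous, as the comment above puts it.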
Subject: Re: Suggestion for new rule: Anti-phishing rule.

> ------- Additional Comments From schulz@adi.com 2005-04-15 12:35 -------
> Is your rule checking the direction of the mismatch? If they claim to be
> using https but are really using http, that would be bad. If they claim to
> be using http but are really using https, that would be ok (stupid, but ok).

Yes, I'm amazed there are any FPs with this rule. Wow. I haven't seen any, but then again, the rule has only hit twice for me. Regards, David.
Subject: Re: Suggestion for new rule: Anti-phishing rule.

bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ------- Additional Comments From felicity@kluge.net 2005-04-15 12:46 -------
> My guess is that the marketing people want to know who's following
> the links from the email, so you goto their http click-track server,
> which redirects you to the secure site.

Well, I'm not sure if it's the SpamAssassin team's mandate to try to change behavior, but that is exactly the kind of boneheaded thing that we should discourage anyone from doing, because it hurts security for everyone. I'm probably going to add the rule to our product, but maybe not with a score of 5 (as I use myself). :-) Regards, David.
Subject: Re: [SPAM]07.19 Suggestion for new rule: Anti-phishing rule.

Many of those spams (but not all, of course) also use a numeric IP. Perhaps the other test already does this, but if not, how about something like:

rawbody T__LW_PHISH_2 m'<a\s+[\s\w=\.]*href=\"https?://\d+[^>]+>https://[^\d]'is
Subject: Re: Suggestion for new rule: Anti-phishing rule.

On Sat, Apr 16, 2005 at 09:18:07AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Many of those spams (but not all, of course) also use a numeric ip. Perhaps
> the other test already does this, but if not, how about something like

HTTPS_IP_MISMATCH already looks for the IP ones.
*** Bug 4372 has been marked as a duplicate of this bug. ***
Since we pretty frequently get this idea posted to the users@ list (twice yesterday!), I'll be putting up a plugin to check for this in a minute. The rule is even more horrible now than it was in the past. Hopefully this will put this idea to bed, or allow people to figure out a way to deal with the false positives (by all means, if you come up with a useful way to check for this sans FPs, let us know!). As an example of FPs:

<a href="http://www65.americanexpress.com/clicktrk/Tracking?mid=MESSAGEID&msrc=ENG-ALERTS&url=https://www.americanexpress.com/estatement/?12345">https://www.americanexpress.com/estatement/?12345</a>

<A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>

Anyway, here are the results. Looking for an href of http: and anchor text of https: with mismatching host sections:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.335   0.3586   0.1991    0.643   0.00    0.01  T_HTTPS_HTTP_MISMATCH

Looking for an href of https?: and anchor text of https: with mismatching host sections (I'm not 100% certain how this hit less ham, but it's still a bad hit rate):

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.329   0.3586   0.1593    0.692   0.00    0.01  T_HTTPS_HTTP_MISMATCH

Looking for an href of https?: and anchor text of https: with mismatching domains from the host section:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.302   0.3340   0.1195    0.737   0.00    0.01  T_HTTPS_HTTP_MISMATCH

So unless there's a coding issue, this rule really doesn't work as a spam detection rule.
Created attachment 3416 [details]
plugin implementation

something like:

loadplugin Mail::SpamAssassin::Plugin::TVD
body T_HTTPS_HTTP_MISMATCH eval:check_https_http_mismatch()
Also tested, domain mismatch between https? href and https? anchor text:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    29213     5299    0.846   0.00    0.00  (all messages)
0.00000  84.6459  15.3541    0.846   0.00    0.00  (all messages as %)
  0.898   0.7086   1.9438    0.267   0.00    0.01  T_HTTPS_HTTP_MISMATCH
(In reply to comment #25)
> Created an attachment (id=3416) [edit]
> plugin implementation
>
> something like:
>
> loadplugin Mail::SpamAssassin::Plugin::TVD
> body T_HTTPS_HTTP_MISMATCH eval:check_https_http_mismatch()

Does this work with 3.1.0? I tried to run it, and it complained that it couldn't find the Logger module, which seems to be unnecessary anyway... so I commented that line out. Running it again, I saw:

debug: registering glue method for check_https_http_mismatch (Mail::SpamAssassin::Plugin::TVD=HASH(0x1940350))
Failed to run T_HTTPS_HTTP_MISMATCH SpamAssassin test, skipping:
(Can't locate object method "get_uri_detail_list" via package "Mail::SpamAssassin::PerMsgStatus" at /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Plugin/TVD.pm line 46.
)

What am I missing?
(In reply to comment #27)
> Does this work with 3.1.0?

Absolutely. It won't work with earlier versions, though.

> I tried to run it, and it complained that it couldn't find the Logger module,
> which seems to be unnecessary anyway... so I commented that line out.
>
> Running it again, I saw:
>
> debug: registering glue method for check_https_http_mismatch
> (Mail::SpamAssassin::Plugin::TVD=HASH(0x1940350))
> Failed to run T_HTTPS_HTTP_MISMATCH SpamAssassin test, skipping:
> (Can't locate object method "get_uri_detail_list" via package
> "Mail::SpamAssassin::PerMsgStatus" at
> /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Plugin/TVD.pm line 46.
> )
>
> What am I missing?

You apparently don't have 3.1 installed. Both Logger.pm and get_uri_detail_list() were added in 3.1.0.
A few things. First, some more prodding around came up with a set of rules/plugin that gets decent results for me. I added it to my sandbox for testing in the nightly mass-check.

Second, it turns out the plugin I put up caused me to find a really annoying bug in SpamAssassin, at least in 3.1, possibly 3.2; I haven't done enough debugging yet. In short, put the following above the first blank line in the function, and it'll kluge around the issue for now:

keys %{$uris};

ie:

sub check_https_http_mismatch {
  my ($self, $permsgstatus) = @_;
  my $uris = $permsgstatus->get_uri_detail_list();
  keys %{$uris};

For more information, see bug 4829.
Hmm... What happens if we allow the label to say "http:" but the link is really to "https:" (but not vice versa)? And what if all but the host portion of the domain matches? I.e., www.americanexpress.com and www65.americanexpress.com would be treated as equivalent. We'd ignore the file-path portion of the URL (sigh) for now... Can anyone make these two changes and recompute the HAM/SPAM ratios on a test set of data?
I think Theo did have stats on the first of those cases once. But it might not hurt to do it again. I seem to recall once long ago running that test and getting just an annoying number of FPs from it. Checking http://\d vs https:// produces considerably better results as I recall. I don't recall anyone ever trying the second case. Would be interesting to know how it turns out.
> Hmm... What happens if we allow the label to say "http:" but the link is
> really to "https:" (but not vice versa)?

See comments 17, 18 & 19.
(In reply to Theo Van Dinter from comment #24)
> Hopefully this will put this idea to bed, or allow people to figure out a
> way to deal with the false positives (by all means, if you come up with a
> useful way to check for this sans FPs, let us know!) As an example of FPs:
>
> <a href="http://www65.americanexpress.com/clicktrk/Tracking?mid=MESSAGEID&msrc=ENG-ALERTS&url=https://www.americanexpress.com/estatement/?12345">https://www.americanexpress.com/estatement/?12345</a>
>
> <A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>

Can't do anything about the second one, but I am seeing an annoying amount of phishing that misrepresents the phishing HREF URL with a legitimate URL in the text label. After coming here to find out why this was not already a high-scoring SA rule, I wrote this, which only checks that the domains are the same down to the second level. I think you'd need a second rule if you want to hit on falsified co.uk and such.

full MISLEADING_URL_LABEL m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!([^"/]*\.)?\2["/])([^/]*)}
describe MISLEADING_URL_LABEL An A label seems to be a URL but its SLD.TLD differs from the actual URL in the HREF

I would be very interested in learning what effect this has on a big corpus!
(In reply to Lorens Kockum from comment #33)
> full MISLEADING_URL_LABEL
> m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!([^"/]*\.)?\2["/])([^/]*)}

Except that it should be

full MISLEADING_URL_LABEL m{href="https?://([^"/]*\.)?([^\."/]+\.[^\."/]+)(/[^"]*)?"[^>]*>https?://(?!([^"/]*\.)?\2[</])}

Sigh. Mondays.
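The intent behind MISLEADING_URL_LABEL (compare only the SLD.TLD of the href host against the host in the visible link text) can be sketched independently of the regex. This is a Python illustration of the idea, not a port of the rule; the helper names are invented, and like the rule itself it will misfire on two-part public suffixes such as co.uk.

```python
import re

def sld_tld(url):
    """Return the last two labels (SLD.TLD) of a URL's host, or None."""
    host = re.match(r'https?://([^/"<>\s]+)', url)
    if not host:
        return None
    return '.'.join(host.group(1).lower().split('.')[-2:])

def misleading(href, label):
    """True if both sides look like URLs but their SLD.TLD portions differ."""
    h, l = sld_tld(href), sld_tld(label)
    return h is not None and l is not None and h != l

# A phish-style mismatch is flagged...
print(misleading('http://phish.example.net/x', 'https://www.paypal.com'))
# ...while the American Express click-track FP from comment #24 is not,
# since both hosts reduce to americanexpress.com.
print(misleading('http://www65.americanexpress.com/clicktrk',
                 'https://www.americanexpress.com/estatement'))
```

Comparing at the SLD.TLD level is what lets www65.americanexpress.com and www.americanexpress.com be treated as the same sender, addressing the host-portion equivalence asked about in comment #30.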