Bug 4150 - Provide some form of associative list of URI and anchor text
Status: RESOLVED FIXED
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P3 enhancement
Target Milestone: 3.1.0
Assigned To: SpamAssassin Developer Mailing List
Reported: 2005-02-23 21:24 UTC by Loren Wilton
Modified: 2005-02-24 02:25 UTC
Description Loren Wilton 2005-02-23 21:24:31 UTC
It would be helpful to be able to write rules (or evals, or whatever) that
can compare a URI with its associated anchor text.  For instance, it is
common in phishing spam to see a URI of http://<dotquad> with an associated
anchor of https://some.secure.site/Login.  Simply checking for http: in the
URI against https: in the anchor is a pretty good spam flag.  Other
interesting tests can be performed by comparing the URI against the anchor.
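
As a rough illustration of the http:/https: comparison described above, here is a minimal Python sketch. It is not SpamAssassin code: the function name scheme_mismatch and the sample address 192.0.2.10 (standing in for <dotquad>) are assumptions made for this example only.

```python
from urllib.parse import urlsplit


def scheme_mismatch(href, anchor_text):
    """Return True when the anchor text looks like an http(s) URL
    whose scheme differs from the scheme of the real link target.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    shown = urlsplit(anchor_text.strip())
    actual = urlsplit(href.strip())
    # Only compare when the visible text itself parses as an http(s) URL;
    # ordinary anchor text such as "Click here" has no scheme.
    if shown.scheme not in ("http", "https"):
        return False
    return shown.scheme != actual.scheme


# The phish pattern from the report: plain http target, https anchor text.
print(scheme_mismatch("http://192.0.2.10/", "https://some.secure.site/Login"))
```

A real rule would probably also want to compare hostnames, but the scheme check alone covers the case described here.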

Currently this can (sometimes) be accomplished using rawbody and full tests.  
But having a dedicated method of comparing a known uri to the anchor text (with 
choice of raw or rendered) could likely improve both accuracy and efficiency.

According to Theo in bug 3976:

Yeah, we make both available separately right now, but there's no
correlation between the two pieces of data.  However, this suggestion
would be better serviced via another RFE ticket since it's not related
to the current one.
Comment 1 Theo Van Dinter 2005-02-23 21:43:31 UTC
I think this would be pretty trivial to implement for plugins/eval code.
Unless I'm missing something, it's about 3 lines of perl in HTML.pm. ;)
Comment 2 Theo Van Dinter 2005-02-24 11:25:12 UTC
ok, code committed.  r155227

HTML metadata is kept per message.  "anchor" is the list of anchor texts, one
entry per link.  "uri_anchor_index" is now a hash keyed by URI whose value is
an array of indexes into "anchor".  For example:

<a href="http://foo.com/">foo</a>
<a href="http://bar.com/">foo</a>
<a href="http://bar.com/">bar</a>

"anchor" is:

0: foo
1: foo
2: bar

uri_anchor_index is:

http://foo.com/ => [ 0 ],
http://bar.com/ => [ 1, 2 ]
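
The structure above can be mimicked in a few lines of Python to show how the two pieces fit together. This is only a sketch of the data layout described in this comment, not the Perl committed in r155227; the helper names build_anchor_index and anchors_for are hypothetical, and the input is assumed to be already-extracted (href, anchor text) pairs.

```python
def build_anchor_index(links):
    """Build the two structures from (href, anchor_text) pairs.

    Hypothetical helper: returns (anchor, uri_anchor_index), where
    anchor is a list of anchor texts in document order and
    uri_anchor_index maps each URI to the positions of its anchors.
    """
    anchor = []
    uri_anchor_index = {}
    for href, text in links:
        # The next free slot in "anchor" is the index for this link.
        uri_anchor_index.setdefault(href, []).append(len(anchor))
        anchor.append(text)
    return anchor, uri_anchor_index


def anchors_for(uri, anchor, uri_anchor_index):
    """Look up every anchor text that was used for the given URI."""
    return [anchor[i] for i in uri_anchor_index.get(uri, [])]


# The three links from the example above.
links = [
    ("http://foo.com/", "foo"),
    ("http://bar.com/", "foo"),
    ("http://bar.com/", "bar"),
]
anchor, uri_anchor_index = build_anchor_index(links)
print(anchors_for("http://bar.com/", anchor, uri_anchor_index))
```

Keeping indexes rather than copies of the text means the same anchor list can serve both the existing per-message metadata and the per-URI lookup.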