Bug 4150

Summary: Provide some form of associative list of URI and anchor text
Product: Spamassassin
Reporter: Loren Wilton <lwilton>
Component: spamassassin
Assignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P3    
Version: SVN Trunk (Latest Devel Version)   
Target Milestone: 3.1.0   
Hardware: Other   
OS: other   
Whiteboard:

Description Loren Wilton 2005-02-23 21:24:31 UTC
It would be helpful to be able to write rules (or evals, or whatever) that 
could compare the URI and the associated anchor text.  For instance, it is 
common in a phish spam to see a uri of http://<dotquad> and an associated 
anchor of https://some.secure.site/Login.  Simply comparing the http: to https: 
is a pretty good spam flag.  There are other interesting tests that can be 
performed by rubbing the uri against the anchor.
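
As an illustration of the http-versus-https comparison described above, here is a minimal Python sketch. It is not SpamAssassin code; the function name `scheme_mismatch` and the sample address are hypothetical, and it assumes the href and its anchor text have already been paired up.

```python
from urllib.parse import urlsplit


def scheme_mismatch(href: str, anchor_text: str) -> bool:
    """Return True when the anchor text looks like a web URL whose scheme
    differs from the scheme of the href it is attached to."""
    try:
        href_scheme = urlsplit(href.strip()).scheme.lower()
        text_scheme = urlsplit(anchor_text.strip()).scheme.lower()
    except ValueError:
        # Malformed URL (for example a broken IPv6 literal): no verdict.
        return False
    web_schemes = {"http", "https"}
    # Only compare when both sides look like web URLs; ordinary anchor
    # text such as "click here" has no scheme and is skipped.
    if href_scheme not in web_schemes or text_scheme not in web_schemes:
        return False
    return href_scheme != text_scheme


# The phish pattern from the description: plain-http dotted quad behind
# anchor text that claims to be an https login page.
print(scheme_mismatch("http://192.0.2.10/", "https://some.secure.site/Login"))
```

A real rule would probably also compare host names, but the scheme check alone covers the case mentioned here.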

Currently this can (sometimes) be accomplished using rawbody and full tests.  
But having a dedicated method of comparing a known uri to the anchor text (with 
choice of raw or rendered) could likely improve both accuracy and efficiency.

According to Theo in bug 3976:

Yeah, we make both available separately right now, but there's no
correlation between the two pieces of data.  However, this suggestion
would be better serviced via another RFE ticket since it's not related
to the current one.
Comment 1 Theo Van Dinter 2005-02-23 21:43:31 UTC
I think this would be pretty trivial to implement for plugins/eval code.
Unless I'm missing something, it's about 3 lines of perl in HTML.pm. ;)
Comment 2 Theo Van Dinter 2005-02-24 11:25:12 UTC
ok, code committed.  r155227

There's HTML metadata per message: "anchor" is the list of anchor texts, one
entry per link.  "uri_anchor_index" is now a hash keyed by URI whose value is
an array of indexes into "anchor".  i.e.:

<a href="http://foo.com/">foo</a>
<a href="http://bar.com/">foo</a>
<a href="http://bar.com/">bar</a>

"anchor" is:

0: foo
1: foo
2: bar

uri_anchor_index is:

http://foo.com/ => [ 0 ],
http://bar.com/ => [ 1, 2 ]
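
To make the relationship between the two structures concrete, here is a rough Python sketch that builds them from the same three links. It only mirrors the data layout described above; it is not the code committed to HTML.pm, and the names `AnchorIndexer` and `build_anchor_index` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorIndexer(HTMLParser):
    """Collect anchor texts and a URI -> [anchor indexes] mapping."""

    def __init__(self):
        super().__init__()
        self.anchor = []            # anchor texts, in document order
        self.uri_anchor_index = {}  # href -> list of indexes into anchor
        self._href = None           # href of the <a> currently open
        self._text = []             # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            index = len(self.anchor)
            self.anchor.append("".join(self._text).strip())
            self.uri_anchor_index.setdefault(self._href, []).append(index)
            self._href = None


def build_anchor_index(html):
    """Return (anchor, uri_anchor_index) for an HTML fragment."""
    indexer = AnchorIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.anchor, indexer.uri_anchor_index


sample = """
<a href="http://foo.com/">foo</a>
<a href="http://bar.com/">foo</a>
<a href="http://bar.com/">bar</a>
"""
anchor, uri_anchor_index = build_anchor_index(sample)
print(anchor)
print(uri_anchor_index)
```

With this layout, looking up the anchor texts for a URI means mapping each index in `uri_anchor_index[uri]` back into `anchor`, so `http://bar.com/` yields both "foo" and "bar".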