Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
To check whether an outlink in an <a> tag can be followed, Nutch checks whether the value of its rel attribute is the exact string "nofollow".
However, the rel attribute can contain a list of link types, all of which should be respected.
So Nutch correctly does not follow a link like:
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
but wrongly follows:
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
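A minimal sketch of the intended check, assuming the rel value is treated as a whitespace-separated list of link types compared case-insensitively. The class RelTokens and the method containsNofollow are hypothetical names used for illustration only; this is not Nutch's actual parser code.

```java
// Hypothetical helper for illustration; not taken from Nutch's parsers.
class RelTokens {

    // Returns true if any whitespace-separated token in the rel value
    // equals "nofollow" (compared case-insensitively).
    static boolean containsNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two rel values from the examples above, plus one without nofollow.
        System.out.println(containsNofollow("nofollow"));
        System.out.println(containsNofollow("nofollow noreferrer"));
        System.out.println(containsNofollow("noreferrer"));
    }
}
```

With a token-based check like this, both example links above would be treated as nofollow, whereas an exact string comparison only catches the first.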
Because of the code duplication in Nutch's HTML parsers, this should be fixed in two places: