Nutch / NUTCH-2634

Some links marked as "nofollow" are followed anyway.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 1.20
    • None
    • None

    Description

      To decide whether an outlink in an <a> tag can be followed, Nutch checks whether the value of its rel attribute is the exact string "nofollow".
      However, the rel attribute can contain a space-separated list of link types, all of which should be respected.

      So Nutch correctly does not follow a link like:

      <a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
      

      but wrongly follows:

      <a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
      

      Because the code is duplicated across Nutch's HTML parsers, this should be fixed in two places:

      1. parse/html/DOMContentUtils.java
      2. parse/tika/DOMContentUtils.java
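
A token-based check of the kind described above might look like the following. This is only a minimal sketch, not the actual Nutch patch: the class name `RelNoFollow` and the method `hasNoFollow` are hypothetical and do not come from either DOMContentUtils.java. It assumes the rel value is split on whitespace and that link types are compared case-insensitively.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Nutch.
final class RelNoFollow {

    /** Returns true if any link type in the rel attribute value is "nofollow". */
    static boolean hasNoFollow(String rel) {
        if (rel == null) {
            return false; // no rel attribute: nothing forbids following the link
        }
        // Split the value into whitespace-separated link types and
        // compare each one to "nofollow", ignoring case.
        return Arrays.stream(rel.trim().split("\\s+"))
                .anyMatch(token -> token.equalsIgnoreCase("nofollow"));
    }

    public static void main(String[] args) {
        System.out.println(hasNoFollow("nofollow"));            // single link type
        System.out.println(hasNoFollow("nofollow noreferrer")); // the case from this report
        System.out.println(hasNoFollow("noreferrer"));          // no nofollow token
    }
}
```

With a check like this, both example links above would be treated as nofollow, while a link whose rel is only "noreferrer" would still be followed.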

            People

              Assignee: Unassigned
              Reporter: Gerard Bouchar (gbouchar)
              Votes: 0
              Watchers: 6
