Nutch / NUTCH-693

Add configurable option for treating nofollow behaviour.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      For my purposes I'd like to follow links even if they're marked nofollow. Ideally I'd like to follow them, but not pass the link juice between them.

      I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page.
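
      As a rough illustration of the behaviour such an option would toggle, here is a minimal sketch. It is not the attached patch and does not use Nutch's parser API: the class, the Link type, and the ignoreNofollow parameter are all hypothetical names, and the flag is only assumed to mirror parser.html.outlinks.ignore_nofollow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; none of these names come from Nutch or from the attached patch. */
final class NofollowOutlinkSketch {

    /** A link found on a page: target URL plus the raw rel attribute value (may be null). */
    static final class Link {
        final String url;
        final String rel;

        Link(String url, String rel) {
            this.url = url;
            this.rel = rel;
        }
    }

    /** True if rel contains the token "nofollow" (whitespace-separated, case-insensitive). */
    static boolean isNofollow(Link link) {
        if (link.rel == null) {
            return false;
        }
        for (String token : link.rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the URLs to keep as outlinks.
     * ignoreNofollow is assumed to stand in for the proposed configuration flag:
     * false drops nofollow links (the strict behaviour), true keeps every link.
     */
    static List<String> selectOutlinks(List<Link> links, boolean ignoreNofollow) {
        List<String> outlinks = new ArrayList<>();
        for (Link link : links) {
            if (ignoreNofollow || !isNofollow(link)) {
                outlinks.add(link.url);
            }
        }
        return outlinks;
    }
}
```

      With the flag off, a link carrying rel="external nofollow" is dropped from the result; with it on, the link is kept. The sketch does not address the second half of the request, following a link without letting it contribute to scoring.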

        Issue Links

          Activity

          Andrew McCall created issue -
          Andrew McCall added a comment -

          Here is the patch.

          Andrew McCall made changes -
          Field Original Value New Value
          Attachment nutch.nofollow.patch [ 12400444 ]
          Otis Gospodnetic made changes -
          Assignee Otis Gospodnetic [ otis ]
          Otis Gospodnetic added a comment -

          I think I see some formatting that's a bit off (looks off in the patch itself at least), but more importantly, is everyone OK with allowing this behaviour?

          +1 from me – let the operators decide.

          Andrzej Bialecki added a comment -

          This patch is controversial in the sense that a) Nutch strives to adhere to Internet standards and netiquette, which say that robots should obey nofollow, and b) most Nutch users want a well-behaved robot. You are free of course to modify the source as you did. Therefore I think that this functionality is not applicable to the majority of Nutch users, and I vote -1 on including it in Nutch.

          Andrew McCall added a comment -

          http://en.wikipedia.org/wiki/Nofollow

          I don't think there is really any consensus on this standard, to be honest. Most search engines don't index nofollow links per se, but they do follow them for crawling. Even Google, who first proposed nofollow, sometimes actually do follow, according to some tests linked in the Wikipedia article. The results show that if the link is already in the index (e.g. has been followed elsewhere) then it does get followed and indexed.

          The nofollow is really just a keyword to point out that the link isn't being endorsed by the author; it's more a content guideline than a strict order for robots to obey. So I disagree that you're breaking standards or creating a robot that's not well behaved by ignoring it.

          I would have liked to have done a bit more with this so that I could have respected nofollows but injected the URL as a brand new seed URL, but other commitments took over and I never got around to it. Since the ideal nofollow behaviour is somewhere between ignoring them and not ignoring them, I figured the option to ignore them was a good start and submitted the patch, but I'm not precious about it.

          Andrzej Bialecki added a comment -

          Thanks for the pointer to the article. Indeed, the issue is muddy at best. So far Nutch has adhered to a strict interpretation, where links with this attribute are deleted from page outlinks immediately (so they are not only not followed but also don't affect out-degree metrics). If there is general agreement in the Nutch community towards relaxing this behavior we can further develop this patch; at the moment I don't see such support. Consequently, I propose to discuss it and in the meantime to move this issue to a later release.

          Andrzej Bialecki made changes -
          Assignee Otis Gospodnetic [ otis ]
          Lewis John McGibbney made changes -
          Link This issue blocks NUTCH-795 [ NUTCH-795 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 2.2 [ 12323285 ]
          Markus Jelsma added a comment -

          Vote for `won't fix`. We also don't implement an ignore.robotstxt option for the above reasons.

          Lewis John McGibbney added a comment -

          +1 Markus. Please close off when you can.

          Markus Jelsma made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 2.2 [ 12323285 ]
          Resolution Won't Fix [ 2 ]
          Santiago M. Mola added a comment -

          This is completely different from a hypothetical "ignore.robots.txt" option. "robots.txt" is controlled by the site owner, and it tells us explicitly not to access/index some parts of the website. rel=nofollow is usually controlled by third parties and it's not supposed to restrict crawling. It's just for preventing the link from adding up in link scoring algorithms (or, as Andrew put it, non-endorsement).

          But what is more important: What happens when your seeds use rel=nofollow? Then Nutch cannot crawl anything. For example, most MediaWiki setups include rel=nofollow for all external links. That means that, if you need to use a MediaWiki-based site as a seed, Nutch will not be able to extract links for further crawling.

          Gavin made changes -
          Link This issue blocks NUTCH-795 [ NUTCH-795 ]
          Gavin made changes -
          Link This issue is depended upon by NUTCH-795 [ NUTCH-795 ]
          Lewis John McGibbney added a comment -

          Hi Santiago, if you would like to update the patch then please do so. Patch against trunk and/or 2.x HEAD and we will see where this goes.


            People

            • Assignee:
              Unassigned
              Reporter:
              Andrew McCall
            • Votes:
              2
              Watchers:
              3