Nutch / NUTCH-710

Support for rel="canonical" attribute

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      There is a new rel="canonical" attribute which is
      now supported by Google, Yahoo, and Live:

      http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

      Adding support for this attribute will potentially reduce the number of URLs crawled and indexed and cut down on duplicate page content.
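
      For reference, the element is placed in the <head> of the duplicate page and points at the preferred URL (example.com here is purely illustrative):

        <link rel="canonical" href="http://www.example.com/product"/>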

      1. canonical.patch
        3 kB
        zm
      2. NUTCH-710.patch
        52 kB
        Sertac TURKEL


          Activity

          Julien Nioche added a comment -

          Great idea. Won't be included in 1.1 though, so moving Fix Version to unknown.

          Julien Nioche added a comment -

          Shall we treat pages with a canonical metatag as a form of redirection? We know that there is no point indexing the page and that we'd be better off making sure that the page it refers to is fetched, parsed and indexed. That will not prevent these entries from being put in the crawlDB, but it should limit the size of the index and, more importantly, improve its quality.

          Alternatively we could keep the content of the page for indexing and rely on de-duplication later. This would allow something to be returned in the search even if the target of the canonical tag has not been indexed yet (or if it does not exist).

          The first option would be easier to implement. The second option would require some adaptation of the DeleteDuplicates and SolrDeleteDuplicates classes.

          Any thoughts on this?

          Julien Nioche added a comment -

          As suggested previously we could either treat canonicals as redirections or handle them during deduplication. Neither is a satisfactory solution.

          Redirection: we want to index the document if/when the target of the canonical is not available for indexing. We also want to follow the outlinks.
          Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex due to the fact that we need to follow redirections.

          We probably need a third approach: prefilter by going through the crawlDB and detect URLs which have a canonical target already indexed or ready to be indexed. We need to follow up to X levels of redirection, e.g. doc A is marked as having canonical representation doc B, doc B redirects to doc C, etc. If the end of the redirection chain exists and is valid, then mark A as a duplicate of C (intermediate redirects will not get indexed anyway).

          As we don't know if it has been indexed yet, we would give it a special marker (e.g. status_duplicate) in the crawlDB. Then:
          -> if the indexer comes across such an entry: skip it
          -> make it so that *deleteDuplicates can take a list of URLs with status_duplicate as an additional source of input, OR have a custom resource that deletes such entries in SOLR or Lucene indices

          The implementation would be as follows :

          Go through all redirections and generate all redirection chains e.g.

          A -> B
          B -> C
          D -> C

          where C is an indexable document (i.e. it has been fetched and parsed; it may already have been indexed)

          will yield

          A -> C
          B -> C
          D -> C

          but also

          C -> C

          Once we have all possible redirections, go through the crawlDB in search of canonicals. If the target of a canonical is the source of a valid alias (e.g. A - B - C - D), mark it as 'status:duplicate'.

          This design implies generating quite a few intermediate structures + scanning the whole crawlDB twice (once for the aliases, then for the canonicals) + rewriting the whole crawlDB to mark some of the entries as duplicates.

          This would be much easier to do when we have Nutch2/HBase: we could simply follow the redirects from the initial URL having a canonical tag instead of generating these intermediate structures. We could then modify the entries one by one instead of regenerating the whole crawlDB.

          WDYT?
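
          To make the chain resolution concrete, here is a minimal in-memory sketch (the class and helper names are hypothetical; the real thing would run as MapReduce passes over the crawlDB as described above, with the redirect and canonical maps coming from crawlDB entries rather than in-memory maps):

          import java.util.HashMap;
          import java.util.Map;
          import java.util.Set;

          public class CanonicalChainResolver {

            /** Follows redirects (source URL -> target URL) up to maxHops levels. */
            static String resolveChain(String url, Map<String, String> redirects, int maxHops) {
              String current = url;
              for (int hop = 0; hop <= maxHops; hop++) {
                String next = redirects.get(current);
                if (next == null) {
                  return current;          // end of chain reached: a real document
                }
                current = next;
              }
              return null;                 // too many hops: treat the chain as invalid
            }

            /** canonicals: page URL -> declared canonical target.
             *  Returns page URL -> resolved end of chain, for pages to mark status_duplicate. */
            static Map<String, String> markDuplicates(Map<String, String> canonicals,
                                                      Map<String, String> redirects,
                                                      Set<String> indexable) {
              Map<String, String> duplicates = new HashMap<String, String>();
              for (Map.Entry<String, String> e : canonicals.entrySet()) {
                String end = resolveChain(e.getValue(), redirects, 5); // X = 5 hops, arbitrary
                // mark A as a duplicate of C only if the chain ends at a fetched/parsed
                // document and the canonical is not simply self-referential
                if (end != null && indexable.contains(end) && !end.equals(e.getKey())) {
                  duplicates.put(e.getKey(), end);
                }
              }
              return duplicates;
            }
          }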

          Praveen Jayaraman added a comment -

          Hello -

          I am having two problems with Nutch and am hoping that you can help me out.

          a) Crawling does not use link rel="canonical" to index the links.

          b) Crawling ignores robots.txt.

          I am currently using Nutch 1.1 for crawling my local company site.

          I have tried various settings from the web forums but am unable to get the above
          issues resolved.

          Can you tell me how to enable these while crawling?

          Appreciate your answer

          Thanks in advance,
          Regards
          Praveen.

          Markus Jelsma added a comment -

          Putting a useful issue back on the radar. Fix for 2.0?

          Lewis John McGibbney added a comment -

          Set and Classify

          Iwan Luijks added a comment -

          It seems the Fix Version of this issue keeps getting higher, while it was created back in 2009. Can we say 1.6/2.2 is final? I don't think anyone disagrees when I say the feature would definitely be useful.

          Julien Nioche added a comment -

          Iwan : sure, feel free to send a patch if you want to help it happen

          Iwan Luijks added a comment - edited

          Currently looking into creating a patch which should handle the different canonical features; will post again when I have a stable working version. For others: Google Webmaster Central's page about canonical pages and links has been revised and more information added: http://support.google.com/webmasters/bin/answer.py?hl=nl&answer=139394.

          zm added a comment -

          I have implemented non-canonical page detection by modifying the parse-html plugin, though I am not that familiar with the Nutch architecture and not sure if my implementation is in line with it. What I did is add a utility method boolean isCanonical(Node root, String baseUrl) which returns the status of the currently parsed HTML page: true if it is the proper page, false if it is non-canonical. Then in the parse-html plugin's HtmlParser.getParse I call isCanonical and return from the method with a ParseStatus:

          if (!utils.isCanonical(root, baseUrl))
            return ParseStatusUtils.getEmptyParse(9999, "Non canonical page", getConf());

          Is this the right way to do it (of course this needs to be made configurable)? If someone more familiar with Nutch would confirm or suggest a more proper way, I'd submit a patch.
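
          For illustration, a minimal standalone sketch of what such an isCanonical utility could look like (this is not the attached patch; a complete version would also resolve relative hrefs against baseUrl and normalize URLs before comparing):

          import org.w3c.dom.Node;

          public class CanonicalUtils {

            /** Recursive DOM walk; works on a Document, DocumentFragment or Element root. */
            static String findCanonicalHref(Node node) {
              if (node.getNodeType() == Node.ELEMENT_NODE
                  && "link".equalsIgnoreCase(node.getNodeName())) {
                Node rel = node.getAttributes().getNamedItem("rel");
                if (rel != null && "canonical".equalsIgnoreCase(rel.getNodeValue())) {
                  Node href = node.getAttributes().getNamedItem("href");
                  return href == null ? null : href.getNodeValue();
                }
              }
              for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                String found = findCanonicalHref(c);
                if (found != null) return found;
              }
              return null;
            }

            /** True if the page declares no canonical link, or declares itself as canonical. */
            static boolean isCanonical(Node root, String baseUrl) {
              String href = findCanonicalHref(root);
              return href == null || href.length() == 0 || href.equals(baseUrl);
            }
          }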

          Lewis John McGibbney added a comment -

          If you could submit a working patch for this issue we can get cracking on testing and passing comments. Something is better than nothing. Thank you.

          zm added a comment - edited

          Attaching the patch; it works with the 2.x branch (1.x has a different API). See the comments in the code and my previous message for details on how it works and the issues to be solved.

          Joshua Norris added a comment -

          This feature would be awesome. Noticed it got put on 1.8; is this actually coming out with 1.8?

          Julien Nioche added a comment -

          Nope. The version tag is more of a reminder that we'd like to push it to that release, not a firm commitment that it will be so. If an issue is marked as resolved then you can be sure it will be part of said release though.

          Sertac TURKEL added a comment -

          Hi Julien Nioche, Lewis John McGibbney, I want to work on this issue for the 2.x branch. What is the latest decision about the issue?

          Lewis John McGibbney added a comment -

          Hi Sertac TURKEL, by the looks of the attached patch, it possibly needs to be extended to also cover the parse-tika plugin, as well as unit tests for both plugins. That would probably be it, I suppose. We can then port it to the 1.x API. If you were able to take this on it would be great.

          Sertac TURKEL added a comment -

          Hi Lewis John McGibbney,

          I prepared a patch file to solve this issue for the 2.x branch. The patch also covers test cases for the tika parser and html parser plugins. Could you review my patch file?
          Sebastian Nagel added a comment -

          Thanks, Sertac TURKEL! My comments:

          • every page containing a canonical link is now rejected. That's a rather hard decision. It should be configurable whether pages containing correct (non-empty, not self-referential, etc.) canonical links
            1. are unconditionally rejected
            2. are removed later only if the target is indexed. It's close to deduplication, and it's what canonical links are intended for: giving webmasters a chance to support and influence deduplication.
            3. are only recorded (as outlinks and/or as indexed fields)
              This point is the most challenging one: you need to take care of all the nasty situations "in the wild", e.g. a canonical link pointing to a redirect which leads you back to the current page, etc. It's required to "resolve" chains of canonical links in combination with redirects; see Julien's comment and 1.
          • is it really necessary to handle canonical links explicitly in DbUpdateMapper and mark them as injected? Couldn't this be done by simply adding them as outlinks? By default, links in "link" elements are added as outlinks, cf. parser.html.outlinks.ignore_tags. Of course, canonical links should be added even if "link" elements are ignored.
          • extraction of canonical links: at least the following points are missing: relative URLs, and canonical links inside HTTP headers (required for anything which is not HTML); a rough sketch of both cases follows at the end of this comment. I'll try to support you on this point because there's already some work done.
          • keep names in parallel?
            src/plugin/parse-html/.../TestDOMContentUtils.java
            src/plugin/parse-tika/.../DOMContentUtilsTest.java
            

          ... and some useful references:
          http://en.wikipedia.org/wiki/Canonical_link_element
          http://tools.ietf.org/html/rfc6596
          https://support.google.com/webmasters/answer/139066
          http://www.mattcutts.com/blog/rel-canonical-html-head/
          http://googlewebmastercentral.blogspot.de/2011/06/supporting-relcanonical-http-headers.html
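
          As a rough starting point for the two missing cases above (illustrative only, not part of the attached patch; the class name is hypothetical and the Link header form follows RFC 6596 / RFC 5988):

          import java.net.MalformedURLException;
          import java.net.URL;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class CanonicalExtraction {

            // e.g.  Link: <http://www.example.com/page>; rel="canonical"
            private static final Pattern LINK_HEADER =
                Pattern.compile("<([^>]+)>\\s*;\\s*rel\\s*=\\s*\"?canonical\"?",
                                Pattern.CASE_INSENSITIVE);

            /** Extracts the canonical target from an HTTP Link header value, or null. */
            static String fromLinkHeader(String headerValue) {
              if (headerValue == null) return null;
              Matcher m = LINK_HEADER.matcher(headerValue);
              return m.find() ? m.group(1) : null;
            }

            /** Resolves a possibly relative canonical href against the page URL. */
            static String resolve(String baseUrl, String href) {
              try {
                return new URL(new URL(baseUrl), href).toString();
              } catch (MalformedURLException e) {
                return null;   // invalid canonical target: ignore it
              }
            }
          }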


            People

            • Assignee: Unassigned
            • Reporter: Frank McCown
            • Votes: 6
            • Watchers: 9
