Nutch / NUTCH-1419

parsechecker and indexchecker to report protocol status

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: nutchgora, 1.6
    • Fix Version/s: 1.7, 2.2
    • Component/s: indexer, parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Parsechecker and indexchecker should report the protocol status when the fetch was not successful (status other than 200/ok).

      In case of a redirect, the protocol status contains the URL the redirect points to. Usually, this target URL should be checked instead of the original one, which is not indexed. The content of a redirect response is of little use (and often empty):

      % nutch indexchecker http://lucene.apache.org/nutch/
      fetching: http://lucene.apache.org/nutch/
      parsing: http://lucene.apache.org/nutch/
      contentType: text/html
      content :       301 Moved Permanently Moved Permanently The document has moved here . Apache/2.4.1 (Unix) OpenSSL/1.
      title : 301 Moved Permanently
      host :  lucene.apache.org
      tstamp :        Tue Jul 03 13:27:32 CEST 2012
      url :   http://lucene.apache.org/nutch/
      
      1. NUTCH-1419-1.patch
        2 kB
        Sebastian Nagel
      2. NUTCH-1419-2.x.patch
        2 kB
        Lewis John McGibbney
      3. NUTCH-1419-2.x-v2.patch
        4 kB
        Lewis John McGibbney
      4. NUTCH-1419-2.x-v2.patch
        4 kB
        Lewis John McGibbney
      5. NUTCH-1419-trunk.patch
        2 kB
        Lewis John McGibbney

        Activity

        Hudson added a comment -

        Integrated in Nutch-trunk #2146 (See https://builds.apache.org/job/Nutch-trunk/2146/)
        NUTCH-1419 parsechecker and indexchecker to report protocol status (Revision 1461276)

        Result = SUCCESS
        lewismc : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1461276
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
        • /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
        Hudson added a comment -

        Integrated in Nutch-nutchgora #546 (See https://builds.apache.org/job/Nutch-nutchgora/546/)
        NUTCH-1419 parsechecker and indexchecker to report protocol status (Revision 1461274)

        Result = SUCCESS
        lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1461274
        Files :

        • /nutch/branches/2.x/CHANGES.txt
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserChecker.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/storage/ProtocolStatus.java
        Sebastian Nagel added a comment -

        Thanks Lewis!

        Lewis John McGibbney added a comment -

        Committed at revision 1461274 in the 2.x branch.
        Committed at revision 1461276 in trunk.
        Thank you, Sebastian, for the patches and reviews.
        Sebastian Nagel, please resolve when you can.

        Lewis John McGibbney added a comment -

        and the imports so that it compiles

        Lewis John McGibbney added a comment -

        Updated patch for trunk which takes on Seb's comments and also implements correct logging mechanisms.

        Lewis John McGibbney added a comment -

        Same as before. I am +1 for this issue to be implemented in both the trunk and 2.x branches. Please say whether you can commit or not, Seb. Thank you.

        Sebastian Nagel added a comment -

        Hi Lewis,

        +1 for NUTCH-1419-trunk.patch (parsechecker and indexchecker).

        For NUTCH-1419-2.x.patch (parsechecker only): the error message

        2013-03-25 23:29:40,000 ERROR parse.ParserChecker - Fetch failed with protocol status: org.apache.nutch.storage.ProtocolStatus@1b7d0 {
          "code":"12"
          "args":"[http://www.apachecon.eu/]"
          "lastModified":"0"
        }
        

        could be improved using ProtocolStatusUtils.getName and .getMessage, cf. the patch for indexchecker in NUTCH-1038. A "moved" or "moved(12)" is more informative.
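
        The more informative message suggested here could look roughly like the sketch below. It is a self-contained illustration, not the Nutch API: the `Status` class and the `getName` and `describe` helpers are hypothetical stand-ins. The mapping of code 12 to "moved" comes from the log output above; the mapping of code 1 to "success" is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class StatusFormatSketch {

    // Hypothetical stand-in for a protocol status; NOT the Nutch class.
    static final class Status {
        final int code;
        final List<String> args;

        Status(int code, List<String> args) {
            this.code = code;
            this.args = args;
        }
    }

    // Maps a numeric code to a readable name.
    // 12 -> "moved" is taken from the log above; 1 -> "success" is an assumption.
    static String getName(Status s) {
        switch (s.code) {
            case 1:  return "success";
            case 12: return "moved";
            default: return "unknown";
        }
    }

    // Builds a short description such as "moved(12), args: [http://...]".
    static String describe(Status s) {
        String head = getName(s) + "(" + s.code + ")";
        return s.args.isEmpty() ? head : head + ", args: " + s.args;
    }

    public static void main(String[] args) {
        Status moved = new Status(12, Arrays.asList("http://www.apachecon.eu/"));
        System.out.println("Fetch failed with protocol status: " + describe(moved));
    }
}
```

        Compared with the default object dump, the name and code appear first and the redirect target stays visible in the arguments.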

        Lewis John McGibbney added a comment -

        Updated patch for trunk which accommodates the changes made to the codebase since Seb initially uploaded his patch.
        Also uploaded a patch for 2.x with the addition of an .isSuccess method to keep consistency with trunk.
        Please review and comment.
        Thank you.

        Lewis John McGibbney added a comment -

        +1

        Markus Jelsma added a comment -

        +1

        Sebastian Nagel added a comment -

        Simple patch: in case of a protocol status other than 200 (success):

        1. report the protocol status
        2. exit (those documents are not parsed and indexed when crawling, so parsechecker and indexchecker should behave like an "ordinary" crawl)
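
        The two steps above can be sketched as a small control flow. This is a minimal illustration under stated assumptions, not the committed patch: `FetchResult` and `run` are hypothetical names, the log wording is illustrative, and the exit code 1 for a failed fetch is an assumption.

```java
public class CheckerFlowSketch {

    // Hypothetical fetch result; in the real checkers this would come from a protocol plugin.
    static final class FetchResult {
        final boolean success;
        final String status; // readable status, e.g. "moved(12)"

        FetchResult(boolean success, String status) {
            this.success = success;
            this.status = status;
        }
    }

    // Returns an exit code: 0 if the document went on to parsing,
    // 1 (an assumed value) if the fetch was not successful.
    static int run(String url, FetchResult result, StringBuilder log) {
        log.append("fetching: ").append(url).append('\n');
        if (!result.success) {
            // Step 1: report the protocol status.
            log.append("Fetch failed with protocol status: ")
               .append(result.status).append('\n');
            // Step 2: stop here, since a crawl would not parse or index this document.
            return 1;
        }
        log.append("parsing: ").append(url).append('\n');
        return 0;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        int exit = run("http://lucene.apache.org/nutch/",
                new FetchResult(false, "moved(12)"), log);
        System.out.print(log);
        System.out.println("exit code: " + exit);
    }
}
```

        Returning early keeps the checker from printing the body of a redirect response as if it were the indexed document.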

          People

          • Assignee: Unassigned
          • Reporter: Sebastian Nagel
          • Votes: 0
          • Watchers: 4