  1. Nutch
  2. NUTCH-1044

Redirected URLs and possibly all of their outlinked URLs have invalid scores.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: fetcher, parser
    • Labels: None

    Description

      1. http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
      2. http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html

      Note that URLs redirected via meta refresh also have invalid scores. For such URLs, a CrawlDatum is created on lines 157-177 of ParseOutputFormat.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup). The new CrawlDatum's score is not set anywhere after creation, so it stays at the default of 1.0f, as can be seen on line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).

      A separate question is whether the redirected URL's score should simply be passed on to the new URL, or whether the redirection should be treated as a link, in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' + 1).
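
The two scoring options can be compared with a small standalone sketch. It does not use the Nutch API and is not the attached patch. The class name, method names, and sample values are hypothetical; only the formula 'originalScore' / ('numberOfOutlinks' + 1) comes from the description.

```java
public class RedirectScoreSketch {

    // Option 1 (assumed name): the redirect target simply inherits
    // the score of the URL that redirected to it.
    static float passThrough(float originalScore) {
        return originalScore;
    }

    // Option 2 (assumed name): the redirect is treated as one extra
    // outlink, so the score is divided by numberOfOutlinks + 1.
    static float asExtraLink(float originalScore, int numberOfOutlinks) {
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float originalScore = 2.0f; // made-up sample score
        int numberOfOutlinks = 3;   // made-up sample outlink count

        System.out.println(passThrough(originalScore));                   // prints 2.0
        System.out.println(asExtraLink(originalScore, numberOfOutlinks)); // prints 0.5
    }
}
```

With these sample values, the first option carries the full score of 2.0 over to the redirect target, while the second gives it 0.5, the same share each of the three ordinary outlinks would receive.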

      Attachments

        1. NUTCH-1044-1.4.patch
          3 kB
          Julien Nioche


          People

            Assignee: Julien Nioche (jnioche)
            Reporter: Nutch User - 1 (nutch_user_1)
            Votes: 0
            Watchers: 0
