Nutch / NUTCH-1044

Redirected URLs and possibly all of their outlinked URLs have invalid scores.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: fetcher, parser
    • Labels:
      None

      Description

      1. http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
      2. http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html

      Please note that URLs redirected by meta refresh also have invalid scores. For such URLs a CrawlDatum is created on lines 157-177 of ParseOutputFormat.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup). The new CrawlDatum's score isn't set anywhere after creation, so it stays at 1.0f, as can be seen on line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).

      It's a separate question whether the redirected URL's score should simply be passed on to the new URL, or whether the redirection should be considered a link, in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' + 1).
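
To make the two options concrete, here is a minimal sketch of the arithmetic only. The class and method names are hypothetical and are not part of Nutch; the example values (a source score of 1.0 and 3 outlinks) are likewise just assumptions for illustration.

```java
// Hypothetical helper, not Nutch code: compares the two scoring options
// for a redirect target described in the report.
class RedirectScoreOptions {

    // Option A: the new URL simply inherits the redirected URL's score.
    static float passThrough(float originalScore) {
        return originalScore;
    }

    // Option B: the redirection counts as one more link, so the score is
    // divided among numberOfOutlinks + 1 targets.
    static float asExtraLink(float originalScore, int numberOfOutlinks) {
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float originalScore = 1.0f;   // assumed example value
        int numberOfOutlinks = 3;     // assumed example value
        System.out.println(passThrough(originalScore));                   // prints 1.0
        System.out.println(asExtraLink(originalScore, numberOfOutlinks)); // prints 0.25
    }
}
```

With option B the target's score shrinks as the redirecting page gains outlinks, whereas option A keeps it independent of the page's other links.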

        Activity

        Markus Jelsma added a comment -

        Can you provide a patch?

        Julien Nioche added a comment - edited

        I can confirm the issue. The solution is not straightforward and needs a bit of thinking.

        The new CrawlDatum's score isn't set anywhere after the creation so it's 1.0f as can be seen on the line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).

        The score is set by the method initialScore() in the ScoringFilters. See line 81 of OPICScoringFilter, which sets it to 0 by default because it expects the score to be modified later when the contributions from the inlinks arrive.

        There are several ways in which a URL can get a score:

        • by specifying the param 'db.score.injected' when injecting (default value = 1.0)
        • by passing it in the seed list for each individual URL as the value of the metadata 'nutch.score'
        • from inlinks (depends on the score of the source, the number of links, etc.)
        • from redirection, which is currently broken

        The default value of the score in CrawlDatum is 1.0 but this could be changed to 0.0. It also has a constructor

        CrawlDatum(int status, int fetchInterval, float score) 
        

        which allows the score to be specified. This constructor is used by the Fetcher when the redirs are refetched immediately; however, the calls to initialScore() currently reset the score to 0 straight away.

        We should probably change initialScore() in OPICScoringFilter so that by default it leaves the existing scores as they are and change the default value in CrawlDatum to 0.0. Using the CrawlDatum constructor above with the score of the source of the redir in the code of the Fetcher would fix the issue.
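
As a rough illustration of the proposal above, the sketch below models the behaviour with simplified stand-in types. Datum, initialScoreResetting(), initialScorePreserving() and redirectTarget() are hypothetical names invented for this example; they are not the real CrawlDatum, OPICScoringFilter or Fetcher code, and the 0.8 score is an assumed value.

```java
// Simplified model of the proposed fix. Nothing here is actual Nutch code.
class RedirectScoreSketch {

    // Stand-in for a CrawlDatum that only carries a score.
    static class Datum {
        float score;
        Datum(float score) { this.score = score; }
    }

    // Behaviour described in the comment: initialScore() overwrites the score with 0.
    static void initialScoreResetting(Datum datum) {
        datum.score = 0.0f;
    }

    // Proposed behaviour: by default leave the existing score as it is.
    static void initialScorePreserving(Datum datum) {
        // intentionally does nothing
    }

    // The redirect target is created with the score of the redirect source.
    static Datum redirectTarget(Datum source) {
        return new Datum(source.score);
    }

    public static void main(String[] args) {
        Datum source = new Datum(0.8f); // assumed score of the redirecting URL

        Datum before = redirectTarget(source);
        initialScoreResetting(before);
        System.out.println("resetting:  " + before.score); // score lost

        Datum after = redirectTarget(source);
        initialScorePreserving(after);
        System.out.println("preserving: " + after.score);  // source score kept
    }
}
```

The point of the sketch is only that passing the source score at construction has no effect unless initialScore() stops overwriting it.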

        I will need to look into this and make sure that it has no negative effect + check the cases where the redirection is obtained from a meta refresh tag in the code.

        Thanks for reporting it.

        Julien Nioche added a comment -

        Fixes the score of redirections by giving them the same score as the source of the redir

        Julien Nioche added a comment -

        Will commit soon if there aren't any objections

        Julien Nioche added a comment -

        Committed revision 1156342.

        Thanks for reporting it

        Markus Jelsma added a comment -

        Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220


          People

          • Assignee: Julien Nioche
          • Reporter: Nutch User - 1
          • Votes: 0
          • Watchers: 0
