  1. Nutch
  2. NUTCH-2749

Fetcher and scoring-opic: transfer score to redirects

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16
    • Fix Version/s: 1.21
    • Component/s: fetcher, plugin, scoring
    • Labels: None

    Description

      See the discussion "Score value lost after two successive redirects" dating back to 2012.

      Redirects should be able to pass scores to their targets. This is mandatory for reliable scoring; otherwise scores often get lost when a link target is redirected. E.g., when the target site has moved from http:// to https://, incoming links to http:// pages are usually redirected to https:// (on the target site), and the incoming score is lost. If the migration to https:// happened recently, the scores for this site might simply drop to zero.

      I agree with markus17's comment in the mentioned discussion that "it cannot be a good idea to just copy over the score". Instead, a redirect should have the same effect as a page containing a single href link.

      This would require the following change(s):

      1. in Fetcher (class FetcherThread): the score should be passed forward to the redirect target

      • because the method distributeScoreToOutlinks(...) cannot be called for redirects (no content is parsed), we would need a dedicated hook
        distributeScoreToRedirect(Text fromUrl, Text toUrl, CrawlDatum source, CrawlDatum target)
      • to be called both for "recorded" and followed redirects (depending on http.redirect.max)
      • scoring strategies can be implemented there, e.g. apply "db.score.link.{internal,external}"
      • to be implemented as a default method so that existing scoring filter plugins are not broken
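
      The hook proposed above could look roughly like the following sketch. It is not the Nutch API: Datum and ScoringFilterSketch are simplified, hypothetical stand-ins for CrawlDatum and the scoring filter interface, String replaces Text, and the two factors only mimic "db.score.link.{internal,external}" with made-up values. The point is that a default no-op method leaves existing plugins untouched, while an OPIC-like filter can treat the redirect as a single outlink.

```java
import java.net.URI;

final class RedirectScoreSketch {

  /** Hypothetical stand-in for CrawlDatum: carries only a score. */
  static final class Datum {
    private float score;
    Datum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
  }

  /** Hypothetical stand-in for the scoring filter interface. */
  interface ScoringFilterSketch {
    /** Default no-op: filters that do not override it keep their behavior. */
    default void distributeScoreToRedirect(String fromUrl, String toUrl,
        Datum source, Datum target) {
      // intentionally empty
    }
  }

  /** An existing plugin that knows nothing about the new hook. */
  static final class LegacyFilter implements ScoringFilterSketch { }

  /** Treats a redirect like a page with exactly one outlink. */
  static final class SingleLinkFilter implements ScoringFilterSketch {
    private final float internalFactor; // assumed analogue of db.score.link.internal
    private final float externalFactor; // assumed analogue of db.score.link.external

    SingleLinkFilter(float internalFactor, float externalFactor) {
      this.internalFactor = internalFactor;
      this.externalFactor = externalFactor;
    }

    @Override
    public void distributeScoreToRedirect(String fromUrl, String toUrl,
        Datum source, Datum target) {
      // One outlink only, so the whole source score is passed on, scaled by
      // the factor for same-host (internal) or cross-host (external) links.
      float factor = sameHost(fromUrl, toUrl) ? internalFactor : externalFactor;
      target.setScore(source.getScore() * factor);
    }
  }

  /** True if both URLs have the same host (case-insensitive). */
  static boolean sameHost(String a, String b) {
    String hostA = URI.create(a).getHost();
    String hostB = URI.create(b).getHost();
    return hostA != null && hostA.equalsIgnoreCase(hostB);
  }

  public static void main(String[] args) {
    Datum source = new Datum(2.0f);
    Datum target = new Datum(0.0f);
    new SingleLinkFilter(1.0f, 0.5f).distributeScoreToRedirect(
        "http://example.com/", "https://example.com/", source, target);
    System.out.println(target.getScore());
  }
}
```

      In this sketch an http:// to https:// redirect on the same host counts as an internal link, which matches the migration scenario described above. A real implementation would have to decide how host comparison interacts with URL normalization.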

      2. during CrawlDb update (class CrawlDbReducer), there are different cases to consider:

      a. URL not yet in CrawlDb: nothing to do if the score has been already passed forward (step 1)

      b. URL already in CrawlDb, redirects not followed in fetcher (http.redirect.max == 0): the redirect target has been stored as db_outlink, so it will be used in the scoring method updateDbScore(...) -> nothing to do

      c. URL already in CrawlDb, fetcher follows redirects: to get the same behavior as for incoming links we would need to mark fetches stemming from a followed redirect and use them in a modified updateDbScore(...)
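
      To make case 2c more concrete, here is a minimal sketch of what "marking" redirect-derived fetches and using them in a modified score update might mean. Everything in it is an assumption for illustration: Contribution, the fromRedirect flag and the countRedirects switch are hypothetical, and the update rule (old score plus the sum of contributions) is only an OPIC-like simplification, not the code in CrawlDbReducer.

```java
import java.util.List;

final class RedirectUpdateSketch {

  /** Hypothetical score contribution arriving at a URL during CrawlDb update. */
  static final class Contribution {
    final float score;
    final boolean fromRedirect; // hypothetical marker: stems from a followed redirect

    Contribution(float score, boolean fromRedirect) {
      this.score = score;
      this.fromRedirect = fromRedirect;
    }
  }

  /**
   * OPIC-like update: new score = old score + sum of counted contributions.
   * With countRedirects == false, redirect-marked contributions are dropped
   * (modelling the lost score); with true they count like ordinary inlinks.
   */
  static float updateDbScore(float oldScore, List<Contribution> contributions,
      boolean countRedirects) {
    float adder = 0.0f;
    for (Contribution c : contributions) {
      if (c.fromRedirect && !countRedirects) {
        continue; // score from a followed redirect is ignored
      }
      adder += c.score;
    }
    return oldScore + adder;
  }
}
```

      The sketch also hints at why this case is hard to control: if several followed redirects end in the same target, each of them adds its own contribution to that target's score.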

      Being pragmatic, I would address in this issue only point 1 and (implicitly) points 2a and 2b. Point 2c would require significant changes and isn't easy to control in the worst case, when multiple followed redirects all end in the same target.

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)