  1. Nutch
  2. NUTCH-2748

Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: crawldb, fetcher
    • Labels:
      None

      Description

      If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update, the "gone" item overwrites the status of existing items (including "db_fetched") with "db_gone". It shouldn't, as there has been no fetch of the final redirect target and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.

      There are two possible solutions:
      1. ignore protocol status "redir_exceeded" during CrawlDb update
      2. when http.redirect.max is hit the fetcher stores nothing or a redirect status instead of a fetch_gone
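
      To make solution 1 concrete, here is a minimal sketch of the status transition during a CrawlDb update. It is a simplified model and not Nutch's actual update code: the class, enums and method names are hypothetical, and only the status names are taken from the description above.

```java
/** Hypothetical, simplified model of a CrawlDb update (not Nutch's real classes). */
class CrawlDbUpdateSketch {
  enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_GONE }
  enum FetchStatus { FETCH_SUCCESS, FETCH_GONE }
  enum ProtocolStatus { SUCCESS, NOTFOUND, REDIR_EXCEEDED }

  /** Reported behavior: every fetch_gone overwrites the existing status with db_gone. */
  static DbStatus updateNaive(DbStatus existing, FetchStatus fetch, ProtocolStatus proto) {
    switch (fetch) {
      case FETCH_SUCCESS:
        return DbStatus.DB_FETCHED;
      case FETCH_GONE:
        return DbStatus.DB_GONE;
      default:
        return existing;
    }
  }

  /** Solution 1: a fetch_gone caused by redir_exceeded leaves the existing item untouched. */
  static DbStatus updateIgnoringRedirExceeded(
      DbStatus existing, FetchStatus fetch, ProtocolStatus proto) {
    if (fetch == FetchStatus.FETCH_GONE && proto == ProtocolStatus.REDIR_EXCEEDED) {
      return existing; // nothing is known about the final redirect target
    }
    return updateNaive(existing, fetch, proto);
  }

  public static void main(String[] args) {
    DbStatus existing = DbStatus.DB_FETCHED;
    System.out.println(
        updateNaive(existing, FetchStatus.FETCH_GONE, ProtocolStatus.REDIR_EXCEEDED));
    System.out.println(
        updateIgnoringRedirExceeded(
            existing, FetchStatus.FETCH_GONE, ProtocolStatus.REDIR_EXCEEDED));
  }
}
```

      In this sketch a genuine "gone" (for example protocol status NOTFOUND) still moves the item to db_gone; only the redir_exceeded case is filtered out.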

      Solution 2. seems easier to implement and it would be possible to make the behavior configurable:

      • store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
      • store "fetch_gone" (current behavior)
      • store nothing, i.e. ignore those redirects - this should be the default as it's close to the current behavior without the risk of accidentally setting successful fetches to db_gone
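
      The three configurable behaviors listed above could be modeled roughly as follows. This is only a sketch under assumptions: the policy enum, the method and the tab-separated record format are hypothetical illustrations and do not correspond to an existing Nutch property or API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a configurable reaction to an exceeded redirect chain. */
class RedirectExceededSketch {
  /** The three options from the list above (names are assumptions). */
  enum Policy { STORE_AS_OUTLINK, STORE_GONE, SKIP }

  /** Returns the records the fetcher would write for the last redirect target. */
  static List<String> onRedirectsExceeded(String targetUrl, Policy policy) {
    List<String> records = new ArrayList<>();
    switch (policy) {
      case STORE_AS_OUTLINK:
        // same as http.redirect.max == 0: target is recorded as a link, fetched later
        records.add(targetUrl + "\tlinked");
        break;
      case STORE_GONE:
        // current behavior: may overwrite db_fetched with db_gone on the next update
        records.add(targetUrl + "\tfetch_gone\tredir_exceeded");
        break;
      case SKIP:
        // proposed default: write nothing, existing CrawlDb items stay unchanged
        break;
    }
    return records;
  }

  public static void main(String[] args) {
    for (Policy p : Policy.values()) {
      System.out.println(p + " -> " + onRedirectsExceeded("http://example.com/c", p));
    }
  }
}
```

      With SKIP nothing reaches the CrawlDb update, so an existing db_fetched item cannot be downgraded by a redirect chain that was cut off.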

        Attachments

        1. test-NUTCH-2748.zip
          10 kB
          Sebastian Nagel

            People

            • Assignee:
              Unassigned
              Reporter:
              snagel Sebastian Nagel
            • Votes:
              0
              Watchers:
              4
