Nutch – NUTCH-2748

Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: crawldb, fetcher
    • Labels: None

    Description

      If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update the "gone" item overwrites the status of existing items (including "db_fetched") with "db_gone". It shouldn't, as there has been no fetch of the final redirect target and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.

      There are two possible solutions:
      1. ignore protocol status "redir_exceeded" during CrawlDb update
      2. when http.redirect.max is hit, the fetcher stores nothing or a redirect status instead of a fetch_gone

      Solution 2 seems easier to implement, and it would be possible to make the behavior configurable:

      • store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
      • store "fetch_gone" (current behavior)
      • store nothing, i.e. ignore those redirects - this should be the default, as it is close to the current behavior without the risk of accidentally setting successful fetches to db_gone
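      The three options above amount to a small policy switch in the fetcher. The sketch below is illustrative only, not actual Nutch code: the enum, class, and method names are hypothetical, and the returned strings merely stand in for the CrawlDatum statuses named in the description.

      ```java
      // Hypothetical sketch of the three proposed behaviors when
      // http.redirect.max is exceeded (names are not from the Nutch code base).
      public class RedirectExceededPolicy {

        enum Policy { FOLLOW_AS_OUTLINK, MARK_GONE, IGNORE }

        /**
         * Returns the status to record for a URL whose redirect chain
         * exceeded http.redirect.max, or null if nothing should be stored.
         */
        static String statusFor(Policy policy) {
          switch (policy) {
            case FOLLOW_AS_OUTLINK:
              // store the redirect target as outlink,
              // same behavior as if http.redirect.max == 0
              return "linked";
            case MARK_GONE:
              // current behavior: may later overwrite db_fetched with db_gone
              return "fetch_gone";
            case IGNORE:
            default:
              // proposed default: store nothing, no risk to existing items
              return null;
          }
        }

        public static void main(String[] args) {
          System.out.println(statusFor(Policy.MARK_GONE)); // prints fetch_gone
          System.out.println(statusFor(Policy.IGNORE));    // prints null
        }
      }
      ```

      With the IGNORE default, a redirect-exceeded fetch leaves the CrawlDb untouched, so a previously successful fetch can never be flipped to db_gone by an unresolved redirect chain.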

      Attachments

        1. test-NUTCH-2748.zip
          10 kB
          Sebastian Nagel


          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 4
