Nutch – NUTCH-2748

Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: crawldb, fetcher
    • Labels: None

    Description

      If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update the "gone" item overwrites the status of existing items (including "db_fetched") with "db_gone". It shouldn't, as there has been no fetch of the final redirect target and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.

      There are two possible solutions:
      1. ignore protocol status "redir_exceeded" during CrawlDb update
      2. when http.redirect.max is hit, the fetcher stores nothing or a redirect status instead of a fetch_gone

      Solution 2 seems easier to implement, and it would be possible to make the behavior configurable:

      • store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
      • store "fetch_gone" (current behavior)
      • store nothing, i.e. ignore those redirects - this should be the default, as it is close to the current behavior without the risk of accidentally setting successful fetches to db_gone
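      The three options above amount to a small policy switch in the fetcher. The sketch below is illustrative only, not actual Nutch code: the enum, class, and method names are hypothetical, and the returned strings merely stand in for the CrawlDatum statuses named in the description.

      ```java
      // Hypothetical sketch of the three proposed behaviors when
      // http.redirect.max is exceeded (names are not from the Nutch code base).
      public class RedirectExceededPolicy {

        enum Policy { FOLLOW_AS_OUTLINK, MARK_GONE, IGNORE }

        /**
         * Returns the status to record for a URL whose redirect chain
         * exceeded http.redirect.max, or null if nothing should be stored.
         */
        static String statusFor(Policy policy) {
          switch (policy) {
            case FOLLOW_AS_OUTLINK:
              // store the redirect target as outlink,
              // same behavior as if http.redirect.max == 0
              return "linked";
            case MARK_GONE:
              // current behavior: may later overwrite db_fetched with db_gone
              return "fetch_gone";
            case IGNORE:
            default:
              // proposed default: store nothing, no risk to existing items
              return null;
          }
        }

        public static void main(String[] args) {
          System.out.println(statusFor(Policy.MARK_GONE)); // prints fetch_gone
          System.out.println(statusFor(Policy.IGNORE));    // prints null
        }
      }
      ```

      With the IGNORE default, a redirect-exceeded fetch leaves the CrawlDb untouched, so a previously successful fetch can never be flipped to db_gone by an unresolved redirect chain.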

      Attachments

        1. test-NUTCH-2748.zip
          10 kB
          Sebastian Nagel


          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 4
