Nutch / NUTCH-962

max. redirects not handled correctly: fetcher stops at max-1 redirects


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, nutchgora
    • Fix Version/s: 1.3, nutchgora
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      The fetcher stops following redirects one redirect before the configured maximum (http.redirect.max) is reached.

      The description of http.redirect.max
      > The maximum number of redirects the fetcher will follow when
      > trying to fetch a page. If set to negative or 0, fetcher won't immediately
      > follow redirected URLs, instead it will record them for later fetching.
      suggests that if it is set to 1, one redirect will be followed.
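
      For reference, a minimal sketch of how that property would be set for a crawl, in the standard Hadoop configuration format used by conf/nutch-site.xml (the value 1 matches the test case below):

      ```xml
      <!-- conf/nutch-site.xml: follow at most one redirect per fetch -->
      <property>
        <name>http.redirect.max</name>
        <value>1</value>
      </property>
      ```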

      I tried to crawl two documents, the first redirecting to the second via
      <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
      with http.redirect.max = 1.
      The second document is not fetched, and its URL has state GONE in the CrawlDb.

      fetching file:/test/redirects/meta_refresh.html
      redirectCount=0
      -finishing thread FetcherThread, activeThreads=1

      The attached patch fixes this: if http.redirect.max is 1, one redirect is followed.
      Of course, this means there is no longer any way to skip redirects entirely, since 0
      (as well as negative values) means "treat redirects as ordinary links".
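
      The off-by-one can be illustrated with a simplified, hypothetical sketch of a redirect-following loop (this is not the actual Fetcher source; the method names and counting logic are illustrative only, with maxRedirect standing in for http.redirect.max):

      ```java
      // Hypothetical sketch: why a pre-increment check against maxRedirect - 1
      // follows one redirect too few.
      public class RedirectLoop {

          // Buggy variant: with maxRedirect = 1 the check 0 >= 0 trips
          // immediately, so no redirect is followed at all.
          static int followedBuggy(int chainLength, int maxRedirect) {
              int redirectCount = 0;
              for (int i = 0; i < chainLength; i++) {
                  if (redirectCount >= maxRedirect - 1) break;  // off-by-one
                  redirectCount++;                              // follow redirect
              }
              return redirectCount;
          }

          // Fixed variant: compare against maxRedirect itself, so
          // maxRedirect = 1 follows exactly one redirect, as the
          // property description promises.
          static int followedFixed(int chainLength, int maxRedirect) {
              int redirectCount = 0;
              for (int i = 0; i < chainLength; i++) {
                  if (redirectCount >= maxRedirect) break;
                  redirectCount++;                              // follow redirect
              }
              return redirectCount;
          }

          public static void main(String[] args) {
              // Redirect chain of length 5, max 1 redirect allowed:
              System.out.println(followedBuggy(5, 1));  // prints 0
              System.out.println(followedFixed(5, 1));  // prints 1
          }
      }
      ```

      This matches the log above: with http.redirect.max = 1 the fetcher reports redirectCount=0 and never fetches the redirect target.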

      Attachments

        1. Fetcher_redir.patch
          1 kB
          Sebastian Nagel


    People

      Assignee: Andrzej Bialecki (ab)
      Reporter: Sebastian Nagel (snagel)
      Votes: 0
      Watchers: 0

    Dates

      Created:
      Updated:
      Resolved: