The fetcher stops following redirects one redirect before the configured maximum (http.redirect.max) is reached.
The description of http.redirect.max:
> The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
suggests that if it is set to 1, one redirect will be followed.
With http.redirect.max = 1, I tried to crawl two documents, the first redirecting to the second via
<meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
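For reference, the setting was made the usual way, by overriding the property in conf/nutch-site.xml (a minimal sketch; only the http.redirect.max property is taken from this issue, the surrounding boilerplate is the standard Hadoop-style configuration wrapper):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Expectation per the property description: follow one redirect. -->
  <property>
    <name>http.redirect.max</name>
    <value>1</value>
  </property>
</configuration>
```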
The second document is not fetched and its URL has state GONE in the CrawlDb.
-finishing thread FetcherThread, activeThreads=1
- content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now)
- redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
The attached patch would fix this: if http.redirect.max is 1, one redirect is followed.
Of course, this means there is then no possibility to skip redirects at all, since 0
(as well as negative values) means "treat redirects as ordinary links", i.e. record them for later fetching.
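The behavior described above is a classic off-by-one in the redirect-count check. The sketch below is hypothetical (the method and variable names are illustrative, not Nutch's actual Fetcher code); it only demonstrates how pre-incrementing the count and testing with >= makes http.redirect.max = 1 reject the very first redirect, while the patched comparison follows exactly one:

```java
public class RedirectOffByOne {

  /** Buggy variant: with max = 1, the first redirect already
      trips the "redirect count exceeded" check, so 0 are followed. */
  static int redirectsFollowedBuggy(int max, int chainLength) {
    int followed = 0;
    while (followed < chainLength) {
      if (followed + 1 >= max) break;  // 1 >= 1 -> "exceeded"
      followed++;                      // follow this redirect
    }
    return followed;
  }

  /** Patched variant: with max = 1, exactly one redirect is followed. */
  static int redirectsFollowedFixed(int max, int chainLength) {
    int followed = 0;
    while (followed < chainLength) {
      if (followed + 1 > max) break;   // only stop beyond the max
      followed++;
    }
    return followed;
  }

  public static void main(String[] args) {
    // One-hop redirect chain, http.redirect.max = 1:
    System.out.println(redirectsFollowedBuggy(1, 1)); // prints 0
    System.out.println(redirectsFollowedFixed(1, 1)); // prints 1
  }
}
```

Note that with the fixed comparison, max = 0 (or a negative value) still follows no redirects at all, which is why those values can only mean "record for later fetching" rather than "follow some but not others".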
Status: Open → Resolved
Assignee: Andrzej Bialecki
Fix Version/s: 1.3, 2.0
Resolution: Fixed
Status: Resolved → Closed