Details
Description
The fetcher stops following redirects one redirect before the limit set by http.redirect.max is reached.
The description of http.redirect.max
> The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
suggests that, if set to 1, one redirect will be followed.
I tried to crawl two documents, the first redirecting via
<meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
to the second, with http.redirect.max = 1.
The second document is not fetched and the URL has state GONE in CrawlDb.
fetching file:/test/redirects/meta_refresh.html
redirectCount=0
-finishing thread FetcherThread, activeThreads=1
- content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now)
- redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
The attached patch would fix this: if http.redirect.max is 1, one redirect is followed.
Of course, this would mean there is no way to skip redirects entirely, since 0
(as well as negative values) means "treat redirects as ordinary links".
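To illustrate the off-by-one, here is a minimal, self-contained sketch. It is not the Nutch fetcher code and not the attached patch: the class RedirectLimitSketch, the method followRedirects, and the offByOne flag are hypothetical names, and the ">=" versus ">" comparison is only an assumption about how such a limit check can go wrong. It covers positive limits only; the "record for later fetching" behaviour for 0 and negative values is not modelled.

```java
import java.util.Map;

public class RedirectLimitSketch {

    /**
     * Follows a chain of redirects and returns how many were actually followed.
     *
     * redirects    maps a URL to its redirect target (no entry: not a redirect)
     * maxRedirects the configured limit, assumed positive in this sketch
     * offByOne     true models the reported behaviour (stops one redirect early),
     *              false models what the property description suggests
     */
    static int followRedirects(Map<String, String> redirects, String url,
                               int maxRedirects, boolean offByOne) {
        int redirectCount = 0;
        String target = redirects.get(url);
        while (target != null) {
            redirectCount++;
            // Hypothetical limit check: ">=" rejects the redirect that is
            // exactly at the limit, ">" still allows it.
            boolean exceeded = offByOne
                    ? redirectCount >= maxRedirects
                    : redirectCount > maxRedirects;
            if (exceeded) {
                // The current redirect is refused, so it is not counted.
                return redirectCount - 1;
            }
            url = target;
            target = redirects.get(url);
        }
        return redirectCount;
    }

    public static void main(String[] args) {
        // The two test documents from the report: the first redirects to the second.
        Map<String, String> redirects = Map.of(
                "file:/test/redirects/meta_refresh.html",
                "file:/test/redirects/to/meta_refresh_target.html");
        String start = "file:/test/redirects/meta_refresh.html";

        System.out.println("off-by-one check: "
                + followRedirects(redirects, start, 1, true) + " redirect(s) followed");
        System.out.println("corrected check:  "
                + followRedirects(redirects, start, 1, false) + " redirect(s) followed");
    }
}
```

With a limit of 1, the off-by-one variant follows no redirect, which matches the "redirect count exceeded" line in the log, while the corrected variant follows exactly one.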