Description
As I detailed in this GitHub comment, PR #221 appears to have broken redirects. Instead of fetching the redirect target, the fetcher repeatedly fetches the original URL until http.redirect.max is exceeded, and then ends with STATUS_FETCH_GONE.
I noticed this issue while trying to crawl a site that responds with a 301 MOVED PERMANENTLY status code.
Should be pretty easy to fix though: I was able to get redirects working again simply by inserting the code
url = fit.url
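To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the actual Nutch fetcher code: the class `RedirectSketch`, the method `follow`, the `redirects` map standing in for HTTP responses, and the constant `MAX_REDIRECTS` standing in for http.redirect.max are all hypothetical. The `updateUrl` flag only toggles whether the loop variable is reassigned to the redirect target, which is the role `url = fit.url` plays in the fix.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSketch {
    /** Hypothetical stand-in for the http.redirect.max setting. */
    static final int MAX_REDIRECTS = 3;

    /**
     * Follows redirects starting at startUrl. The redirects map is a
     * hypothetical replacement for real HTTP responses: a key that is
     * present means "this URL answers 301 with the mapped target".
     * When updateUrl is false, the loop never switches to the target,
     * mimicking the reported bug.
     */
    static String follow(String startUrl, Map<String, String> redirects, boolean updateUrl) {
        String url = startUrl;
        int redirectCount = 0;
        while (true) {
            String target = redirects.get(url);
            if (target == null) {
                return "SUCCESS " + url;      // no redirect: page fetched
            }
            redirectCount++;
            if (redirectCount > MAX_REDIRECTS) {
                return "GONE " + startUrl;    // redirect limit exceeded
            }
            if (updateUrl) {
                url = target;                 // the missing assignment
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/old", "http://example.com/new");

        // Without the assignment, the same URL is fetched until the limit is hit.
        System.out.println(follow("http://example.com/old", redirects, false));
        // With the assignment, the redirect target is fetched on the next pass.
        System.out.println(follow("http://example.com/old", redirects, true));
    }
}
```

In this sketch, the first call gives up with the "GONE" result after the redirect limit is exceeded, while the second call reaches the redirect target on the next iteration.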
Issue Links
- is caused by: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce (Closed)
- links to