Description
I have not changed the following parameter in nutch-default.xml:
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>
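For reference, an override of this value would normally go in conf/nutch-site.xml, which takes precedence over nutch-default.xml; a minimal sketch of such an override (just restating the default of 3) would be:

<!-- conf/nutch-site.xml: sketch of an override; the value 3 restates the default -->
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
</property>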
However, there is a URL on the site I'm crawling, www.teachertube.com, that keeps being generated over and over again, in almost every segment (many more times than 3):
fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
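For reference, the retry counter the generator consults is stored per URL in the CrawlDb and can be inspected with the CrawlDb reader (a sketch; crawl/crawldb is a placeholder for the actual CrawlDb path):

# Print the stored CrawlDatum for the failing URL;
# "crawl/crawldb" is a placeholder for the actual CrawlDb path.
bin/nutch readdb crawl/crawldb -url http://www.teachertube.com/images/

If the output shows "Retries since fetch" already above db.fetch.retry.max while the status is still db_unfetched, the limit is clearly not being applied to this kind of failure.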
This is a bug, right?
Thanks.
Issue Links
- is duplicated by
  - NUTCH-813 Repetitive crawl 403 status page (Closed)
- relates to
  - NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again (Closed)
  - NUTCH-1247 CrawlDatum.retries should be int (Closed)