Description
I have not changed the following parameter in nutch-default.xml:
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>
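For reference, an override of this value would normally go in conf/nutch-site.xml, which takes precedence over nutch-default.xml; a minimal sketch of such an override (just restating the default of 3) would be:

<!-- conf/nutch-site.xml: sketch of an override; the value 3 restates the default -->
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
</property>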
However, there is a URL on the site I'm crawling, www.teachertube.com, that keeps being generated over and over again, in almost every segment (many more times than 3):
fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
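For reference, the retry counter the generator consults is stored per URL in the CrawlDb and can be inspected with the CrawlDb reader (a sketch; crawl/crawldb is a placeholder for the actual CrawlDb path):

# Print the stored CrawlDatum for the failing URL;
# "crawl/crawldb" is a placeholder for the actual CrawlDb path.
bin/nutch readdb crawl/crawldb -url http://www.teachertube.com/images/

If the output shows "Retries since fetch" already above db.fetch.retry.max while the status is still db_unfetched, the limit is clearly not being applied to this kind of failure.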
This is a bug, right?
Thanks.
Issue Links
- is duplicated by
  - NUTCH-813 Repetitive crawl 403 status page (Closed)
- relates to
  - NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again (Closed)
  - NUTCH-1247 CrawlDatum.retries should be int (Closed)