Description
[Originally reported at http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html.]
From nutch-default.xml:
"
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>
    If the Crawl-Delay in robots.txt is set to greater than this value (in
    seconds) then the fetcher will skip this page, generating an error report.
    If set to -1 the fetcher will never skip such pages and will wait the
    amount of time retrieved from robots.txt Crawl-Delay, however long that
    might be.
  </description>
</property>
"
Fetcher.java (branch-1.3):
http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
Line 554 in Fetcher.java:
"this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;"
Lines 615-616 in Fetcher.java:
"
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
"
Now, the documentation states that if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, with -1 the multiplication on line 554 makes maxCrawlDelay equal to -1000 milliseconds, so the condition on line 616 is true for every page with a positive Crawl-Delay, and such pages are skipped: the exact opposite of the documented behavior.
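A possible fix, sketched here as an assumption rather than an actual patch from the project, would be to treat a negative maxCrawlDelay as "no upper limit" and only apply the skip check when it is non-negative:
"
if (rules.getCrawlDelay() > 0) {
  // Only enforce the limit when maxCrawlDelay is non-negative; a configured
  // value of -1 (i.e. -1000 after the multiplication) then means "never skip".
  if (maxCrawlDelay >= 0 && rules.getCrawlDelay() > maxCrawlDelay) {
    // skip the page and generate the error report, as before
  } else {
    // wait for the robots.txt Crawl-Delay, however long it might be
  }
}
"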
Issue Links
- is part of NUTCH-1284: Add site fetcher.max.crawl.delay as log output by default. (Closed)