Description
Web pages that could not be fetched because of a timeout are never refetched.
Their next-fetch time in the web-db is set to Long.MAX_VALUE.
Example:
-------------
While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
That by itself seems to be ok and indicates a temporary network problem.
The problem is that the entry in the web-db shows the following:
Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days
The 'Next fetch' date is set to the year '292278994'.
I probably won't live to see that refetch.
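
That date is simply Long.MAX_VALUE interpreted as epoch milliseconds, which can be verified quickly:

import java.util.Date;

public class MaxDateDemo {
    public static void main(String[] args) {
        // Long.MAX_VALUE ms after 1970-01-01 falls in the year 292278994,
        // matching the 'Next fetch' value in the web-db entry above.
        System.out.println(new Date(Long.MAX_VALUE));
        // prints e.g. "Sun Aug 17 07:12:55 GMT 292278994" (timezone-dependent)
    }
}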
A page that could not be crawled because of network problems
should be refetched with the next crawl (i.e. set its next-fetch date to the current time + 1h), as sketched below.
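
For illustration, a sketch of the desired rescheduling, assuming the web-db Page object exposes a setter like setNextFetchTime() (hypothetical name; the actual API may differ):

// Reschedule a transiently failed page one hour ahead instead of
// pushing its next fetch to Long.MAX_VALUE.
long oneHourMs = 60L * 60L * 1000L;
page.setNextFetchTime(System.currentTimeMillis() + oneHourMs);  // assumed setter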
Possible bug fix:
----------------------------
When updating the web-db, the method updateForSegment() in UpdateDatabaseTool
always sets the next-fetch date to Long.MAX_VALUE for any (unknown) exception during fetching,
because the RETRY status is not always set correctly, so such pages fall into the pageGone() branch.
Change the following lines:

} else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
           page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);           // assumed body: the retry counterpart to pageGone()
} else {
  pageGone(fo);            // give up: page is gone
}
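
A minimal sketch of what the changed condition could look like, assuming ProtocolStatus also defines an EXCEPTION code for fetch failures like the one above (names taken from the Nutch source where known, otherwise assumed):

// Treat unexpected fetch exceptions like RETRY, so transient network
// problems do not push the next-fetch date to Long.MAX_VALUE.
int code = fo.getProtocolStatus().getCode();
boolean transientFailure = (code == ProtocolStatus.RETRY ||
                            code == ProtocolStatus.EXCEPTION);  // assumed constant

if (transientFailure && page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);   // reschedule: next fetch = now + retry interval
} else {
  pageGone(fo);    // give up only on permanent failures / exhausted retries
}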