Description
Web pages that could not be fetched because of a timeout are never refetched.
Their next-fetch time in the web-db is set to Long.MAX_VALUE.
Example:
-------------
While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
That by itself seems to be ok and indicates a temporary network problem.
The problem is that the entry in the web-db shows the following:
Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days
The 'Next fetch' date is set to the year '292278994'.
I probably won't live to see that refetch.
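
That date is simply Long.MAX_VALUE interpreted as epoch milliseconds, which can be verified quickly:

import java.util.Date;

public class MaxDateDemo {
    public static void main(String[] args) {
        // Long.MAX_VALUE ms after 1970-01-01 falls in the year 292278994,
        // matching the 'Next fetch' value in the web-db entry above.
        System.out.println(new Date(Long.MAX_VALUE));
        // prints e.g. "Sun Aug 17 07:12:55 GMT 292278994" (timezone-dependent)
    }
}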
A page that could not be crawled because of network problems
should be refetched with the next crawl (i.e. set its next-fetch date to the current time + 1h), as sketched below.
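
For illustration, a sketch of the desired rescheduling, assuming the web-db Page object exposes a setter like setNextFetchTime() (hypothetical name; the actual API may differ):

// Reschedule a transiently failed page one hour ahead instead of
// pushing its next fetch to Long.MAX_VALUE.
long oneHourMs = 60L * 60L * 1000L;
page.setNextFetchTime(System.currentTimeMillis() + oneHourMs);  // assumed setter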
Possible bug fix:
----------------------------
When updating the web-db, the method updateForSegment() in UpdateDatabaseTool
always sets the next-fetch date to Long.MAX_VALUE for any (unknown) exception during fetching,
because the RETRY status is not always set correctly, so such pages fall into the pageGone() branch.
Change the following lines:

} else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
           page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);           // assumed body: the retry counterpart to pageGone()
} else {
  pageGone(fo);            // give up: page is gone
}
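
A minimal sketch of what the changed condition could look like, assuming ProtocolStatus also defines an EXCEPTION code for fetch failures like the one above (names taken from the Nutch source where known, otherwise assumed):

// Treat unexpected fetch exceptions like RETRY, so transient network
// problems do not push the next-fetch date to Long.MAX_VALUE.
int code = fo.getProtocolStatus().getCode();
boolean transientFailure = (code == ProtocolStatus.RETRY ||
                            code == ProtocolStatus.EXCEPTION);  // assumed constant

if (transientFailure && page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);   // reschedule: next fetch = now + retry interval
} else {
  pageGone(fo);    // give up only on permanent failures / exhausted retries
}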