[NUTCH-516] Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: fetcher
Labels: None
Environment: Java 1.6, Linux 2.6

Description

We cannot crawl some pages because of a robots restriction. In this case we update the db with the metadata pst:robots_denied(18), set the status code to 3, and change the fetch interval to 67.5 days.

Unfortunately the fetch time is never changed, so the page keeps being generated and fetched on every cycle.
We should update the scheduled fetch time in the crawldb to reflect the fetch interval.

We should add the following to CrawlDbReducer:
    case CrawlDatum.STATUS_FETCH_GONE:            // permanent failure
      if (old != null)
        result.setSignature(old.getSignature());  // use old signature
      result.setStatus(CrawlDatum.STATUS_DB_GONE);
      result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime());

      // set the schedule
      result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);

      break;
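The effect described above can be illustrated with a minimal, self-contained sketch. It assumes a generator that selects any entry whose fetch time is not in the future. The names `GoneScheduleSketch`, `Entry`, `markGoneIntervalOnly`, `markGoneAndReschedule`, and `isDue` are hypothetical stand-ins for illustration, not Nutch's actual CrawlDatum or FetchSchedule API.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a crawl db entry; NOT the real Nutch CrawlDatum.
final class GoneScheduleSketch {

    static final class Entry {
        Instant fetchTime;       // when the page is next due for fetching
        Duration fetchInterval;  // intended gap between fetches

        Entry(Instant fetchTime, Duration fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    // Reported behavior: the interval grows, but the due time never moves.
    static void markGoneIntervalOnly(Entry e, Duration newInterval) {
        e.fetchInterval = newInterval;
    }

    // Proposed behavior: also push the due time out by the new interval.
    static void markGoneAndReschedule(Entry e, Duration newInterval, Instant fetchedAt) {
        e.fetchInterval = newInterval;
        e.fetchTime = fetchedAt.plus(newInterval);
    }

    // Assumed generator rule: an entry is selected once its fetch time has passed.
    static boolean isDue(Entry e, Instant now) {
        return !e.fetchTime.isAfter(now);
    }

    public static void main(String[] args) {
        Instant fetchedAt = Instant.parse("2007-07-17T12:00:00Z");
        Instant nextCycle = fetchedAt.plus(Duration.ofDays(1));
        Duration goneInterval = Duration.ofHours(1620); // 67.5 days, as in the report

        Entry stale = new Entry(fetchedAt, Duration.ofDays(30));
        markGoneIntervalOnly(stale, goneInterval);
        System.out.println("due after interval-only update: " + isDue(stale, nextCycle));

        Entry fixed = new Entry(fetchedAt, Duration.ofDays(30));
        markGoneAndReschedule(fixed, goneInterval, fetchedAt);
        System.out.println("due after reschedule: " + isDue(fixed, nextCycle));
    }
}
```

In this sketch the first entry is still due on the next cycle even though its interval is 67.5 days, while the second is not due until the interval has elapsed.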

Attachments

NUTCH-516.patch
18/Jul/07 06:34
1 kB
Emmanuel Joke

People

Assignee: Unassigned

Reporter: Emmanuel Joke

Dates

Created: 17/Jul/07 12:08

Updated: 10/Apr/09 12:29

Resolved: 26/Jul/07 08:36