Description
We cannot crawl some pages due to a robots.txt restriction. In this case we update the CrawlDb entry with the metadata pst:robots_denied(18), we set the status to 3 (db_gone), and we change the fetch interval to 67.5 days.
Unfortunately, the fetch time is never changed, so we keep generating and fetching this page on every cycle.
We should update the fetch schedule in the CrawlDb to reflect the fetch interval.
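For illustration, here is a minimal sketch of the intended effect; the helper below is hypothetical and not Nutch's actual code. It assumes CrawlDatum's getFetchInterval() (seconds) and setFetchTime() (milliseconds) accessors: once a page is marked gone, its next fetch time should be pushed forward by the (already increased) interval so the Generator leaves it alone until that interval has elapsed.

import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical helper, for illustration only: advance the next fetch time
// by the fetch interval already stored on the datum (e.g. 67.5 days).
public class PageGoneFetchTimeSketch {

  public static CrawlDatum advanceFetchTime(CrawlDatum datum, long fetchTimeMillis) {
    // fetchInterval is kept in seconds, fetch time in milliseconds
    datum.setFetchTime(fetchTimeMillis + (long) datum.getFetchInterval() * 1000L);
    return datum;
  }
}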
We should add the following in CrawlDbReducer:
case CrawlDatum.STATUS_FETCH_GONE:           // permanent failure
  if (old != null)
    result.setSignature(old.getSignature()); // use old signature
  result.setStatus(CrawlDatum.STATUS_DB_GONE);
  result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  // set the schedule
  result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
  break;
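With the schedule calls in place, the entry's fetch time moves into the future, and the Generator only selects URLs whose fetch time has passed, so the robots-denied page is no longer re-generated on every cycle. Below is a simplified sketch of that eligibility test (the class and method names are illustrative; the real Generator also applies filters, normalization, and scoring):

import org.apache.nutch.crawl.CrawlDatum;

// Simplified illustration of the generator-side check this fix relies on.
public class GeneratorEligibilitySketch {

  public static boolean isDue(CrawlDatum datum, long curTime) {
    // Only URLs whose scheduled fetch time has passed are selected for fetching.
    return datum.getFetchTime() <= curTime;
  }
}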