Nutch / NUTCH-516

Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None
    • Environment: Java 1.6, Linux 2.6

    Description

      Some pages cannot be crawled because of a robots restriction. In this case we update the db with the metadata pst:robots_denied(18), set the status code to 3, and change the fetch interval to 67.5 days.

      Unfortunately the fetch time is never changed, so the page keeps being generated and fetched on every cycle.
      We should update the scheduled fetch time in the crawldb to reflect the fetch interval.

      The following case should be added to CrawlDbReducer:

      case CrawlDatum.STATUS_FETCH_GONE:           // permanent failure
        if (old != null)
          result.setSignature(old.getSignature()); // use old signature
        result.setStatus(CrawlDatum.STATUS_DB_GONE);
        result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
            prevModifiedTime, fetch.getFetchTime());

        // set the schedule
        result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
            prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);

        break;
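
To illustrate why the fetch time has to move together with the interval, here is a minimal, self-contained sketch. It does not use Nutch classes: `Entry`, `isDue`, `markGoneIntervalOnly` and `markGoneAndReschedule` are hypothetical names, and the selection rule (a page is due once its fetch time has passed) is an assumption made for the example, not a copy of the generator's logic.

```java
import java.util.concurrent.TimeUnit;

// Simplified, hypothetical model of a crawldb entry. This is NOT Nutch's CrawlDatum.
class GoneScheduleSketch {
    static final long DAY_MS = TimeUnit.DAYS.toMillis(1);

    static class Entry {
        long fetchTime;       // next time the page is due, in epoch milliseconds
        double intervalDays;  // re-fetch interval, in days

        Entry(long fetchTime, double intervalDays) {
            this.fetchTime = fetchTime;
            this.intervalDays = intervalDays;
        }
    }

    // Assumed selection rule: a page is picked once its fetch time has passed.
    static boolean isDue(Entry e, long now) {
        return e.fetchTime <= now;
    }

    // Behaviour described in the report: the interval changes, the fetch time does not.
    static void markGoneIntervalOnly(Entry e, double newIntervalDays) {
        e.intervalDays = newIntervalDays;
    }

    // Behaviour asked for in the report: the fetch time follows the new interval.
    static void markGoneAndReschedule(Entry e, double newIntervalDays, long now) {
        e.intervalDays = newIntervalDays;
        e.fetchTime = now + Math.round(newIntervalDays * DAY_MS);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        Entry stuck = new Entry(now - DAY_MS, 30.0);
        markGoneIntervalOnly(stuck, 67.5);
        System.out.println("interval only, due now: " + isDue(stuck, now));

        Entry moved = new Entry(now - DAY_MS, 30.0);
        markGoneAndReschedule(moved, 67.5, now);
        System.out.println("rescheduled, due now: " + isDue(moved, now));
        System.out.println("rescheduled, due in 68 days: " + isDue(moved, now + 68 * DAY_MS));
    }
}
```

In this model the entry whose interval alone was changed stays due on every pass, while the rescheduled entry only becomes due again after the 67.5-day interval has elapsed.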

      Attachments

        1. NUTCH-516.patch
          1 kB
          Emmanuel Joke

          People

            Assignee: Unassigned
            Reporter: Emmanuel Joke (jokeout)
            Votes: 0
            Watchers: 0
