  1. Nutch
  2. NUTCH-205

Wrong 'fetch date' for non available pages

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.7.1
    • Fix Version/s: 0.8.1, 0.9.0
    • Component/s: fetcher
    • Labels: None
    • Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API

    Description

      Web pages that could not be fetched because of a time-out are never refetched.
      Their next fetch date in the web-db is set to Long.MAX_VALUE.

      Example:
      -------------
      While fetching our URLs, we got some errors like this:
      60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

      That seems to be ok and indicates some network problems.

      The problem is that the entry in the web-db then shows the following:

      Page 4: Version: 4
      URL: http://www.test-domain.de/crawl_html/page_2.html
      ID: b360ec931855b0420776909bd96557c0
      Next fetch: Sun Aug 17 07:12:55 CET 292278994
      Retries since fetch: 0
      Retry interval: 0 days

      The 'Next fetch' date is set to the year 292278994.
      I will probably not live to see that refetch.
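
      To illustrate where that year comes from, here is a minimal sketch that interprets Long.MAX_VALUE as milliseconds since the epoch, which is how a Java timestamp is read. The class and method names (MaxFetchDate, yearOf) are made up for this example and are not part of Nutch.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper class, not part of the Nutch code base.
class MaxFetchDate {

    // Returns the calendar year (UTC) of the instant `millis` ms after the epoch.
    static int yearOf(long millis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(millis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        // Long.MAX_VALUE used as a timestamp lands in the year 292278994.
        System.out.println(yearOf(Long.MAX_VALUE)); // prints 292278994
    }
}
```

      A 'Next fetch' value in that year is therefore consistent with the field holding Long.MAX_VALUE.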

      A page that could not be crawled because of network problems
      should be refetched with the next crawl (i.e. next fetch date = current time + 1h).
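
      The expected scheduling could be sketched roughly as follows. This is only an illustration of the "current time + 1h" rule stated above; RetrySchedule and nextFetchAfterTransientFailure are hypothetical names, and the one-hour delay is the value suggested in this report, not a Nutch constant.

```java
// Hypothetical helper class, not part of the Nutch code base.
class RetrySchedule {

    // One hour in milliseconds (the delay proposed in this report).
    static final long ONE_HOUR_MS = 60L * 60L * 1000L;

    // Next fetch time for a page whose last fetch failed for a transient
    // reason such as a network time-out: one hour after `nowMillis`.
    static long nextFetchAfterTransientFailure(long nowMillis) {
        return nowMillis + ONE_HOUR_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long next = nextFetchAfterTransientFailure(now);
        System.out.println("refetch in " + (next - now) + " ms");
    }
}
```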

      Possible Bug-Fixing:
      ----------------------------

      When updating the web-db, the method updateForSegment() in UpdateDatabaseTool
      always sets the fetch date to Long.MAX_VALUE for any (unknown) exception during fetching.
      The RETRY status is not always set correctly.

      Change the following lines:

      } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
                 page.getRetriesSinceFetch() < MAX_RETRIES) {
        pageRetry(fo);   // retry later
      } else {
        pageGone(fo);    // give up: page is gone
      }
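
      The effect of that branch can be reproduced in isolation with the small model below. It is a sketch under stated assumptions, not the Nutch implementation: the class UpdateDecision, the status constants and the MAX_RETRIES value of 3 are all invented for illustration. It shows that a failure which does not carry the RETRY status falls through to the "gone" branch, even when the underlying cause was only a time-out.

```java
// Hypothetical stand-alone model of the branch above, not Nutch code.
class UpdateDecision {

    enum Outcome { RETRY_LATER, GONE }

    // Illustrative status codes; the real values in Nutch may differ.
    static final int STATUS_RETRY = 1;
    static final int STATUS_EXCEPTION = 2;

    // Assumed retry limit for this example.
    static final int MAX_RETRIES = 3;

    // Mirrors the condition quoted in the report: retry only when the status
    // is RETRY and the retry budget is not used up, otherwise give up.
    static Outcome decide(int statusCode, int retriesSinceFetch) {
        if (statusCode == STATUS_RETRY && retriesSinceFetch < MAX_RETRIES) {
            return Outcome.RETRY_LATER;
        }
        return Outcome.GONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(STATUS_RETRY, 0));      // prints RETRY_LATER
        System.out.println(decide(STATUS_EXCEPTION, 0));  // prints GONE
    }
}
```

      In this model, a time-out recorded as a generic exception is treated as a permanently missing page, which matches the behaviour described in this report.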

          People

            Assignee: Unassigned
            Reporter: M.Oliver Scheele (mos)
            Votes: 0
            Watchers: 1
