NUTCH-2767

Fetcher to stop filling queues skipped due to repeated exceptions


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      Since NUTCH-769, the fetcher skips URLs from queues that have already accumulated more exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues are emptied when the threshold is reached. However, the QueueFeeder may still be feeding queues and may again add URLs to queues that are already over the exception threshold. The first URL in such a queue is then fetched, and the remaining ones are removed once the next exception is observed.

      Here is one example:

      2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * queue: www.example.com >> removed 61 URLs from queue because 40 exceptions occurred
      2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 172 fetching https://www.example.com/... (queue crawl delay=5000ms)
      2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 172 fetch of https://www.example.com/... failed with: ...
      2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 111 fetching https://www.example.com/... (queue crawl delay=5000ms)
      2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 111 fetch of https://www.example.com/... failed with: ...
      2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 103 fetching https://www.example.com/... (queue crawl delay=5000ms)
      2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 103 fetch of https://www.example.com/... failed with: ...
      2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1 URLs from queue because 41 exceptions occurred
      ... (fetching again 30 URLs, all failed)
      2020-02-19 06:28:23,578 INFO [FetcherThread] org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1 URLs from queue because 42 exceptions occurred
      

      QueueFeeder should check whether the exception threshold has already been reached and, if so, not add further URLs to the queue.
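
The proposed check could be sketched roughly as follows. This is a minimal illustration, not the actual patch and not Nutch's real classes: `SketchFetchQueues` and its methods `isOverThreshold`, `offer`, `recordException`, and `size` are hypothetical names, and treating a negative threshold as "check disabled" is an assumption made for the sketch.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stand-in for the fetcher's per-host queues (not Nutch's API). */
class SketchFetchQueues {
    // Plays the role of "fetcher.max.exceptions.per.queue".
    private final int maxExceptionsPerQueue;
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private final Map<String, Integer> exceptionCounts = new HashMap<>();

    SketchFetchQueues(int maxExceptionsPerQueue) {
        this.maxExceptionsPerQueue = maxExceptionsPerQueue;
    }

    /** True once the queue has reached the threshold (negative = disabled, an assumption). */
    boolean isOverThreshold(String queueId) {
        return maxExceptionsPerQueue >= 0
            && exceptionCounts.getOrDefault(queueId, 0) >= maxExceptionsPerQueue;
    }

    /** Feeder side: refuse the URL if its queue is already over the threshold. */
    boolean offer(String queueId, String url) {
        if (isOverThreshold(queueId)) {
            return false; // skipped instead of being queued and fetched in vain
        }
        queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
        return true;
    }

    /** Fetcher side: count a failure and purge the queue once the threshold is hit. */
    int recordException(String queueId) {
        exceptionCounts.merge(queueId, 1, Integer::sum);
        if (!isOverThreshold(queueId)) {
            return 0;
        }
        Queue<String> queue = queues.get(queueId);
        if (queue == null) {
            return 0;
        }
        int removed = queue.size();
        queue.clear();
        return removed; // number of URLs dropped from the queue
    }

    /** Number of URLs currently waiting in the given queue. */
    int size(String queueId) {
        Queue<String> queue = queues.get(queueId);
        return queue == null ? 0 : queue.size();
    }
}
```

With a check like the one in `offer`, a queue that was emptied at the threshold stays empty, so the repeated "removed 1 URLs from queue" pattern in the log above would not occur for that queue.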

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)