Description
Since NUTCH-769 the fetcher skips URLs from queues which have already accumulated more exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues are emptied when the threshold is reached. However, the QueueFeeder may still be feeding queues and can add URLs again to queues which are already over the exception threshold. The first URL in such a queue is then fetched, and the remaining ones are removed only once the next exception is observed.
Here is one example:
2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * queue: ww.example.com >> removed 61 URLs from queue because 40 exceptions occurred
2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 172 fetching https://www.example.com/... (queue crawl delay=5000ms)
2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 172 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 111 fetching https://www.example.com/... (queue crawl delay=5000ms)
2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 111 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 103 fetching https://www.example.com/... (queue crawl delay=5000ms)
2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: FetcherThread 103 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1 URLs from queue because 41 exceptions occurred
... (fetching again 30 URLs, all failed)
2020-02-19 06:28:23,578 INFO [FetcherThread] org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1 URLs from queue because 42 exceptions occurred
QueueFeeder should check whether the exception threshold has already been reached for a queue and, if so, not add further URLs to that queue.
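The proposed check could look roughly like the following. This is a minimal, self-contained sketch and not the actual Nutch code: the class ToyFetchQueues and its methods addFetchItem, recordException, isOverThreshold and size are hypothetical names made up for illustration, and treating a negative threshold as "limit disabled" is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of per-host fetch queues (hypothetical, not the Nutch classes). */
class ToyFetchQueues {
  // Plays the role of fetcher.max.exceptions.per.queue;
  // assumption of this sketch: a negative value disables the limit.
  private final int maxExceptionsPerQueue;
  private final Map<String, Queue<String>> queues = new HashMap<>();
  private final Map<String, Integer> exceptionCounts = new HashMap<>();

  ToyFetchQueues(int maxExceptionsPerQueue) {
    this.maxExceptionsPerQueue = maxExceptionsPerQueue;
  }

  /** True if the queue has already reached the exception threshold. */
  boolean isOverThreshold(String queueId) {
    return maxExceptionsPerQueue >= 0
        && exceptionCounts.getOrDefault(queueId, 0) >= maxExceptionsPerQueue;
  }

  /**
   * Called by the feeder. The URL is skipped (return value false) if the
   * queue is already over the exception threshold, so the queue is not
   * refilled after it has been emptied.
   */
  boolean addFetchItem(String queueId, String url) {
    if (isOverThreshold(queueId)) {
      return false;
    }
    queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
    return true;
  }

  /**
   * Called by a fetcher thread after a failed fetch. Once the threshold is
   * reached, the queue is emptied; returns the number of removed URLs.
   */
  int recordException(String queueId) {
    exceptionCounts.merge(queueId, 1, Integer::sum);
    if (!isOverThreshold(queueId)) {
      return 0;
    }
    Queue<String> queue = queues.get(queueId);
    if (queue == null) {
      return 0;
    }
    int removed = queue.size();
    queue.clear();
    return removed;
  }

  /** Number of URLs currently waiting in the given queue. */
  int size(String queueId) {
    Queue<String> queue = queues.get(queueId);
    return queue == null ? 0 : queue.size();
  }
}
```

With the check in addFetchItem, a queue that has been emptied because of too many exceptions stays empty, instead of being refilled and drained one failed fetch at a time as in the log above.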
Issue Links
- supersedes NUTCH-1687 Pick queue in Round Robin (Closed)