[NUTCH-1067] Configure minimum throughput for fetcher - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4
Component/s: fetcher
Labels: None
Patch Info: Patch Available

Description

Large fetches can contain many URLs for the same domain. These can be very slow to crawl because of politeness delays from robots.txt, e.g. 10 s per URL. Once all other URLs have been fetched, these queues can stall the entire fetcher: 60 URLs at 10 s each take 10 minutes or even more. This can usually be dealt with using the time bomb, but a good time bomb value is hard to determine.

This patch adds a fetcher.throughput.threshold setting: the minimum number of pages per second the fetcher must sustain before it gives up. The check does not use the global average (total pages / running time); it records the number of pages actually processed in the previous second and compares that value with the configured threshold.
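The per-second check described above could look roughly like the following. This is a minimal sketch, not the code from the attached patches: the class `ThroughputCheck`, its methods, and the convention that a negative threshold disables the check are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a per-second throughput check; not taken from the patch. */
public class ThroughputCheck {
    // Stand-in for the configured fetcher.throughput.threshold value.
    // Assumption: a negative value disables the check.
    private final int thresholdPagesPerSec;

    // Total pages fetched so far, incremented by the fetcher threads.
    private final AtomicLong pages = new AtomicLong();

    // Page count seen at the previous tick; only touched by the monitoring thread.
    private long pagesAtLastTick = 0;

    public ThroughputCheck(int thresholdPagesPerSec) {
        this.thresholdPagesPerSec = thresholdPagesPerSec;
    }

    /** Called by a fetcher thread each time a page has been processed. */
    public void pageFetched() {
        pages.incrementAndGet();
    }

    /**
     * Intended to be called once per second by the monitoring thread.
     * Compares the pages processed since the previous call (not the
     * global average) with the threshold.
     *
     * @return true if the fetcher should give up
     */
    public boolean tick() {
        long now = pages.get();
        long pagesLastSecond = now - pagesAtLastTick;
        pagesAtLastTick = now;
        return thresholdPagesPerSec >= 0 && pagesLastSecond < thresholdPagesPerSec;
    }

    public static void main(String[] args) {
        ThroughputCheck check = new ThroughputCheck(2);
        // First "second": three pages processed, above the threshold of 2.
        check.pageFetched();
        check.pageFetched();
        check.pageFetched();
        System.out.println("give up? " + check.tick());
        // Second "second": only one page processed, below the threshold.
        check.pageFetched();
        System.out.println("give up? " + check.tick());
    }
}
```

Using the count from the last interval rather than the running average means a long, fast start to the fetch cannot mask a stall at the end, which is the situation the description is concerned with.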

Besides the check, the fetcher's status is also updated with the actual number of pages per second and bytes per second.
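As a rough illustration of the status update, the two per-second figures could be rendered into a status line like this. The class `FetcherStatus`, the method `format`, and the exact wording of the status string are hypothetical; the patch may report these values differently.

```java
import java.util.Locale;

/** Hypothetical sketch of a status line carrying per-second figures. */
public class FetcherStatus {
    /**
     * Builds a status string from the pages and bytes processed in the
     * previous second. The layout of the string is an assumption.
     */
    public static String format(long pagesLastSecond, long bytesLastSecond) {
        return String.format(Locale.ROOT, "%d pages/s, %d bytes/s",
                pagesLastSecond, bytesLastSecond);
    }

    public static void main(String[] args) {
        System.out.println(format(5, 2048));
    }
}
```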

Attachments

- NUTCH-1045-1.4-v2.patch (14/Sep/11 12:05, 144 kB, Markus Jelsma)
- NUTCH-1067-1.4-1.patch (22/Jul/11 14:32, 4 kB, Markus Jelsma)
- NUTCH-1067-1.4-2.patch (02/Aug/11 13:26, 5 kB, Markus Jelsma)
- NUTCH-1067-1.4-3.patch (09/Aug/11 16:26, 7 kB, Markus Jelsma)
- NUTCH-1067-1.4-4.patch (22/Aug/11 13:23, 7 kB, Markus Jelsma)

People

Assignee: Markus Jelsma
Reporter: Markus Jelsma
Votes: 0
Watchers: 0

Dates

Created: 22/Jul/11 14:31
Updated: 06/Mar/12 10:59
Resolved: 16/Sep/11 11:19