[NUTCH-762] Alternative Generator which can generate several segments in one parse of the crawlDB - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1
Component/s: generator
Labels: None
Patch Info: Patch Available

Description

When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate, update) tend to take up most of the time. One solution is to keep such operations to a minimum by generating several fetchlists in one pass over the crawlDB and then updating the crawlDB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB as many times as we generate segments.

The attached patch contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator also differs from the Generator in other respects:

- it can filter the URLs by score
- normalisation is optional
- IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDB is too slow to be usable on a large scale
- it can limit the number of URLs per host or domain (but not per IP)
- it can partition by host, domain or IP

Typically the same unit (e.g. domain) would be used for limiting the URLs and for partitioning; however, as the maximum number of URLs cannot be counted by IP, another unit (host or domain) must be chosen for the limit when partitioning by IP.
We found that filtering on the score can dramatically improve performance, as it reduces the amount of data sent to the reducers.
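
To make the selection described above more concrete, here is a minimal Java sketch of filling several fetchlists in a single pass over the entries. It is not the code from the attached patch: the class MultiSegmentSketch, the Entry type and the select method are hypothetical names, everything runs in memory rather than as a MapReduce job, and the per-host limit is counted across the whole run, which is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the MultiGenerator from the patch. */
class MultiSegmentSketch {

    /** Minimal stand-in for a crawlDB entry (hypothetical type). */
    static final class Entry {
        final String url;
        final String host;
        final float score;

        Entry(String url, String host, float score) {
            this.url = url;
            this.host = host;
            this.score = score;
        }
    }

    /**
     * Reads the entries once and fills up to maxNumSegments fetchlists.
     * Entries scoring below minScore are dropped early, and at most
     * maxPerHost URLs are kept per host (assumption: counted over the run).
     */
    static List<List<Entry>> select(List<Entry> entries, float minScore,
                                    int maxPerHost, int topN, int maxNumSegments) {
        List<List<Entry>> segments = new ArrayList<>();
        Map<String, Integer> perHost = new HashMap<>();
        for (Entry e : entries) {
            if (e.score < minScore) {
                continue; // score filter: less data to carry forward
            }
            int seen = perHost.getOrDefault(e.host, 0);
            if (seen >= maxPerHost) {
                continue; // host already has its maximum number of URLs
            }
            // Open a new segment when there is none yet or the current one is full.
            if (segments.isEmpty() || segments.get(segments.size() - 1).size() >= topN) {
                if (segments.size() == maxNumSegments) {
                    break; // every allowed segment is full
                }
                segments.add(new ArrayList<>());
            }
            segments.get(segments.size() - 1).add(e);
            perHost.put(e.host, seen + 1);
        }
        return segments;
    }
}
```

Counting per domain instead of per host would only change the key used in the perHost map; the single read of the input is the point of the sketch.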

The MultiGenerator is called via: nutch org.apache.nutch.crawl.MultiGenerator ...
with the following options:
MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

where most parameters are the same as for the default Generator, apart from:
-noNorm: self-explanatory (skips normalisation)
-topN: maximum number of URLs per segment
-maxNumSegments: maximum number of segments to generate; the actual number could be lower than the selected maximum, e.g. when not enough URLs are available for fetching and they fit in fewer segments
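
As a rough illustration of how -topN and -maxNumSegments interact, the arithmetic can be sketched in Java as below. This assumes that -topN caps each segment as described above and ignores any per-host or per-domain limits; the class SegmentCount and the method expectedSegments are hypothetical names and not part of the patch.

```java
/** Hypothetical helper; only illustrates the arithmetic, not the patch. */
class SegmentCount {

    /**
     * Estimates how many segments are produced when eligibleUrls URLs are
     * due for fetching, each segment holds at most topN URLs, and at most
     * maxNumSegments segments may be generated.
     */
    static int expectedSegments(long eligibleUrls, long topN, int maxNumSegments) {
        if (eligibleUrls <= 0 || topN <= 0 || maxNumSegments <= 0) {
            return 0; // nothing to generate
        }
        long needed = (eligibleUrls + topN - 1) / topN; // ceiling division
        return (int) Math.min(needed, (long) maxNumSegments);
    }
}
```

For example, under these assumptions 2500 eligible URLs with -topN 1000 and -maxNumSegments 5 would fit in 3 segments rather than 5.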

Please give it a try and let me know what you think of it.

Julien Nioche
http://www.digitalpebble.com

Attachments


NUTCH-762-v3.patch, 22/Mar/10 10:20, 75 kB, Julien Nioche
NUTCH-762-v2.patch, 06/Mar/10 13:13, 73 kB, Julien Nioche

Issue Links

relates to: NUTCH-1074 topN is ignored with maxNumSegments (Closed)

People

Assignee: Julien Nioche

Reporter: Julien Nioche

Votes: 2

Watchers: 3

Dates

Created: 03/Nov/09 15:03

Updated: 07/Sep/11 09:53

Resolved: 22/Mar/10 16:23