[NUTCH-2627] Fetcher to optionally filter URLs - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Implemented
Affects Version/s: 1.16
Fix Version/s: 1.16
Component/s: fetcher
Labels: None
Flags: Patch

Description

During a large web crawl, a site administrator may request that crawling of a certain domain stop immediately. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb, whereas fetch lists are usually much shorter. Allowing the fetcher to optionally filter URLs would make it possible to apply changed filter rules to the next fetcher job launched, even if the segment has already been created (especially if multiple segments are generated in one turn).
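
The idea can be illustrated with a minimal, self-contained sketch. It does not use Nutch's actual Fetcher or URLFilter classes; the class name `FetchListFilterSketch`, the method `filter`, and the example URLs and exclusion pattern are all hypothetical. It only shows the general technique of re-checking an already generated fetch list against the current exclusion rules just before fetching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch; this is not Nutch's Fetcher or URLFilter API. */
public class FetchListFilterSketch {

    /** Returns the URLs that match none of the exclusion patterns. */
    static List<String> filter(List<String> fetchList, List<Pattern> excludes) {
        List<String> kept = new ArrayList<>();
        for (String url : fetchList) {
            boolean excluded = false;
            for (Pattern p : excludes) {
                if (p.matcher(url).find()) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A segment generated before the exclusion rule was added (made-up URLs).
        List<String> segment = List.of(
            "https://example.org/a",
            "https://blocked.example.com/page",
            "https://example.net/b");
        // Rule added afterwards: skip blocked.example.com and its subdomains.
        List<Pattern> excludes = List.of(
            Pattern.compile("^https?://([a-z0-9-]+\\.)*blocked\\.example\\.com/"));
        System.out.println(filter(segment, excludes));
    }
}
```

Because the check runs over the fetch list rather than the whole CrawlDb, its cost scales with the size of the segment, which is the trade-off the description points to.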

Issue Links

links to

GitHub Pull Request #370

People

Assignee: Sebastian Nagel

Reporter: Sebastian Nagel

Votes: 0

Watchers: 3

Dates

Created: 27/Jul/18 11:51

Updated: 28/Jan/21 13:56

Resolved: 22/Feb/19 14:51