I agree about wanting the decision available not just in the Fetcher while parsing but also when parsing a segment. Here is the problem as I see it with returning null content.
Say we want to build a topical search engine about sports. We fetch pages and run each one through a fetch filter that makes a yes/no decision, based on its content, on whether the page is about sports. If we null out the content, and with it the ParseText and ParseData, we still have the CrawlDatum to deal with. If we leave it as is, the CrawlDatum gets updated into CrawlDb as successfully fetched. Content and parse won't be collected because they are null, and we avoid the problem of that page's outlinks being queued in CrawlDb, but the original URL is still there and will be queued again after its fetch interval. Over time we end up with a large number of URLs that we already know are filtered being repeatedly recrawled.
The decision point isn't just whether to keep the content. It is whether to keep the URL and its content/parse and continue crawling down the path of the URL's outlinks, or to ignore this URL and not crawl anything it points to, breaking the crawl graph at that point. Hence FetchFilter. My solution was to null out content/parse and emit a different CrawlDatum that essentially said the page was gone. Ideally we would have a separate status, but gone worked as a first pass. That gets updated back into CrawlDb and the URL won't be recrawled at a later date. This was only possible in the Fetcher, though. A rough sketch of what I mean is below.
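To make the idea concrete, here is a minimal sketch, assuming a hypothetical FetchFilter extension point. The interface name, the accept() signature, and the applyFilter helper are all illustrative, not existing Nutch APIs; the CrawlDatum handling follows the gone-status approach described above.

import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;

// Hypothetical extension point: decides whether a fetched page stays in the
// crawl (content kept, parsed, outlinks expanded) or is cut out of the graph.
public interface FetchFilter {
  /** Return true to keep the page and follow its outlinks. */
  boolean accept(String url, Content content);
}

// Sketch of how the Fetcher could apply it; the helper and its locals are
// illustrative, not actual Fetcher internals.
class FetchFilterUsage {
  CrawlDatum applyFilter(FetchFilter filter, String url,
                         Content content, CrawlDatum datum) {
    if (filter.accept(url, content)) {
      // Normal path: content/parse are kept, outlinks get queued by updatedb.
      datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
    } else {
      // Filtered out: the caller nulls content/parse downstream, and the
      // datum is marked gone so CrawlDb records it and stops rescheduling it.
      datum.setStatus(CrawlDatum.STATUS_FETCH_GONE);
    }
    return datum;
  }
}

A dedicated status (something like a hypothetical STATUS_FETCH_FILTERED) would make the intent clearer than reusing gone, but the flow through updatedb would be the same.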
Thoughts on how we might approach this?