[NUTCH-1614] Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: indexer
Labels:
- plugin

Patch Info: Patch Available

Description

We need to crawl some pages (such as certain main pages and the different views of a main page) to discover all the other pages, but we don't want to index those pages themselves. The URL filter approach does not work here, because a URL rejected by a URL filter is not crawled at all.

This plugin reads a file of regex strings (see the included sample file). If one of the regex strings matches an entire URL, that URL is excluded from indexing.

The file to use is specified by the following property in nutch-site.xml:

<property>
<name>indexer.url.filter.exclude.regex.file</name>
<value>regex-indexer-exclude-urls.txt</value>
<description>
Holds the file name containing the regex strings. Any URL matching one of these strings will be excluded from indexing.
"#" indicates a comment line and will be ignored.
</description>
</property>
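
The matching rule described above can be sketched as follows. This is a minimal illustration and not the code in the attached patch: the class and method names (`IndexExcludeSketch`, `loadPatterns`, `isExcluded`) and the sample patterns are hypothetical, and it assumes a comment is a line whose first non-blank character is "#". It uses `Matcher.matches()`, which succeeds only when the regex covers the whole URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from the patch.
public class IndexExcludeSketch {

    // Compile one regex per line, skipping blank lines and "#" comment lines.
    static List<Pattern> loadPatterns(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed));
        }
        return patterns;
    }

    // A URL is excluded from indexing when any regex matches the ENTIRE URL.
    static boolean isExcluded(String url, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed contents of an exclude file such as regex-indexer-exclude-urls.txt.
        List<String> lines = Arrays.asList(
            "# main page and its alternate views: crawl, but do not index",
            "https?://example\\.com/",
            "https?://example\\.com/list\\?view=.*");
        List<Pattern> patterns = loadPatterns(lines);

        for (String url : Arrays.asList(
                "http://example.com/",
                "http://example.com/list?view=grid",
                "http://example.com/article/42")) {
            System.out.println(url + " excluded=" + isExcluded(url, patterns));
        }
    }
}
```

Because the whole URL must match, the pattern for the main page does not exclude `http://example.com/article/42`. A pattern meant to cover a group of URLs needs an explicit `.*`, as in the third line of the assumed file.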

Attachments

- IndexerJob.java, 6 kB, 12/Jun/14 09:07, Riyaz Shaik
- RegexUtil.java, 5 kB, 12/Jun/14 09:07, Riyaz Shaik
- NUTCH-1614.patch, 13 kB, 17/Jul/13 17:16, Brian

Issue Links

duplicates: NUTCH-1300 Indexer to filter and normalize URL's (Closed)

People

Assignee: Unassigned

Reporter: Brian

Votes: 0

Watchers: 3

Dates

Created: 17/Jul/13 17:11

Updated: 12/Jun/14 09:13