Issue Details

Key: NUTCH-403
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Sami Siren
Reporter: Sami Siren
Votes: 0
Watchers: 0
Nutch

Make URL filtering optional in Generator

Created: 18/Nov/06 09:35 PM   Updated: 18/Apr/07 03:44 PM
Component/s: generator
Affects Version/s: None
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
nutch-generate-optional-filtering.patch (text file, 13 kB, licensed for inclusion in ASF works), attached 2006-11-18 09:38 PM by Sami Siren

Resolution Date: 19/Nov/06 06:49 PM


Description
Since revision 384219, Generator has applied URL filtering to filter out URLs when generating fetchlists. For use cases where unwanted URLs are filtered out before they enter the crawldb, filtering in Generator is not required and just consumes resources unnecessarily.
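
A minimal sketch of the idea, not the actual Generator code: the class and method names (`OptionalFilterSketch`, `selectForFetchlist`) and the `Predicate`-based filter are hypothetical stand-ins for Nutch's URL filter chain. The point it illustrates is that when filtering is switched off, the filter is never invoked at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not Nutch's Generator implementation.
class OptionalFilterSketch {

    // Returns the URLs that would be placed into a fetchlist.
    // When `filter` is false, urlFilter is never called, so no filtering
    // cost is paid for URLs that were already vetted before entering the crawldb.
    static List<String> selectForFetchlist(List<String> urls,
                                           boolean filter,
                                           Predicate<String> urlFilter) {
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            if (filter && !urlFilter.test(url)) {
                continue; // rejected by the filter
            }
            selected.add(url);
        }
        return selected;
    }
}
```

Calling `selectForFetchlist(urls, false, anyFilter)` returns every input URL unchanged, while passing `true` keeps only the URLs the predicate accepts.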

Comments
Sami Siren added a comment - 18/Nov/06 09:38 PM
The attached patch adds a -noFilter option to the crawl command (and an additional parameter to the Java API) to control whether filtering is applied. The JUnit tests are updated to cover this new functionality.

Sami Siren added a comment - 18/Nov/06 09:40 PM
Correction: the command that is altered is generate (Generator), not crawl.

Andrzej Bialecki added a comment - 19/Nov/06 08:32 AM
Makes sense, +1. The only change I would make is to the name of the new property. I think a full name would be more readable, i.e. "crawl.generate.filter".

Sami Siren added a comment - 19/Nov/06 06:49 PM
Committed to trunk, with the name of the conf parameter changed.
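
For illustration, the -noFilter option and the "crawl.generate.filter" property discussed in the comments above could be tied together roughly as follows. This is a sketch under stated assumptions, not the committed patch: the class `GenerateFilterOptionSketch`, the use of a plain `Map` in place of the real configuration object, the default of filtering being enabled, and the flag taking precedence over the property are all assumptions.

```java
import java.util.Map;

// Hypothetical illustration only; not the code committed for this issue.
class GenerateFilterOptionSketch {

    // Property name suggested in the comments on this issue.
    static final String FILTER_KEY = "crawl.generate.filter";

    // Decides whether URL filtering should run.
    // Assumption: filtering defaults to on, and -noFilter on the
    // command line overrides whatever the property says.
    static boolean filterEnabled(String[] args, Map<String, String> conf) {
        boolean filter = Boolean.parseBoolean(conf.getOrDefault(FILTER_KEY, "true"));
        for (String arg : args) {
            if ("-noFilter".equals(arg)) {
                filter = false;
            }
        }
        return filter;
    }
}
```

With no arguments and an empty configuration this returns `true`, so existing behaviour is preserved unless filtering is explicitly turned off.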