
Key: NUTCH-253
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Rod Taylor
Votes: 0
Watchers: 0
Nutch

Normalize Host during Generate

Created: 24/Apr/06 10:36 AM   Updated: 23/Sep/06 07:00 PM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
nutch_hostnormalize.patch (Text File, 41 kB), attached 2006-04-24 10:39 AM by Rod Taylor. Licensed for inclusion in ASF works.

Resolution Date: 23/Sep/06 07:00 PM


 Description
Extend the URL Normalizer to allow normalization of the hostname during Generate. By default, no rules are applied.

In short, with an appropriate regex, foo.bar.com, bif.baz.bar.com and bar.com are counted as the same host for generate.max.per.host.
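
To illustrate the idea, here is a minimal standalone sketch of regex-based host normalization. It is not the code from the attached patch and does not use any Nutch API: the class name HostNormalizeSketch, the rule pattern, and the countPerHost helper are all hypothetical, chosen only to show how the three hosts above could collapse into a single per-host counter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch or of the attached patch.
public class HostNormalizeSketch {

    // Assumed rule: any chain of subdomain labels in front of bar.com
    // (including none) is rewritten to the bare domain bar.com.
    private static final Pattern RULE =
            Pattern.compile("^(?:[^.]+\\.)*bar\\.com$");
    private static final String SUBSTITUTION = "bar.com";

    // Apply the rule to a host name; hosts that do not match are returned unchanged.
    static String normalizeHost(String host) {
        return RULE.matcher(host).replaceAll(SUBSTITUTION);
    }

    // Count URLs per normalized host, the way a per-host limit would see them.
    static Map<String, Integer> countPerHost(String[] hosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String host : hosts) {
            counts.merge(normalizeHost(host), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] hosts = {"foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org"};
        System.out.println(countPerHost(hosts));
    }
}
```

With a rule like this, all three bar.com hosts share one counter, so a per-host limit would apply to them collectively, while unrelated hosts such as example.org keep their own counter.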

To enable it, add "urlnormalizer-regex" to plugin.includes in nutch-site.xml.

Since several modules now extend the urlnormalizer base, a "scope" parameter in plugin.xml differentiates between the urlnormalizer modules so that the right one is selected for Generate.



 Comments
Andrzej Bialecki added a comment - 23/Sep/06 07:00 PM
This issue is fixed as a part of NUTCH-365 in trunk. Changes were too intrusive to be ported to branch-0.8, although the patch in NUTCH-365 should apply more or less cleanly.