
Key: NUTCH-363
Type: Bug
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Doug Cook
Votes: 0
Watchers: 0
Nutch

Fetcher normalizes everything at least twice

Created: 08/Sep/06 06:47 PM   Updated: 16/Jan/08 07:25 AM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: None

Time Tracking:
Not Specified

Environment: OS X 10.4.7


Description
New links are normalized twice by the fetcher:

First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.

The second time is in ParseOutputFormat.write().

For some URLs (e.g. those repeated on a page) a given URL may be normalized a number of times, but it is always normalized at least twice.

For those of us with expensive normalizations, this is probably burning some CPU.
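To make the cost concrete, here is a minimal sketch of the two-stage behavior described above. It does not use the Nutch classes: `CountingNormalizer` and `LinkPipeline` are hypothetical stand-ins for the normalizer, the Outlink constructor (stage 1), and ParseOutputFormat.write() (stage 2), and the toy normalization (trim plus fragment removal) is an assumption chosen only for illustration.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for an expensive URL normalizer; counts its invocations. */
final class CountingNormalizer implements UnaryOperator<String> {
    int calls = 0;

    @Override
    public String apply(String url) {
        calls++;
        // Toy normalization (an assumption): trim whitespace and drop any fragment.
        String trimmed = url.trim();
        int hash = trimmed.indexOf('#');
        return hash >= 0 ? trimmed.substring(0, hash) : trimmed;
    }
}

/** Hypothetical two-stage pipeline mirroring the parse step and the write step. */
final class LinkPipeline {
    private final UnaryOperator<String> normalizer;
    private final boolean normalizeAtParse;

    LinkPipeline(UnaryOperator<String> normalizer, boolean normalizeAtParse) {
        this.normalizer = normalizer;
        this.normalizeAtParse = normalizeAtParse;
    }

    /** Stage 1: link extraction; normalizes only if configured to, as the outlink constructor did. */
    String extract(String rawUrl) {
        return normalizeAtParse ? normalizer.apply(rawUrl) : rawUrl;
    }

    /** Stage 2: output writing; always normalizes. */
    String write(String url) {
        return normalizer.apply(url);
    }

    /** Runs one URL through both stages. */
    String process(String rawUrl) {
        return write(extract(rawUrl));
    }
}
```

With `normalizeAtParse` set to true, every URL costs two normalizer calls; with it set to false, the result is the same for this toy normalizer but costs one call. Whether dropping the first pass is safe in the real code depends on whether anything between the two stages relies on already-normalized URLs, which is exactly the open question raised below.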

I'd gladly fix this, but I'm not yet familiar enough with the code to know if there are some hidden assumptions which rely on this behavior.

[A related note is that URLs are normalized *before* filtering; this is causing a lot of extra normalization as well. In general, filters may not be safe to run before normalization, but there is likely a class of them which are (filtering out .gif/.jpg etc). Perhaps the notion of a "pre-normalizer filter" would be a useful one?]
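The "pre-normalizer filter" idea could look roughly like the sketch below. This is not an existing Nutch interface: `PreFilterPipeline`, `NOT_IMAGE`, and the `process` signature are hypothetical names, and the suffix check assumes that normalization never turns a non-image URL into one ending in .gif or .jpg (or the reverse).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical pipeline: cheap filter first, then normalize, then the regular filters. */
final class PreFilterPipeline {

    /** Cheap suffix check assumed safe to run on un-normalized URLs. */
    static final Predicate<String> NOT_IMAGE = url -> {
        String lower = url.toLowerCase(Locale.ROOT);
        return !(lower.endsWith(".gif") || lower.endsWith(".jpg"));
    };

    /**
     * Drops URLs rejected by preFilter before paying for normalization,
     * then applies postFilter to the normalized form as usual.
     */
    static List<String> process(List<String> rawUrls,
                                Predicate<String> preFilter,
                                UnaryOperator<String> normalizer,
                                Predicate<String> postFilter) {
        List<String> accepted = new ArrayList<>();
        for (String raw : rawUrls) {
            if (!preFilter.test(raw)) {
                continue; // rejected cheaply; normalizer is never called for this URL
            }
            String normalized = normalizer.apply(raw);
            if (postFilter.test(normalized)) {
                accepted.add(normalized);
            }
        }
        return accepted;
    }
}
```

The post-normalization filter stays in place because a plain suffix check misses cases such as an image URL followed by a query string; the pre-filter only removes the obvious rejects before the expensive step.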



Comments
iwan cornelius added a comment - 16/Jan/08 06:57 AM
Has this been resolved, or has a workaround been found? I'd like to use the normalizer to add a URL to the existing URL, and this 'feature' is creating problems.

Cheers


Emmanuel Joke added a comment - 16/Jan/08 07:25 AM
FYI, the normalization of the link inside the Outlink object has been removed.