
Key: NUTCH-279
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Stefan Neufeind
Votes: 3
Watchers: 0

Nutch

Additions for regex-normalize

Created: 22/May/06 08:09 PM   Updated: 10/Apr/09 12:29 PM
Component/s: None
Affects Version/s: 0.8
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
regex-normalize.patch, 4 kB, 2006-05-22 08:12 PM, Stefan Neufeind (licensed for inclusion in ASF works)
regex-normalize2.patch, 4 kB, 2006-07-09 10:32 PM, Stefan Neufeind (licensed for inclusion in ASF works)
Issue Links:
Incorporates
 

Resolution Date: 03/Feb/09 03:16 PM


Description
IMHO needed:
1) Extend the normalization rules to cover commonly used session IDs etc.
2) Ship a checker so that rules can easily be tested by hand

Stefan Neufeind added a comment - 22/May/06 08:12 PM
1) Incorporates jsessionid-normalization from NUTCH-255
2) Adds further normalizations
3) Adds a command-line checker. Start it with:
bin/nutch org.apache.nutch.net.RegexUrlNormalizerChecker

Stefan Neufeind added a comment - 09/Jul/06 10:32 PM
New patch with just one session-ID regex extended (it now also matches the characters . - ,), since I came across those extra characters when using it on a popular German website (www.bahn.de).
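
To illustrate the kind of rule being discussed, here is a minimal sketch of a session-ID regex whose character class also accepts . - and , in the ID value. The pattern, the class name SessionIdRuleSketch and the example URL are assumptions made for illustration only; this is not the regex from the attached patches.

```java
import java.util.regex.Pattern;

class SessionIdRuleSketch {
    // Hypothetical rule (NOT the pattern from the attached patch):
    // strips ";jsessionid=<value>", where the value may contain letters,
    // digits and the extra characters '.', ',' and '-'.
    // The '-' is placed last in the character class so it is taken literally.
    private static final Pattern JSESSIONID =
        Pattern.compile("(?i);jsessionid=[a-z0-9.,-]+");

    // Returns the URL with every matching session-ID segment removed.
    static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // The ID below deliberately contains '.', '-' and ','.
        System.out.println(
            normalize("http://www.example.com/page;jsessionid=A1B2.c3-d4,e5?x=1"));
    }
}
```

With a narrower class such as [a-z0-9]+, the match would stop at the first . and leave the rest of the ID in the URL, which is presumably the problem the extended regex addresses.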

Andrzej Bialecki added a comment - 03/Feb/09 03:16 PM
Committed with some modifications. All patterns in this patch except one have been added in another commit; the remaining one (-S: ...) IMHO occurs too rarely, and the pattern would be too inclusive. The checking utility has been rewritten to follow a model similar to URLFilterChecker.
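
For readers unfamiliar with that style of tool, the following is a rough sketch of a line-oriented checker, assuming the "similar model" means reading one URL per line from standard input and printing the normalized result. The class NormalizerCheckSketch and the stand-in normalizer are hypothetical; the utility actually committed to Nutch may be structured differently.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.function.UnaryOperator;

class NormalizerCheckSketch {
    // Reads one URL per line, skips blank lines, applies the given
    // normalizer to each URL and prints the result on its own line.
    static void check(BufferedReader in, PrintWriter out,
                      UnaryOperator<String> normalizer) {
        in.lines()
          .map(String::trim)
          .filter(line -> !line.isEmpty())
          .map(normalizer)
          .forEach(out::println);
        out.flush();
    }

    public static void main(String[] args) {
        // Stand-in normalizer (an assumption, not a Nutch rule):
        // removes a ";jsessionid=..." segment from the URL.
        UnaryOperator<String> normalizer =
            url -> url.replaceAll("(?i);jsessionid=[a-z0-9.,-]+", "");
        check(new BufferedReader(new InputStreamReader(System.in)),
              new PrintWriter(System.out), normalizer);
    }
}
```

Passing the normalizer in as a parameter keeps the read-and-print loop independent of any particular rule set, so the same loop can be exercised with different rules.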

Hudson added a comment - 04/Feb/09 04:10 AM
Integrated in Nutch-trunk #714 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/714/)
Additions to urlnormalizer-regex (modified).