Stefan Neufeind made changes - 22/May/06 08:12 PM
Stefan Neufeind made changes - 22/May/06 08:13 PM
New patch with just one session-ID regex extended (it now also matches the characters . - and ,), since I came across those extra characters in session IDs on a popular German website (www.bahn.de).
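To illustrate why the wider character class matters, here is a minimal sketch. The two patterns and the example URL are assumptions made up for illustration, not the expressions from the patch. It shows how a session-ID regex limited to letters and digits leaves part of the ID behind when the ID contains . - or , characters.

```java
import java.util.regex.Pattern;

public class SessionIdSketch {
    // Hypothetical "before" pattern: the session ID may contain only letters and digits.
    private static final Pattern NARROW =
        Pattern.compile("(?i);jsessionid=[a-z0-9]*");

    // Hypothetical "after" pattern: the ID may also contain '.', ',' and '-'.
    private static final Pattern EXTENDED =
        Pattern.compile("(?i);jsessionid=[a-z0-9.,-]*");

    public static String stripNarrow(String url) {
        return NARROW.matcher(url).replaceAll("");
    }

    public static String stripExtended(String url) {
        return EXTENDED.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/page;jsessionid=AB12.cd-34,ef?x=1";
        // The narrow pattern stops at the first '.', leaving debris in the URL.
        System.out.println(stripNarrow(url));   // http://www.example.com/page.cd-34,ef?x=1
        // The extended pattern removes the whole session ID.
        System.out.println(stripExtended(url)); // http://www.example.com/page?x=1
    }
}
```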
Stefan Neufeind made changes - 09/Jul/06 10:32 PM
Andrzej Bialecki made changes - 03/Feb/09 03:16 PM
Committed with some modifications. All patterns in this patch except one have been added in another commit; the remaining one (-S: ...) IMHO occurs too rarely, and the pattern would be too inclusive. The checking utility has been rewritten to follow a model similar to URLFilterChecker.
Integrated in Nutch-trunk #714 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/714/)
Additions to urlnormalizer-regex (modified).
NUTCH-255
2) Adds further normalizations
3) Adds a command-line checker. Start with:
bin/nutch org.apache.nutch.net.RegexUrlNormalizerChecker
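For readers unfamiliar with this kind of checker, the sketch below shows the general idea under stated assumptions: it reads one URL per line from standard input, applies an ordered list of regex substitutions, and prints each normalized URL. The class name, the two rules, and the stdin-based model are hypothetical stand-ins, not the code or rules of RegexUrlNormalizerChecker.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NormalizerCheckerSketch {
    // One regex substitution rule (hypothetical; not read from any Nutch config file).
    static final class Rule {
        final Pattern pattern;
        final String substitution;
        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    // Illustrative rules, applied in order.
    private static final List<Rule> RULES = Arrays.asList(
        new Rule("(?i);jsessionid=[a-z0-9.,-]*", ""), // drop a session ID
        new Rule("\\?$", "")                          // drop a dangling '?'
    );

    // Apply every rule in sequence to a single URL.
    public static String normalize(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    // Read URLs line by line, skip blank lines, print the normalized form.
    public static void check(BufferedReader in, PrintStream out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            out.println(normalize(line));
        }
    }

    public static void main(String[] args) throws IOException {
        check(new BufferedReader(
                  new InputStreamReader(System.in, StandardCharsets.UTF_8)),
              System.out);
    }
}
```

Run standalone, the sketch would be used by piping URLs into it, for example `echo "http://www.example.com/a;jsessionid=X1?" | java NormalizerCheckerSketch`.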