Affects Version/s: 0.8
Fix Version/s: None
Windows XP Media Center 2005, 2 GB RAM, 3.0 GHz Pentium 4 (Hyper-Threading), Eclipse 3.2.0
By default, the crawl URL filter rejects URLs that contain certain special characters. URLs carrying a jsessionid are among them, for example:
We want to strip the jsessionid and keep everything else, so that the URL looks like this:
Below is a regular expression for the regex-normalize.xml file used by the RegexUrlNormalizer that successfully removes jsessionid strings while leaving the hostname and query string intact. I have also attached a patch for the regex-normalize.xml.template file that adds this expression.
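
For anyone who wants to try the idea outside of Nutch, here is a minimal standalone sketch of this kind of normalization using plain java.util.regex. It is not the expression from the attached patch: the pattern, the class name JsessionidStripper, and the example.com URL are all assumptions made for illustration. The pattern assumes the session id is attached as ";jsessionid=<value>" and that the value ends at the next "?", "#", "&", ";" or at the end of the URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: strips ";jsessionid=..." from a URL.
final class JsessionidStripper {

    // Assumed pattern (not the one from the patch): case-insensitive match of
    // ";jsessionid=" followed by the id, stopping before '?', '#', '&' or ';'.
    private static final Pattern JSESSIONID =
            Pattern.compile("(?i);jsessionid=[^?#&;]*");

    // Returns the URL with every jsessionid path parameter removed;
    // the host, path and query string are left untouched.
    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up example URL.
        String url = "http://www.example.com/page.jsp;jsessionid=ABC123?id=5";
        System.out.println(strip(url));
    }
}
```

In regex-normalize.xml the same idea would presumably be expressed as a pattern with an empty substitution. Test any such expression against URLs from your own crawl, since containers differ in how they append the session id.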