Description
Hey,
I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:
<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>.
This pattern also rewrites a URL fragment such as
"&newsId=2000484784794&newsLang=en" to "&new&newsLang=en", because the
case-insensitive 'sid' alternative matches the 'sId' inside 'newsId'. The
rewritten URL is incorrect, so the intended page does not get fetched.
The expression needs to be changed to prevent this.
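The behaviour can be reproduced outside Nutch with plain java.util.regex. The sketch below is only an illustration: it assumes the substitution configured for this rule is "$4" (the trailing delimiter) and that the last group begins with an escaped question mark. The class name, the PROPOSED pattern and the example.com URL are hypothetical, and the lookbehind is one possible way to stop matching inside a longer parameter name, not necessarily the fix that should go into regex-normalize.xml.

```java
import java.util.regex.Pattern;

class SessionIdNormalizerSketch {
    // Pattern as quoted in this report (assumption: "\?" starts the last group).
    static final Pattern CURRENT = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical variant: the lookbehind rejects a match whose parameter
    // name is directly preceded by a letter or digit (e.g. "new" + "sId").
    static final Pattern PROPOSED = Pattern.compile(
        "([;_]?(?<![A-Za-z0-9])((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumed substitution: keep only the delimiter captured by group 4.
    static final String SUBSTITUTION = "$4";

    static String normalize(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String news = "&newsId=2000484784794&newsLang=en";
        String session = "http://example.com/page;jsessionid=ABC123?x=1";

        // Current pattern mangles the newsId parameter: &new&newsLang=en
        System.out.println(normalize(CURRENT, news));
        // Proposed pattern leaves it untouched.
        System.out.println(normalize(PROPOSED, news));
        // Both patterns still strip a real session id: http://example.com/page?x=1
        System.out.println(normalize(CURRENT, session));
        System.out.println(normalize(PROPOSED, session));
    }
}
```

Note that inside regex-normalize.xml the ampersand in the pattern has to be written as an XML entity, so any replacement pattern would need the same escaping there.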
Thanks,
Meghna
Issue Links
- is duplicated by NUTCH-1328: a problem with regex-normalize.xml (Closed)