[NUTCH-706] Url regex normalizer: default pattern for session id removal not to match "newsId" - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.6, 2.2
Component/s: None
Labels: None

Description

Hey,

I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:

This pattern also transforms a URL such as
"&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
matches 'sId' in 'newsId'). The resulting URL is incorrect, so the page
does not get fetched. The expression needs to be changed to prevent this.

Thanks,
Meghna
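
The behaviour described above can be reproduced with a minimal sketch. The pattern below is a simplified, hypothetical stand-in, not the expression shipped in regex-normalize.xml, and the anchored variant is only one possible way to avoid the false match, not necessarily what the attached patches do. The class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

final class SessionIdDemo {
    // Hypothetical, simplified session-id pattern (an assumption, not the
    // real Nutch default): matches "sid=", "phpsessid=" or "sessionid="
    // anywhere, case-insensitively, up to the next delimiter.
    private static final Pattern NAIVE =
        Pattern.compile("(?i)(sid|phpsessid|sessionid)=.*?(\\?|&|#|$)");

    // Same pattern, but the parameter name must not be preceded by a letter
    // or digit, so the "sId" inside "newsId" is no longer treated as a
    // session-id parameter.
    private static final Pattern ANCHORED =
        Pattern.compile("(?i)(?<![A-Za-z0-9])(sid|phpsessid|sessionid)=.*?(\\?|&|#|$)");

    // Remove the matched parameter, keeping the trailing delimiter (group 2).
    static String naive(String url) {
        return NAIVE.matcher(url).replaceAll("$2");
    }

    static String anchored(String url) {
        return ANCHORED.matcher(url).replaceAll("$2");
    }

    public static void main(String[] args) {
        String url = "&newsId=2000484784794&newsLang=en";
        System.out.println(naive(url));    // &new&newsLang=en (the reported bug)
        System.out.println(anchored(url)); // &newsId=2000484784794&newsLang=en
        // A genuine "sid" parameter is still stripped by the anchored variant.
        System.out.println(anchored("&sid=abc123&page=2"));
    }
}
```

The negative lookbehind is a deliberately small change: it leaves the list of recognised session-id names untouched and only restricts where a name may start.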

Attachments

NUTCH-706-2.patch | 08/Aug/12 21:49 | 3 kB | Sebastian Nagel
NUTCH-706.patch | 10/Jul/12 21:29 | 2 kB | Sebastian Nagel

Issue Links

is duplicated by: NUTCH-1328 a problem with regex-normalize.xml (Closed)

People

Assignee: Unassigned

Reporter: Meghna Kukreja

Votes: 0

Watchers: 4

Dates

Created: 27/Feb/09 18:47

Updated: 22/May/13 03:54

Resolved: 10/Oct/12 21:14