
Key: NUTCH-243
Type: Bug
Status: Closed
Resolution: Duplicate
Priority: Trivial
Assignee: Unassigned
Reporter: Dennis Kubes
Votes: 0
Watchers: 0
Nutch

Some meta-refresh urls get ignored due to matching regular expression

Created: 05/Apr/06 05:35 AM   Updated: 17/Mar/08 04:49 PM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Incorporates
 

Resolution Date: 17/Mar/08 04:49 PM


Description
When pages with meta-refresh tags are fetched, the refresh URL is taken at face value without any filtering. Some URLs, such as those generated by Struts, come back with a jsessionid or a query string. Examples:

http://www.somesite.com;jsessionid=3123123412ADBE3344...
http://www.somesite.com?querystring=value

The RegexURLFilter matches these URLs against the following rule in the regex-urlfilter.txt file:

-[?*!@=]

Should these URLs be cleaned up so that they no longer match the filter rule above and can be processed, or should they continue to be ignored, as they are now?
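
To illustrate why both example URLs are dropped, here is a minimal standalone sketch. It does not use Nutch's own classes: the class name MetaRefreshFilterDemo and the helper isRejected are hypothetical, and it assumes that the leading "-" marks a reject rule and that the remaining character class is applied as an unanchored search, so a URL is rejected if it contains any of the characters ? * ! @ = anywhere.

```java
import java.util.regex.Pattern;

public class MetaRefreshFilterDemo {
    // Character class taken from the "-[?*!@=]" rule. The leading "-" is
    // assumed to be the filter's reject marker, not part of the regex.
    private static final Pattern REJECT = Pattern.compile("[?*!@=]");

    // Hypothetical helper: true when the URL contains any rejected character.
    // find() is used on the assumption that the rule is an unanchored match.
    static boolean isRejected(String url) {
        return REJECT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            // The two examples from the description, copied as written.
            "http://www.somesite.com;jsessionid=3123123412ADBE3344...",
            "http://www.somesite.com?querystring=value",
            // A plain URL with none of the rejected characters, for contrast.
            "http://www.somesite.com/index.html"
        };
        for (String url : urls) {
            String verdict = isRejected(url) ? "rejected: " : "accepted: ";
            System.out.println(verdict + url);
        }
    }
}
```

Under these assumptions the jsessionid URL is rejected because of its "=" and the query-string URL because of both "?" and "=", while the plain URL passes.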



Dennis Kubes added a comment - 26/Apr/06 02:55 AM
This is resolved by NUTCH-255.

Andrzej Bialecki added a comment - 17/Mar/08 04:49 PM
Duplicate of NUTCH-255.