Key: NUTCH-233
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Unassigned
Reporter: Stefan Groschupf
Votes: 0
Watchers: 0
Nutch

Wrong regular expression hangs the reduce process forever

Created: 16/Mar/06 11:09 AM   Updated: 10/Mar/07 02:41 AM
Component/s: None
Affects Version/s: 0.8
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

Resolution Date: 09/Mar/07 10:42 PM


Description
It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt is not compatible with java.util.regex, which the regex URL filter now uses.
The expression was probably not updated when the regular expression package was changed.
The symptom is that, while reducing the fetch map output, the reducer hangs forever, because the output format applies the URL filter to a URL that triggers the hang.
060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
However, people may want to review it and suggest improvements, because the two expressions are not equivalent. Both the old and the new regex match:
"abcd/foo/bar/foo/bar/foo/"
But the old regex also matches:
"abcd/foo/bar/xyz/foo/bar/foo/"
which the new regex does not match.

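The difference between the two expressions can be checked with a small standalone program. This is only a sketch under stated assumptions: the class and method names are hypothetical, the old expression is written with its `*` quantifiers (which the ticket text may have lost), and the pattern is applied with `Matcher.find()`, which is assumed to be how the URL filter applies it. The leading `-` that marks an exclude rule in regex-urlfilter.txt is left out because it is not part of the regex itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class UrlFilterRegexDemo {
    // Old expression (assumed form): reluctant ".+?" and ".*?" may span several path segments.
    static final Pattern OLD_RULE = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // New expression from the ticket: "[^/]+" confines each piece to a single path segment.
    static final Pattern NEW_RULE = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a rule "hits" a URL when the pattern is found anywhere in it.
    static boolean hits(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "abcd/foo/bar/foo/bar/foo/",     // segment "/foo" repeats with one segment in between
            "abcd/foo/bar/xyz/foo/bar/foo/"  // first gap between repeats spans two segments
        };
        for (String url : urls) {
            System.out.println(url
                + " old=" + hits(OLD_RULE, url)
                + " new=" + hits(NEW_RULE, url));
        }
    }
}
```

Because the new expression allows exactly one path segment between repetitions, it has far fewer ways to split a URL than the old one, which is the likely reason it avoids the long-running match, at the cost of missing loops with longer gaps.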


Subversion Commits
Repository Revision Date User Message
ASF #516592 Fri Mar 09 22:41:24 UTC 2007 kubes NUTCH-233 resolved. Patch supplied by Stefan Groschupf. Thanks Stefan.
Files Changed
MODIFY /lucene/nutch/trunk/conf/crawl-urlfilter.txt.template
MODIFY /lucene/nutch/trunk/conf/regex-urlfilter.txt.template

Repository Revision Date User Message
ASF #516759 Sat Mar 10 18:03:07 UTC 2007 kubes Updated to reflect commits of NUTCH-233 and NUTCH-436.
Files Changed
MODIFY /lucene/nutch/trunk/CHANGES.txt

Repository Revision Date User Message
ASF #516835 Sun Mar 11 01:34:42 UTC 2007 kubes Placed NUTCH-233 and NUTCH-436 into the correct order in the file. :(
Files Changed
MODIFY /lucene/nutch/trunk/CHANGES.txt