Sorry, I don't have such a URL, since the problem only shows up during the reduce phase of a fetch. The reduce phase provides no logging, and the map data is deleted if the job fails because of a timeout.
Sami Siren made changes - 25/Jul/06 07:39 PM
I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl of more than 100 million pages will run into this problem. The problem URLs come, for example, from spam-bot-generated links.
I haven't noticed this regexp being a problem so far either, but maybe I've just been lucky not to have run into a bot-trap site yet. Is this still a problem for you, Stefan?
Hi Otis,
Yes, for a serious whole-web crawl I need to change this regex first. It hangs only on certain URLs, for example those from link farms that the crawler runs into. Could I suggest that this change, from ".(/.?)/.?\1/.?\1/" to ".(/[^/])/[^/]+\1/[^/]+\1/", be committed to at least trunk for the time being?
I recently created a segment with exactly 1M URLs. I ran the fetch, and it did indeed stall in the reduce part of the operation because of the regex filter. This was verified with a thread dump (kill -3 <pid>) on FreeBSD. I then made the suggested change in the config file and re-fetched the exact same segment; it completed without issue. I'm aware we might lose some filtering functionality with this new expression, but is that not better than knowing there is always a chance that your whole-web crawl fetch will fail because of this?
The new regex has been added to both the regex-urlfilter.txt and the crawl-urlfilter.txt files.
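For illustration, here is a minimal Java sketch of how the two expressions differ. The exact quantifiers are an assumption: the sketch takes the old rule to be `.*(/.+?)/.*?\1/.*?\1/` and the new one to be `.*(/[^/]+)/[^/]+\1/[^/]+\1/`, and it assumes the filter applies them with `Matcher.find()`. The class name, method names, and sample URLs are hypothetical and are not taken from the Nutch code base.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: compares the assumed old and new
// "repeated path segment" expressions on a few sample URLs.
class RepeatedSegmentSketch {

    // Assumed old rule: lazy wildcards may span any number of path
    // segments, which gives the engine many ways to backtrack.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Assumed new rule: [^/]+ cannot cross a slash, so every piece is
    // confined to a single path segment.
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    static boolean matchesOld(String url) {
        return OLD.matcher(url).find();
    }

    static boolean matchesNew(String url) {
        return NEW.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/b/a/c/a/",     // segment repeats with one segment in between
            "http://example.com/a/x/y/a/z/w/a/", // segment repeats with two segments in between
            "http://example.com/a/b/c/d/"        // no repeated segment
        };
        for (String url : urls) {
            System.out.println(url + " old=" + matchesOld(url) + " new=" + matchesNew(url));
        }
    }
}
```

Under these assumptions, the second sample URL is caught only by the old expression, which is the kind of filtering the new expression gives up in exchange for confining each part of the match to a single path segment.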
Dennis Kubes made changes - 09/Mar/07 10:42 PM
Dennis Kubes made changes - 10/Mar/07 02:41 AM
I have created a small unit test for urlfilter-regexp, and I don't notice any incompatibility in java.util.regex with this regexp. Could you please provide the URLs that cause the problem so that I can add them to my unit tests?
Thanks
Jérôme