Philippe EUGENE added a comment - 13/Jan/06 06:20 PM
Patch for 0.7 and 0.7.1 version
Patch for 0.8-dev version
Couldn't you instead use a prefix-urlfilter generated from your crawl seed?

I have more than 5,000 hosts in my directory, and I'm not sure how crawl performance would hold up with more than 5,000 rules.
It's easier for me to just manage a boolean value in the Nutch conf. I know this is not the natural way to crawl with Nutch, but it could be of interest to some Nutch users. The most important problem: scoring from external links is affected by this patch.

We are TENS of Nutch users using this precious patch.
Most Nutch users are not building a whole-web search engine (too much hardware needed) but want to develop dedicated search engines. We sometimes crawl 1,000, sometimes 25,000 web servers, and having 25,000 entries in prefix-urlfilter really slows down the crawling. This patch is NEEDED!
Christophe Noël

+1, with a few modifications.
Can you please re-generate this against the current sources? This patch does not apply for me. Also, fromHost should only be computed if crawl.ignore.external.links is true. Finally, please add an entry to conf/nutch-default.xml for the new parameter in your patch. Thanks!

Applies fine and works for me on 0.7.2.

Here is the 0.8 patch, corrected to work against the nightly build from 2006-05-20.
fromHost is now only generated when it is really needed, and nutch-default.xml is patched as well. By the way: where should a property for "crawl" be located in the config file? In the "fetcher" section? If so, could somebody please move it up or down, or rename the property, before including it in the dev tree? And could somebody please review it quickly? I'm not sure it's 100% correct. Still investigating on my side ...

Patch applied to trunk/. Thank you!
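
For readers following the thread, the behavior under discussion can be sketched roughly as below. This is a minimal illustration in plain Java, not the code from the attached patches: the class name ExternalLinkFilterSketch, the methods filterOutlinks and hostOf, and the exact host comparison are assumptions made for the example. Only the property name crawl.ignore.external.links and the idea of computing fromHost only when the flag is set come from the comments above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; names and structure are not taken from the Nutch patch.
class ExternalLinkFilterSketch {

    /**
     * Returns the outlinks to keep for a page.
     * ignoreExternalLinks stands in for the boolean read from
     * crawl.ignore.external.links in the Nutch configuration.
     */
    static List<String> filterOutlinks(String fromUrl,
                                       List<String> outlinks,
                                       boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            // Flag is off: keep everything and never compute fromHost.
            return new ArrayList<>(outlinks);
        }
        // Flag is on: compute the source host once, only now that it is needed.
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String toHost = hostOf(link);
            // Keep a link only if it stays on the same host as the source page.
            if (toHost != null && toHost.equals(fromHost)) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Lower-cased host of a URL, or null if it cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With the flag on, a page on one host keeps only outlinks to that same host, which replaces thousands of per-host prefix rules with a single comparison per link. As noted earlier in the thread, this also means links from other hosts no longer contribute to scoring.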