Does the existing generate.max.per.host parameter not meet this need?
To my knowledge, no. I believe the generate.max.per.host parameter merely restricts the number of URLs per host that can be in a given fetchlist. So on an infinite crawler trap, your crawler won't choke on an infinitely large fetchlist, but will instead continue gnawing away (infinitely) at the URL space...
btw, I'd love to be proven wrong, because if generate.max.per.host works as a hard URL cap per site, I could be sleeping better quite soon.
Oh, I just discovered this new parameter was added in 0.8-dev.
But to my understanding of the description in nutch-default.xml, this only applies "per fetchlist". And that would mean "for one run", right? So if I set this to 100 and fetch 10 rounds, I'd have at most 1000 documents? But what if there is (theoretically) one document on the first level with 200 links in it? In that case I suspect they are all written to the webdb as "to-do" in the first run; in the next run the first 100 are fetched and the rest skipped, and in another round the next 100 are fetched? Is that right?
My idea was also to have this as a "per host" or "per site" setting, or to be able to override the value for a certain host ...

The generate.max.per.host parameter does work, but with the following limitations that we've run into:
1. The current code uses the entire hostname when deciding max links/host. There are a lot of spammy sites out there that have URLs of the form xxxx-somedomain.com, where xxxx is essentially a random number. We've got code that does a better job of deriving the true "base" domain name, but then there's...
2. Sites that actually have many IP addresses (not sure if they're in a common subnet block or not), where the domain name is xxxx-somedomain.com.

Because of these two link farm techniques, we ran into cases of 100K links essentially being fetched from the same spam-laden domain, even with a generate.max.per.host setting of 50, after about 40+ loops. And what's really unfortunate is that many of these sites are low-bandwidth hosters in Korea and China, so your crawl speed drops dramatically because you're spending all your time waiting for worthless bytes to arrive.

Ok, I just re-read Generator.java ( http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?view=markup )
So it caps URLs/host going into this fetchlist, not total URLs/host. That's what I thought, and it is insufficient for the reasons stated above. (It will incrementally fetch everything.) If the cap is 50k and a host has 70k active URLs in the crawldb, what Generate needs to say is "Here are the first 50k URLs added for this site, and I see only 3 are scheduled. We'll put 3 in this fetchlist." Generate can only enforce a limit if it knows which 50k were first added to the db, and never fetch

Hmm... it seems straightforward to modify Generator.java to count total URLs/host during map(), regardless of fetchTime. But I don't see what action we could take besides halting all fetches for the site. We'd have to traverse the crawldb in order of record-creation time to be able to see which were the first N added to the crawldb. (I think the crawldb is sorted by url, not ctime.)

I don't think you can get what you want from any change to either of the map-reduce jobs that Generate is composed of.
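To make the difference between a per-fetchlist cap and a total cap concrete, here is a minimal simulation in Python. It is not Nutch code: `generate_fetchlist` and `crawl` are hypothetical names, and the model assumes every pending URL is due for fetching in every round. It uses the case raised earlier in the thread, one page with 200 links on a single host and a cap of 100.

```python
from collections import defaultdict


def generate_fetchlist(pending, max_per_host):
    """Pick at most max_per_host pending (host, url) pairs per host for ONE fetchlist."""
    taken = defaultdict(int)
    fetchlist = []
    for host, url in pending:
        if taken[host] < max_per_host:
            taken[host] += 1
            fetchlist.append((host, url))
    return fetchlist


def crawl(pending, max_per_host, rounds):
    """Run up to `rounds` generate/fetch cycles and return everything fetched."""
    pending = list(pending)
    fetched = []
    for _ in range(rounds):
        batch = generate_fetchlist(pending, max_per_host)
        if not batch:
            break  # nothing left to fetch
        fetched.extend(batch)
        chosen = set(batch)
        # URLs skipped this round stay pending and are picked up next round.
        pending = [p for p in pending if p not in chosen]
    return fetched


# Hypothetical data: 200 links on one host, cap of 100 per fetchlist.
links = [("example.com", f"http://example.com/page{i}") for i in range(200)]
fetched = crawl(links, max_per_host=100, rounds=10)
print(len(fetched))  # prints 200
```

Under these assumptions the cap only spreads the 200 URLs over two rounds. It never reduces the total fetched from the host, which is the behaviour described above.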
What you might need to do is write another map-reduce job to precede the Generate.

Agreed that it's looking tough to do in Generate. Alternatively, we can try to keep the excess URLs from ever entering the crawldb in CrawlDb.update(). (This has its own issues, noted above...)
Scratch my last comment.
(Wow. This means the crawldb will accumulate lots of junk URLs over time. Is this a feature or a bug?)

In 0.8, URLs are filtered both when generating and when updating the DB. Strictly speaking, the filters are only required when updating the DB, but they are also applied during generation to allow for changes to the filters. URLs are also filtered during fetching when following redirects.
Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls. (Should have an option to turn off?)
I also see that URLFilters.filter() is applied in Fetcher (for redirects) and ParseOutputFormat, plus other tools. Another possible choke-point: CrawlDbMerger.Merger.reduce(). The key is the URL, and the keys are sorted. You can veto crawldb additions here. Could you effectively count URLs/host here? (Not sure when distributed.) Would it require setting a Partitioner, like crawl.PartitionUrlByHost?

>Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy,
>albeit costly for large crawls. (Should have an option to turn off?)
Url filtering inside generator has been made optional in
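On the earlier question of whether URLs/host can be counted in a distributed reduce: the sketch below illustrates why the partitioning matters. It is a plain-Python model, not Hadoop or Nutch code. `partition_by_url`, `partition_by_host`, and `run_job` are hypothetical names, and CRC32 merely stands in for whatever hash a real partitioner would use.

```python
import zlib
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Hostname of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def partition_by_url(url, n):
    # Hashing the whole URL scatters one host's URLs across reducers.
    return zlib.crc32(url.encode("utf-8")) % n


def partition_by_host(url, n):
    # Hashing only the host sends all of a host's URLs to one reducer.
    return zlib.crc32(host_of(url).encode("utf-8")) % n


def run_job(urls, n, partition, max_per_host):
    """Model n reducers, each counting URLs/host locally and vetoing the excess."""
    buckets = [[] for _ in range(n)]
    for url in urls:
        buckets[partition(url, n)].append(url)
    kept = []
    for bucket in buckets:
        counts = Counter()  # each reducer only sees its own bucket
        for url in sorted(bucket):
            host = host_of(url)
            if counts[host] < max_per_host:
                counts[host] += 1
                kept.append(url)
    return kept


# Hypothetical data: 20 URLs on one host, 4 reducers, cap of 5.
urls = [f"http://trap.example/p{i}" for i in range(20)]
by_host = run_job(urls, 4, partition_by_host, max_per_host=5)
by_url = run_job(urls, 4, partition_by_url, max_per_host=5)
print(len(by_host), len(by_url))
```

With host-based partitioning a single reducer sees every URL for the host, so its local count is the global count and exactly 5 URLs survive. With URL-based partitioning each reducer enforces the cap only on its own share, so the total kept can exceed the cap.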
Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 these are "nutch generate" and "nutch updatedb".
URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that "updatedb" could only compute URLs/domain after traversing the crawldb, during which time it merges the new URLs.
So, one way to approach it is:
If you do it all in "update", you won't catch URLs added via "inject", but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.
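As a rough illustration of capping at update time, the following Python sketch merges newly discovered URLs into an in-memory stand-in for the crawldb and rejects additions once a host has reached its cap. It is only a model of the idea: `update_db` and `host_of` are hypothetical names, a real crawldb is not a Python set, and the sketch assumes per-host counts are known before the new URLs are merged.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Hostname of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def update_db(crawldb, new_urls, max_per_host):
    """Merge new_urls into crawldb (a set of URLs) with a hard per-host cap.

    Existing entries are never evicted, so the first URLs added for a host
    are the ones that stay. Returns the list of rejected URLs.
    """
    counts = Counter(host_of(u) for u in crawldb)
    rejected = []
    for url in new_urls:
        if url in crawldb:
            continue  # already known, does not count twice
        host = host_of(url)
        if counts[host] >= max_per_host:
            rejected.append(url)  # host is at its cap: veto the addition
        else:
            crawldb.add(url)
            counts[host] += 1
    return rejected


# Hypothetical data: a trap host offering 70 URLs against a cap of 50.
db = set()
trap = [f"http://trap.example/p{i}" for i in range(70)]
rejected = update_db(db, trap, max_per_host=50)
print(len(db), len(rejected))  # prints 50 20
```

Because excess URLs never enter the db, there is no need to know later which entries were added first. As noted above, URLs added through "inject" would bypass a check placed only in the update step.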