Nutch / NUTCH-272

Max. pages to crawl/fetch per site (emergency limit)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If I'm right, there is currently no way to set an "emergency limit" capping the maximum number of pages fetched per site. Is there an "easy" way to implement such a limit, maybe as a plugin?

        Activity

        Matt Kangas added a comment -

        I've been thinking about this after hitting several sites that explode into 1.5 M URLs (or more). I could sleep easier at night if I could set a cap at 50k URLs/site and just check my log files in the morning.

        Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 these are "nutch generate" and "nutch updatedb".

        URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that "updatedb" can only compute URLs/domain after traversing the crawldb, during which it merges in the new URLs.

        So, one way to approach it is:

        • Count URLs/domain during "update". If a domain exceeds the limit, write its name to a file.
        • Read this file at the start of "update" (next cycle) and block further additions.
        • Or: read the file in a new URLFilter plugin and block those URLs in URLFilter.filter().

        If you do it all in "update", you won't catch URLs added via "inject", but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.
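        A minimal sketch of the URLFilter variant described above, assuming a hypothetical plain-text blocklist of over-limit hosts written by the previous "updatedb" pass; the class name and the "db.overlimit.hosts.file" property are illustrative, not existing Nutch code, and the Configurable methods assume the 0.8-era plugin interfaces:

        // Hypothetical URLFilter that rejects URLs whose host appears in a
        // blocklist produced by a previous "updatedb" pass. Not real Nutch code;
        // the property name below is made up for this sketch.
        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.net.MalformedURLException;
        import java.net.URL;
        import java.util.HashSet;
        import java.util.Set;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.net.URLFilter;

        public class OverLimitHostFilter implements URLFilter {

          private final Set<String> blockedHosts = new HashSet<String>();
          private Configuration conf;

          public void setConf(Configuration conf) {
            this.conf = conf;
            String path = conf.get("db.overlimit.hosts.file");  // hypothetical property
            if (path == null) return;
            try {
              BufferedReader in = new BufferedReader(new FileReader(path));
              String line;
              while ((line = in.readLine()) != null) {
                blockedHosts.add(line.trim().toLowerCase());
              }
              in.close();
            } catch (IOException e) {
              // Missing blocklist: block nothing rather than fail the whole job.
            }
          }

          public Configuration getConf() { return conf; }

          // URLFilter contract: return the URL to keep it, null to reject it.
          public String filter(String urlString) {
            try {
              String host = new URL(urlString).getHost().toLowerCase();
              return blockedHosts.contains(host) ? null : urlString;
            } catch (MalformedURLException e) {
              return null;  // drop unparseable URLs
            }
          }
        }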

        Doug Cutting added a comment -

        Does the existing generate.max.per.host parameter not meet this need?

        Matt Kangas added a comment -

        To my knowledge, no. I believe the "generate.max.per.host" parameter merely restricts the URLs/host that can be in a given fetchlist. So on an infinite crawler trap, your crawler won't choke on an infinitely-large fetchlist, but will instead continue gnawing away (infinitely) at the URL space...

        Matt Kangas added a comment -

        BTW, I'd love to be proven wrong, because if the "generate.max.per.host" parameter works as a hard URL cap per site, I could be sleeping better quite soon.

        Stefan Neufeind added a comment -

        Oh, I just discovered this new parameter was added in 0.8-dev.

        But to my understanding of the description in nutch-default.xml, this only applies "per fetchlist", which would mean "for one run", right? So if I set this to 100 and fetch 10 rounds, I'd get at most 1000 documents? But what if there is one document on the first level (theoretically) with 200 links in it? In that case I suspect they are all written to the webdb as "to-do" in the first run, in the next run the first 100 are fetched with the rest skipped, and in another round the next 100 are fetched? Is that right?

        My idea was also to have this as a "per host" or "per site" setting, or to be able to override the value for a certain host...

        Ken Krugler added a comment -

        The generate.max.per.host parameter does work, but with the following limitations that we've run into:

        1. The current code uses the entire hostname when deciding max links/host. There are a lot of spammy sites out there that have URLs with the form xxxx-somedomain.com, where xxxx is essentially a random number.

        We've got code that does a better job of deriving the true "base" domain name, but then there's...

        2. Sites that actually have many IP addresses (not sure if they're in a common subnet block or not), where the domain name is xxxx-somedomain.com.

        Because of these two link farm techniques, we ran into cases of 100K links essentially being fetched from the same spam-laden domain, even with a generate.max.per.host setting of 50, after about 40+ loops.

        And what's really unfortunate is that many of these sites are low-bandwidth hosters in Korea and China, so your crawl speed drops dramatically because you're spending all your time waiting for worthless bytes to arrive.
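        A crude sketch of the hostname-to-"base"-domain reduction mentioned in point 1; keeping only the last two labels is an oversimplification (a real implementation needs a public-suffix list for TLDs like .co.uk), and it does not address the randomized domains in point 2. Class and method names are illustrative:

        // Crude base-domain extraction: keep the last two labels of the hostname.
        // Illustrative only; production code needs a public-suffix list so that
        // e.g. "example.co.uk" is treated as one registered domain.
        public final class BaseDomain {

          private BaseDomain() {}

          public static String of(String host) {
            String lower = host.toLowerCase();
            String[] labels = lower.split("\\.");
            if (labels.length <= 2) {
              return lower;
            }
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
          }

          public static void main(String[] args) {
            System.out.println(of("www1234.spammy-example.com"));  // spammy-example.com
            System.out.println(of("example.com"));                 // example.com
          }
        }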

        Matt Kangas added a comment -

        Ok, I just re-read Generator.java ( http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?view=markup )

        • Selector.map() keeps values where crawlDatum.getFetchTime() <= curTime
        • Selector.reduce() collects until "limit" is reached, optionally skipping the url if "hostCount.get() > maxPerHost"

        So it caps URLs/host going into this fetchlist, not total URLs/host. That's what I thought, and it's insufficient for the reasons stated above. (It will incrementally fetch everything.)

        If the cap is 50k and a host has 70k active URLs in the crawldb, what Generate needs to say is "Here are the first 50k URLs added for this site, and I see only 3 are scheduled. We'll put 3 in this fetchlist."

        Generate can only enforce the limit if it knows which 50k URLs were added to the db first, and never fetches any of the remaining 20k.

        Hmm... it seems straightforward to modify Generator.java to count total URLs/host during map(), regardless of fetchTime. But I don't see what action we could take besides halting all fetches for the site. We'd have to traverse the crawldb in order of record-creation time to be able to see which were the first N added. (I think the crawldb is sorted by URL, not creation time.)
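        A simplified paraphrase of the per-fetchlist cap just described; this is not the actual Generator.Selector code, only the shape of its reduce-side logic, with hypothetical class and method names:

        // Models the behaviour described above: a global fetchlist limit plus an
        // optional per-host cap, counted only within the current fetchlist.
        import java.util.HashMap;
        import java.util.Map;

        public class FetchlistSelectorSketch {

          private final long limit;      // max URLs in the whole fetchlist
          private final int maxPerHost;  // analogous to generate.max.per.host (-1 = unlimited)
          private final Map<String, Integer> hostCount = new HashMap<String, Integer>();
          private long collected = 0;

          public FetchlistSelectorSketch(long limit, int maxPerHost) {
            this.limit = limit;
            this.maxPerHost = maxPerHost;
          }

          /** Returns true if a URL of this host should go into the current fetchlist. */
          public boolean accept(String host) {
            if (collected >= limit) {
              return false;              // fetchlist is full
            }
            if (maxPerHost > 0) {
              int count = hostCount.containsKey(host) ? hostCount.get(host) : 0;
              if (count >= maxPerHost) {
                return false;            // host already hit its per-fetchlist cap
              }
              hostCount.put(host, count + 1);
            }
            collected++;
            return true;
          }
        }

        Note how the counters start from zero for every new fetchlist, which is exactly why the cap does not bound the total URLs fetched per host across many generate/fetch/update cycles.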

        alan wootton added a comment -

        I don't think you can get what you want from any change to either of the map-reduce jobs that Generate is composed of.
        What you might need to do is write another map-reduce job to run before Generate.
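        A rough sketch of such a pre-Generate job: count crawldb URLs per host and emit the hosts that exceed a cap, whose output could then feed a filter like the one sketched earlier. It is written against the classic org.apache.hadoop.mapred API; the interfaces differed slightly in the Hadoop version Nutch 0.8 shipped with, and job wiring, input format, and output handling are omitted, so treat it as an outline rather than a drop-in tool:

        // Counts URLs per host from crawldb records (<url, CrawlDatum>) and emits
        // only the hosts whose total exceeds a cap. Outline only, not a Nutch tool.
        import java.io.IOException;
        import java.net.MalformedURLException;
        import java.net.URL;
        import java.util.Iterator;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.nutch.crawl.CrawlDatum;

        public class HostUrlCount {

          public static class Map extends MapReduceBase
              implements Mapper<Text, CrawlDatum, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            public void map(Text url, CrawlDatum datum,
                            OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
              try {
                out.collect(new Text(new URL(url.toString()).getHost()), ONE);
              } catch (MalformedURLException e) {
                // skip unparseable URLs
              }
            }
          }

          public static class Reduce extends MapReduceBase
              implements Reducer<Text, LongWritable, Text, LongWritable> {
            private static final long CAP = 50000;  // illustrative 50k URLs/host limit
            public void reduce(Text host, Iterator<LongWritable> counts,
                               OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
              long total = 0;
              while (counts.hasNext()) {
                total += counts.next().get();
              }
              if (total > CAP) {
                out.collect(host, new LongWritable(total));
              }
            }
          }
        }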

        Matt Kangas added a comment -

        Agreed that it's looking tough to do in Generate. Alternatively, we could try to keep the excess URLs from ever entering the crawldb, in CrawlDb.update(). (That has its own issues, as noted above...)

        Matt Kangas added a comment -

        Scratch my last comment. I assumed that URLFilters.filter() was applied while traversing the segment, as it was in 0.7. Not true in 0.8... it's applied during Generate.

        (Wow. This means the crawldb will accumulate lots of junk URLs over time. Is this a feature or a bug?)

        Doug Cutting added a comment -

        In 0.8, URLs are filtered both when generating and when updating the DB. Strictly speaking, filtering is only required when updating the DB, but it is also applied during generation to allow for changes to the filters. URLs are also filtered during fetching when following redirects.

        Matt Kangas added a comment -

        Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls. (Should have an option to turn off?)

        I also see that URLFilters.filter() is applied in Fetcher (for redirects) and ParseOutputFormat, plus other tools.

        Another possible choke-point: CrawlDbMerger.Merger.reduce(). The key is the URL, and keys are sorted. You can veto crawldb additions here. Could you effectively count URLs/host here? (Not sure how that works when distributed.) Would it require setting a Partitioner, like crawl.PartitionUrlByHost?
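        A sketch of that choke-point idea: because the keys are URLs in sorted order, URLs of the same host tend to arrive consecutively, so a running counter can veto additions once a host passes a cap. This is not CrawlDbMerger code; it only illustrates the counting trick, and it assumes a single scheme (so sort order groups hosts together) plus host-based partitioning (e.g. crawl.PartitionUrlByHost) so that one reduce sees all of a host's URLs:

        // Tracks the current host while records stream past in sorted-URL order
        // and vetoes everything beyond a per-host cap. Illustrative names only.
        import java.net.MalformedURLException;
        import java.net.URL;

        public class HostCapVetoSketch {

          private final long cap;
          private String currentHost = null;
          private long countForHost = 0;

          public HostCapVetoSketch(long cap) {
            this.cap = cap;
          }

          /** Call once per URL in sorted key order; returns false to veto the record. */
          public boolean accept(String url) {
            String host;
            try {
              host = new URL(url).getHost();
            } catch (MalformedURLException e) {
              return false;  // drop unparseable URLs
            }
            if (!host.equals(currentHost)) {  // a new host group begins
              currentHost = host;
              countForHost = 0;
            }
            countForHost++;
            return countForHost <= cap;
          }
        }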

        Sami Siren added a comment -

        >Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy,
        >albeit costly for large crawls. (Should have an option to turn off?)

        URL filtering inside the generator has been made optional in NUTCH-403.

        Markus Jelsma added a comment -

        Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

          People

          • Assignee: Unassigned
          • Reporter: Stefan Neufeind
          • Votes: 0
          • Watchers: 2
