NUTCH-1713: IpAddressResolver and DNSCache

Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Patch Available

    Description

      Hi Lewis,

      Following up on the mail I sent you, I am attaching my patch for storing IP addresses in apache-nutch-1.5.1.

      (https://issues.apache.org/jira/browse/NUTCH-289 might also be an appropriate place for it!)

      In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To stay polite we had to switch to a 'byIP' policy to guarantee a delay of at least one minute between requests to the same server. Crawling 'byHost' was not an option, because many sites use up to several thousand subdomains hosted on a single server with one IP address.
      As our crawl progressed I noticed that crawling by IP slowed down, because while generating the URL lists Nutch has to resolve the IP address of each URL in order to build the per-IP queues.

      This patch takes a simple approach: once an IP address has been determined, it is written into the metadata field of the CrawlDatum object. When a crawl cycle has finished its fetch job, an additional map-reduce job is started to determine the IP addresses of newly fetched and parsed URLs. New URLs are inserted into the crawldb together with their IP addresses whenever an IP address could be determined.
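
      The idea of resolving once and reusing the stored value can be sketched roughly as follows. This is not code from the patch: a plain Map stands in for the CrawlDatum metadata, and the key name IP_KEY, the Resolver interface and the class name IpMetadataSketch are hypothetical names chosen for illustration.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Map;

// Sketch only: a Map stands in for the CrawlDatum metadata field.
final class IpMetadataSketch {
    // Hypothetical metadata key; the patch may use a different one.
    static final String IP_KEY = "ip.address";

    // Hypothetical lookup abstraction so the DNS call can be swapped out.
    interface Resolver {
        String resolve(String host) throws UnknownHostException;
    }

    // Resolver backed by the JVM's normal DNS lookup.
    static Resolver systemDns() {
        return host -> InetAddress.getByName(host).getHostAddress();
    }

    // Returns the stored IP if present; otherwise resolves it once,
    // stores it in the metadata, and returns it. Returns null (and
    // stores nothing) if the address cannot be determined.
    static String ipFor(String url, Map<String, String> metadata, Resolver resolver) {
        String stored = metadata.get(IP_KEY);
        if (stored != null) {
            return stored;
        }
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null;
            }
            String ip = resolver.resolve(host);
            metadata.put(IP_KEY, ip);
            return ip;
        } catch (UnknownHostException | IllegalArgumentException e) {
            return null;
        }
    }
}
```

      With the address stored next to the URL, a later generate step could read it from the metadata instead of asking the DNS again.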

      The patch also contains two classes, IpAddressResolver.java and DNSCache.java, which cache IP addresses already fetched from the DNS and control the number of concurrent DNS calls from each map job. Since many URLs with the same IP address end up in the same queue, I wanted to minimize the load incurred in building the queues. Caching IP addresses in memory should not consume much memory. To avoid too many concurrent requests from the crawler to a DNS server, I added some code to restrict the number of parallel DNS requests.
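
      As a rough illustration of combining an in-memory cache with a cap on parallel lookups, here is a minimal sketch. It is not the DNSCache or IpAddressResolver code from the patch: the class name DnsCacheSketch, the Lookup interface and the maxConcurrentLookups parameter are assumptions, and a java.util.concurrent.Semaphore is just one possible way to limit concurrency.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch only: host-to-IP cache with a bounded number of parallel lookups.
final class DnsCacheSketch {
    // Hypothetical lookup abstraction so the DNS call can be swapped out.
    interface Lookup {
        String resolve(String host) throws UnknownHostException;
    }

    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;
    private final Lookup lookup;

    DnsCacheSketch(int maxConcurrentLookups, Lookup lookup) {
        this.permits = new Semaphore(maxConcurrentLookups);
        this.lookup = lookup;
    }

    // Convenience factory using the JVM's normal DNS lookup.
    static DnsCacheSketch usingSystemDns(int maxConcurrentLookups) {
        return new DnsCacheSketch(maxConcurrentLookups,
                host -> InetAddress.getByName(host).getHostAddress());
    }

    // Returns the cached IP, or performs a lookup while holding a permit.
    // Returns null if the host cannot be resolved; failures are not cached.
    String resolve(String host) {
        String ip = cache.get(host);
        if (ip != null) {
            return ip;
        }
        permits.acquireUninterruptibly(); // blocks while too many lookups are running
        try {
            ip = cache.get(host); // another thread may have filled the cache meanwhile
            if (ip == null) {
                ip = lookup.resolve(host);
                cache.put(host, ip);
            }
            return ip;
        } catch (UnknownHostException e) {
            return null;
        } finally {
            permits.release();
        }
    }
}
```

      In this sketch two threads may still resolve the same uncached host at the same time; the semaphore only bounds how many lookups are in flight overall.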

      I have been using this code in production for about three quarters of this year and it seems to work fine. The four configuration entries should be self-explanatory.

      Cheers, Walter

      Attachments

        1. NUTCH-1713-trunk.patch
          32 kB
          Lewis John McGibbney

            People

              Assignee: Unassigned
              Reporter: Lewis John McGibbney (lewismc)