NUTCH-289

CrawlDatum should store IP address

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

      Description

      If the CrawlDatum stored the IP address of the host of its URL, then one could:

      • partition fetch lists on the basis of IP address, for better politeness;
      • truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers.

      The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
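      For illustration, resolving a URL's host and keeping the raw address bytes could look roughly like this (a minimal standalone sketch using plain java.net; the ResolveSketch class and its resolve() helper are hypothetical, not Nutch code):

        import java.net.InetAddress;
        import java.net.URL;

        public class ResolveSketch {
          // Resolve a URL's host to its raw address bytes (4 bytes for IPv4),
          // returning null if the lookup fails; this is the value that would be
          // kept alongside the CrawlDatum.
          static byte[] resolve(String url) {
            try {
              String host = new URL(url).getHost();
              return InetAddress.getByName(host).getAddress();
            } catch (Exception e) {
              return null;
            }
          }

          public static void main(String[] args) {
            byte[] ip = resolve("http://www.example.org/");
            System.out.println(ip == null ? "unresolved" : ip.length + " address bytes");
          }
        }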

      Attachments

      1. ipInCrawlDatumDraftV5.1.patch (12 kB, Enis Soztutar)
      2. ipInCrawlDatumDraftV5.patch (12 kB, Stefan Groschupf)
      3. ipInCrawlDatumDraftV4.patch (11 kB, Stefan Groschupf)
      4. ipInCrawlDatumDraftV1.patch (11 kB, Stefan Groschupf)

        Activity

        Andrzej Bialecki added a comment -

        I'm not sure how to address round-robin DNS with your approach ...

        Also, I think the best place to resolve and record the IPs is in the fetcher, because it has to do it anyway. When generating we won't know the IPs until the next cycle, but the load on DNS will be much lower / more evenly distributed.

        Matt Kangas added a comment -

        +1 to saving IP address in CrawlDatum, wherever the value comes from. (Fetcher or otherwise)

        Stefan Groschupf added a comment -

        +1
        Andrzej, I agree that looking up the IP in ParseOutputFormat would be best, as Doug suggested.
        The biggest problem Nutch has at the moment is spam. The most common spam method is to set up a DNS server that returns the same IP for all subdomains and then deliver dynamically generated content.
        The spammers then just randomly generate subdomains within the content. It also often happens that they have many URLs, all of them pointing to the same server, i.e. the same IP.
        Buying more IP addresses is possible, but at the moment more expensive than buying more domains.

        Limiting the URLs per IP is a great approach to keep the crawler from getting stuck in honey pots with tens of thousands of URLs pointing to the same IP.
        However, to do so we need to have the IP available by generation time, not look it up when fetching.
        We would still be able to reuse the IP in the fetcher, and in case the IP is not available there we can catch that and look it up again.
        I don't think round-robin DNS is a huge problem, since only large sites use it, and in that case each IP is able to handle the requests.
        In any case, storing the IP in the CrawlDatum and using it to limit URLs per IP will be a big step forward in the fight against web spam.
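        To make the per-IP limit idea concrete, here is a standalone sketch (not the Generator's actual code; the PerIpLimitSketch class and the maxPerIp parameter are illustrative):

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          public class PerIpLimitSketch {
            // Keep at most maxPerIp URLs for any single IP (the IP is given as a
            // dotted string here for simplicity); everything beyond the cap is dropped.
            static List<String> limitByIp(Map<String, String> urlToIp, int maxPerIp) {
              Map<String, Integer> counts = new HashMap<String, Integer>();
              List<String> kept = new ArrayList<String>();
              for (Map.Entry<String, String> e : urlToIp.entrySet()) {
                String ip = e.getValue();
                int n = counts.containsKey(ip) ? counts.get(ip) : 0;
                if (n < maxPerIp) {
                  counts.put(ip, n + 1);
                  kept.add(e.getKey());
                }
              }
              return kept;
            }

            public static void main(String[] args) {
              Map<String, String> urlToIp = new HashMap<String, String>();
              urlToIp.put("http://a.example.org/", "1.2.3.4");
              urlToIp.put("http://b.example.org/", "1.2.3.4");
              urlToIp.put("http://c.example.org/", "1.2.3.4");
              System.out.println(limitByIp(urlToIp, 2)); // keeps at most 2 of the 3 URLs
            }
          }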

        Andrzej Bialecki added a comment -

        Re: lookup in ParseOutputFormat: I respectfully disagree. Consider the scenario where you run the Fetcher in non-parsing mode. This means that you have to make two DNS lookups - once when fetching, and a second time when parsing. These lookups will be executed from different processes, so there is no benefit from caching inside the Java resolver, i.e. the process will have to call the DNS server twice. The solution I proposed (record IPs in the Fetcher, but somewhere other than ParseOutputFormat, e.g. the crawl_fetch CrawlDatum) avoids this problem.

        Another issue is virtual hosting, i.e. many sites resolving to a single IP (web hotels). It's true that in many cases these are spam sites, but as often as not they are real, legitimate sites. If we generate/fetch by IP address we run the risk of dropping legitimate sites.

        Regarding the timing: it's true that during the first run we won't have IPs during generate (and subsequently for any newly injected URLs). In fact, since usually a significant part of the crawlDB is unfetched, we won't have this information for many URLs - unless we run this step in the Generator to resolve ALL hosts, and then run an equivalent of updatedb to actually record them in the crawldb.

        And the last issue that needs to be discussed: should we use metadata, or add a dedicated field in CrawlDatum? If the core should rely on IP addresses, we should add it as a dedicated field. If it would be purely optional (e.g. for use by optional plugins), then metadata seems a better place.

        Doug Cutting added a comment -

        It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement these features. Ideally we should arrange things for good DNS cache utilization, so that urls with the same host are resolved in a single map or reduce task. Currently this is the case during fetchlist generation, where lists are partitioned by host. Might that be a good place to insert DNS resolution? The fetchlists would need to be processed one more time, to re-partition and re-limit by IP, but fetchlists are relatively small, so this might not slow things too much. The map task itself could directly cache IP addresses, and perhaps even avoid many DNS lookups by using the IP from another CrawlDatum from the same host. A multi-threaded mapper might also be used to allow for network latencies.

        This should, at least initially, be an optional feature, and thus the IP should probably initially be stored in the metadata. I think it might be added as a re-generate step without changing any other code.
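        A small sketch of the host-level IP caching described above (illustrative only; the HostIpCacheSketch class is hypothetical, and the multi-threaded mapper and the re-partition step are omitted):

          import java.net.InetAddress;
          import java.net.UnknownHostException;
          import java.util.HashMap;
          import java.util.Map;

          public class HostIpCacheSketch {
            // One DNS lookup per host; later URLs from the same host reuse the result.
            private final Map<String, byte[]> cache = new HashMap<String, byte[]>();

            byte[] lookup(String host) {
              byte[] ip = cache.get(host);
              if (ip == null && !cache.containsKey(host)) {
                try {
                  ip = InetAddress.getByName(host).getAddress();
                } catch (UnknownHostException e) {
                  ip = null; // cache the failure too, to avoid repeated lookups
                }
                cache.put(host, ip);
              }
              return ip;
            }
          }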

        Stefan Groschupf added a comment -

        Andrzej, I'm afraid I was not able to communicate my ideas clearly and we may be misunderstanding each other.
        Resolving the IP in ParseOutputFormat would only be necessary for the new links discovered in the content.
        Since by default we parse during fetching, we would have the chance to use the JVM DNS cache, as I guess many new URLs point to the same host we fetched a particular page from. That means if we do not parse separately, we get the best JVM cache usage.
        We do not look up the IPs of the URLs we fetch at that time, since those URLs already have an IP that was resolved when they were first discovered during parsing.
        The only problem we need to handle is what happens when the IP of a host changes. We can simply look up the IP of a URL that throws a protocol error and compare the cached and freshly resolved IPs.
        An alternative approach would be to look up IPs during CrawlDb update, just for the new URLs.
        Sorry, I hope that describes my ideas more clearly.

        My personal preference is to store the IP in the CrawlDatum itself, not in the metadata.
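        The re-lookup-on-error check mentioned above could look roughly like this (a standalone sketch; the hostMoved() helper is hypothetical, not patch code):

          import java.net.InetAddress;
          import java.net.UnknownHostException;
          import java.util.Arrays;

          public class IpRecheckSketch {
            // After a protocol error, resolve the host again and compare with the IP
            // cached in the CrawlDatum; a mismatch suggests the host has moved.
            static boolean hostMoved(String host, byte[] cachedIp) throws UnknownHostException {
              byte[] current = InetAddress.getByName(host).getAddress();
              return !Arrays.equals(cachedIp, current);
            }
          }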

        Stefan Groschupf added a comment -

        To keep the discussion alive, I have attached a first draft for storing the IP in the CrawlDatum for public discussion.

        Some notes:
        The IP is stored as a byte[] in the CrawlDatum itself, not in the metadata.
        There is an IpAddressResolver MapRunnable tool to update a CrawlDb using multithreaded IP lookups.
        If an IP is available in the CrawlDatum, the Generator uses the "cached" IP.

        To discuss:
        I don't like the idea of post-processing the complete CrawlDb every time after an update.
        Processing the CrawlDb is expensive in storage usage and time.
        We could have a property "ipLookups" with possible values <never|duringParsing|postUpdateDb>.
        Then we can also add some code to look up the IP in the ParseOutputFormat as discussed, or start the IpAddressResolver as a job in the updateDb tool code.

        At the moment I write the IP address bytes like this:
        out.writeInt(ipAddress.length);
        out.write(ipAddress);
        I think for now we can define that byte[] ipAddress is always 4 bytes long, or should we already be IPv6 compatible today?
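        For comparison, a fixed-length IPv4 layout would drop the length prefix entirely (an illustrative sketch following the Writable read/write convention, not the attached patch):

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;

          public class Ip4FieldSketch {
            private byte[] ipAddress = new byte[4]; // IPv4 only

            public void write(DataOutput out) throws IOException {
              out.write(ipAddress);      // always exactly 4 bytes, no length prefix
            }

            public void readFields(DataInput in) throws IOException {
              in.readFully(ipAddress);   // reads exactly 4 bytes back
            }
          }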

        Please give me some comments; I have a strong interest in getting this issue fixed ASAP, and I'm willing to improve things as required.

        Stefan Groschupf added a comment -

        Attached a patch that always uses just 4 bytes for the IP, i.e. we ignore IPv6. This saves us 4 bytes per CrawlDatum for now.
        I tested the resolver tool with a 200+ million entry CrawlDb; on average a performance of 500 IP lookups/sec per box is possible using 1000 threads.

        I would really love to get this into the sources as the basic version of having the IP address in the CrawlDatum, since I'm working on a tool set of spam detectors that all need IP addresses somehow.
        Maybe let's exclude the tool but start with the CrawlDatum?
        Any improvement suggestions?
        Thanks.
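        For reference, the multithreaded lookup pattern could be sketched like this outside of MapReduce (the ThreadedResolveSketch class, the thread count and the host list are placeholders, not the IpAddressResolver tool itself):

          import java.net.InetAddress;
          import java.util.Arrays;
          import java.util.List;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.TimeUnit;

          public class ThreadedResolveSketch {
            public static void main(String[] args) throws InterruptedException {
              List<String> hosts = Arrays.asList("www.example.org", "www.example.com");
              // Many threads hide DNS latency; real throughput depends on the resolver.
              ExecutorService pool = Executors.newFixedThreadPool(100);
              for (final String host : hosts) {
                pool.execute(new Runnable() {
                  public void run() {
                    try {
                      byte[] ip = InetAddress.getByName(host).getAddress();
                      System.out.println(host + " -> " + ip.length + " address bytes");
                    } catch (Exception e) {
                      System.out.println(host + " -> unresolved");
                    }
                  }
                });
              }
              pool.shutdown();
              pool.awaitTermination(1, TimeUnit.MINUTES);
            }
          }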

        Stefan Groschupf added a comment -

        Release Candidate 1 of this patch.

        This patch contains:
        + adds the IP address to CrawlDatum version 5 (as byte[4])
        + an IpAddressResolver (MapRunnable) tool to look up the IPs multithreaded
        + adds a property to define whether the IpAddressResolver should be started as part of the CrawlDb update tool, to update the parse output folder of a segment (containing CrawlDatum status "linked") before updating the CrawlDb
        + uses the cached IP during generation

        Please review this patch and give me any improvement suggestions. I think this is a very important issue, since it helps to do real whole-web crawls without ending up in a honey pot after a few fetch iterations.
        Also, if you like, please vote for this issue. Thanks.

        Enis Soztutar added a comment -

        The version 5 patch does not apply to the current build, so I have fixed it and resent the patch (did not change any code). I think this patch should be included in the trunk.

        Uros Gruber added a comment -

        One question: why does the IP need to be in the CrawlDatum and not in the metadata?

        Doğacan Güney added a comment -

        It seems this issue has kind of died down, but this would be a great feature to have.

        Here is how I think we can do this one (my proposal is heavily based on Stefan Groschupf's work):

        • Add the IP as a field to CrawlDatum.
        • The Fetcher always resolves the IP and stores it in crawl_fetch (even if the CrawlDatum already has an IP).
        • An IpAddressResolver tool, similar to Stefan's, that reads crawl_fetch, crawl_parse (and probably crawldb) and (optionally) runs before updatedb.
        • map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field to the CrawlDatum's metadata to indicate where it comes from (crawldb, crawl_fetch or crawl_parse); this field is removed again in reduce. No lookup is performed in map().
        • reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any CrawlDatum already contains an IP address (IP addresses in crawl_fetch having precedence over ones in crawldb), then output all crawl_parse datums with this IP address. Otherwise, perform a lookup. This way we will not have to resolve the IP for most URLs (in a way, we still get the benefits of the JVM cache).

        A downside of this approach is that we will either have to read the crawldb twice or perform IP lookups for hosts that are in the crawldb (but not in crawl_fetch).

        • Use the cached IP during generation, if it exists.
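        A compact sketch of the reduce-side precedence logic described above, with <url, CrawlDatum> pairs reduced to simple (source, url, ip) records for clarity (illustrative plain Java, not the actual Hadoop job; all names are made up):

          import java.util.ArrayList;
          import java.util.List;

          public class IpReducePrecedenceSketch {
            // Stand-in for a <url, CrawlDatum> pair tagged with its source directory.
            static class Datum {
              String source; // "crawldb", "crawl_fetch" or "crawl_parse"
              String url;
              byte[] ip;     // null if not resolved yet
              Datum(String source, String url, byte[] ip) {
                this.source = source; this.url = url; this.ip = ip;
              }
            }

            interface Resolver { byte[] resolve(String host); }

            // For one host: crawl_fetch IPs take precedence over crawldb IPs; only if
            // neither has one do we hit DNS. The chosen IP is copied onto the
            // crawl_parse datums and everything is emitted again.
            static List<Datum> reduce(String host, List<Datum> values, Resolver resolver) {
              byte[] ip = null;
              for (Datum d : values)
                if (d.ip != null && "crawl_fetch".equals(d.source)) ip = d.ip;
              if (ip == null)
                for (Datum d : values)
                  if (d.ip != null) ip = d.ip; // e.g. an IP already in crawldb
              if (ip == null) ip = resolver.resolve(host); // lookup only as a last resort
              List<Datum> out = new ArrayList<Datum>();
              for (Datum d : values) {
                if ("crawl_parse".equals(d.source)) d.ip = ip;
                out.add(d);
              }
              return out;
            }
          }
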
        Markus Jelsma added a comment -

        Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

          People

          • Assignee: Unassigned
          • Reporter: Doug Cutting
          • Votes: 5
          • Watchers: 4
