  1. Nutch
  2. NUTCH-289

CrawlDatum should store IP address


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

    Description

      If the CrawlDatum stored the IP address of the host of its URL, then one could:

      • partition fetch lists on the basis of IP address, for better politeness;
      • limit the number of pages fetched per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.

      The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
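
      The two uses listed above can be illustrated with a minimal sketch. This is not Nutch code: the class IpFetchListSketch, the method partitionByIp, and the parameters resolver and maxPerIp are hypothetical names chosen for illustration, and the host-to-IP lookup is passed in as a function on the assumption that resolution happens elsewhere (for example when the CrawlDatum is created).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Nutch.
final class IpFetchListSketch {

    /**
     * Groups URLs by the IP address of their host and keeps at most
     * maxPerIp URLs for each IP. URLs whose host cannot be parsed or
     * resolved are skipped.
     *
     * @param urls     candidate URLs for a fetch list
     * @param resolver maps a hostname to an IP string, or null if unknown
     *                 (assumed to be backed by a stored or cached lookup)
     * @param maxPerIp upper bound on URLs kept per IP address
     */
    static Map<String, List<String>> partitionByIp(
            List<String> urls, Function<String, String> resolver, int maxPerIp) {
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // malformed URL: skip it
            }
            if (host == null) {
                continue; // no host component
            }
            String ip = resolver.apply(host);
            if (ip == null) {
                continue; // unresolved host
            }
            List<String> bucket = byIp.computeIfAbsent(ip, k -> new ArrayList<>());
            if (bucket.size() < maxPerIp) {
                bucket.add(url); // cap is per IP, not per hostname
            }
        }
        return byIp;
    }
}
```

      Each map entry then corresponds to one partition of the fetch list. Because the limit is applied per IP, many hostnames that share one address also share a single quota.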

      Attachments

        1. ipInCrawlDatumDraftV1.patch
          11 kB
          Stefan Groschupf
        2. ipInCrawlDatumDraftV4.patch
          11 kB
          Stefan Groschupf
        3. ipInCrawlDatumDraftV5.patch
          12 kB
          Stefan Groschupf
        4. ipInCrawlDatumDraftV5.1.patch
          12 kB
          Enis Soztutar


          People

            Assignee: Unassigned
            Reporter: Doug Cutting (cutting)
            Votes: 5
            Watchers: 3
