  1. Nutch
  2. NUTCH-289

CrawlDatum should store IP address

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

      Description

      If the CrawlDatum stored the IP address of its URL's host, then one could:

      • partition fetch lists on the basis of IP address, for better politeness;
      • cap the number of pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
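
      The two uses above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class IpFetchListSketch, its Entry type, and the methods partitionFor and capPerIp are hypothetical names and not part of Nutch, and each entry is assumed to already carry a resolved IP string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Nutch APIs.
final class IpFetchListSketch {

    /** A URL paired with the already-resolved IP address of its host. */
    static final class Entry {
        final String url;
        final String ip;

        Entry(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /** Maps an IP to a partition, so every URL on that IP lands in the same fetch list. */
    static int partitionFor(String ip, int numPartitions) {
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /** Keeps at most maxPerIp entries per IP address, preserving input order. */
    static List<Entry> capPerIp(List<Entry> entries, int maxPerIp) {
        Map<String, Integer> countByIp = new HashMap<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            // merge() returns the updated count for this IP.
            int count = countByIp.merge(e.ip, 1, Integer::sum);
            if (count <= maxPerIp) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

      Keying both steps on the IP rather than the hostname means that many hostnames served from one address share a single fetch list and a single page budget, which is the effect the description is after.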

      The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
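
      A minimal sketch of resolving the address at creation time, assuming a small in-memory cache in front of the JDK's resolver. HostIpResolverSketch and its resolve method are hypothetical and not part of Nutch; only java.net.URI and java.net.InetAddress are real JDK APIs here.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; not part of Nutch.
final class HostIpResolverSketch {

    // host -> IP string. Failed lookups are not cached, because
    // computeIfAbsent stores no mapping when the function returns null.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the IP address of the URL's host, or null if it cannot be determined. */
    String resolve(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // malformed URL
        }
        if (host == null) {
            return null; // URL has no host component
        }
        return cache.computeIfAbsent(host, h -> {
            try {
                return InetAddress.getByName(h).getHostAddress();
            } catch (UnknownHostException e) {
                return null; // host did not resolve
            }
        });
    }
}
```

      A real crawler would probably also need to expire cached entries, since a host's address can change between the time an outlink is discovered and the time it is fetched.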

        Attachments

        1. ipInCrawlDatumDraftV1.patch
          11 kB
          Stefan Groschupf
        2. ipInCrawlDatumDraftV4.patch
          11 kB
          Stefan Groschupf
        3. ipInCrawlDatumDraftV5.1.patch
          12 kB
          Enis Soztutar
        4. ipInCrawlDatumDraftV5.patch
          12 kB
          Stefan Groschupf

            People

            • Assignee: Unassigned
            • Reporter: Doug Cutting
            • Votes: 5
            • Watchers: 4

              Dates

              • Created:
                Updated:
                Resolved: