NUTCH-289

CrawlDatum should store IP address

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

      Description

      If the CrawlDatum stored the IP address of the host of its URL, then one could:

      • partition fetch lists on the basis of IP address, for better politeness;
      • limit the number of pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.

      The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
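
      As a rough illustration of the two uses listed above, the following Java sketch assumes that each entry already carries a resolved IP address. The names IpPartitionSketch, Entry, partitionFor, and truncatePerIp are hypothetical; they are not part of Nutch or of the attached patches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Nutch code and not taken from the attached patches.
final class IpPartitionSketch {

    // Stand-in for a CrawlDatum-like record that also stores the host's IP.
    static final class Entry {
        final String url;
        final String ip;

        Entry(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    // Map an IP address to a partition index in [0, numPartitions), so that
    // all URLs served from the same IP land in the same fetch list.
    static int partitionFor(String ip, int numPartitions) {
        return (ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Keep at most maxPerIp entries per IP address, preserving input order.
    // Hostnames that share an IP also share one budget.
    static List<Entry> truncatePerIp(List<Entry> entries, int maxPerIp) {
        Map<String, Integer> seenPerIp = new HashMap<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            int seen = seenPerIp.merge(e.ip, 1, Integer::sum);
            if (seen <= maxPerIp) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

      With a limit of two pages per IP, a third URL from a different hostname on the same IP would be dropped, which a per-hostname limit would not catch. Hashing the IP string is just one possible partitioning choice for this sketch.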

      1. ipInCrawlDatumDraftV1.patch (11 kB, Stefan Groschupf)
      2. ipInCrawlDatumDraftV4.patch (11 kB, Stefan Groschupf)
      3. ipInCrawlDatumDraftV5.1.patch (12 kB, Enis Soztutar)
      4. ipInCrawlDatumDraftV5.patch (12 kB, Stefan Groschupf)

          People

          • Assignee: Unassigned
          • Reporter: Doug Cutting
          • Votes: 5
          • Watchers: 4
