[NUTCH-289] CrawlDatum should store IP address - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.8
Fix Version/s: None
Component/s: fetcher
Labels: None

Description

If the CrawlDatum stored the IP address of its URL's host, then one could:

- partition fetch lists by IP address, for better politeness;
- limit the number of pages fetched per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
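
As a rough illustration of the two uses listed above, here is a minimal Java sketch. It assumes each URL already carries a resolved IP string. The class `IpFetchListSketch`, the `Entry` type, and the methods `partitionFor` and `capPerIp` are hypothetical names made up for this example and are not part of Nutch or of any attached patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is not Nutch's generator or partitioner code. */
final class IpFetchListSketch {

    /** A URL paired with the already-resolved IP of its host (illustrative type). */
    static final class Entry {
        final String url;
        final String ip;

        Entry(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /** Maps an IP to a partition so that all URLs on one IP land in the same fetch list. */
    static int partitionFor(String ip, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /** Keeps at most maxPerIp entries for each IP, preserving input order. */
    static List<Entry> capPerIp(List<Entry> entries, int maxPerIp) {
        Map<String, Integer> seen = new HashMap<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            int count = seen.merge(e.ip, 1, Integer::sum);
            if (count <= maxPerIp) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Because the partition depends only on the IP, two hostnames served from the same address end up in the same fetch list, and the per-IP cap counts them together, which is the effect the description is after for spam domains sharing one server.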

The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
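
One way to picture the resolution step is the sketch below, which looks up a URL's host with the standard `java.net.InetAddress` API and caches the result per host. The class `HostIpResolverSketch` and its methods are hypothetical; where such a lookup would actually be called in Nutch (outlink creation or CrawlDB update) is left open, as in the description.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of resolving and caching host IPs; not taken from the patches. */
final class HostIpResolverSketch {

    // host name -> textual IP address, so each host is looked up at most once
    private final Map<String, String> cache = new HashMap<>();

    /** Returns the IP for the URL's host, or null if the host cannot be resolved. */
    String resolve(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String ip = cache.get(host);
        if (ip == null) {
            try {
                ip = InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                return null; // leave unresolved; failures are not cached
            }
            cache.put(host, ip);
        }
        return ip;
    }

    /** Number of distinct hosts resolved so far. */
    int cachedHosts() {
        return cache.size();
    }
}
```

A real crawler would likely also need to bound the cache and decide how long a resolved address stays valid, since a host's IP can change between crawl cycles.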

Attachments

- ipInCrawlDatumDraftV1.patch (11 kB), Stefan Groschupf, 05/Jun/06 22:51
- ipInCrawlDatumDraftV4.patch (11 kB), Stefan Groschupf, 07/Jun/06 23:17
- ipInCrawlDatumDraftV5.patch (12 kB), Stefan Groschupf, 12/Jun/06 22:50
- ipInCrawlDatumDraftV5.1.patch (12 kB), Enis Soztutar, 16/Nov/06 08:43

People

Assignee: Unassigned
Reporter: Doug Cutting
Votes: 5
Watchers: 3

Dates

Created: 27/May/06 03:36
Updated: 01/Apr/11 15:03
Resolved: 01/Apr/11 15:03