Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
0.8
None
None
Description
If the CrawlDatum stored the IP address of the host of its URL, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- cap the number of pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
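To make the two points above concrete, here is a minimal sketch in plain Java. It is not Nutch code: `Entry` is a hypothetical stand-in for a CrawlDatum that already carries a resolved IP, and `partitionFor` and `capPerIp` are illustrative names. It assumes the IP is available as a string and that hashing it is an acceptable way to pick a fetch list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IpFetchListSketch {

    /** Hypothetical stand-in for a CrawlDatum entry that carries a resolved IP. */
    static final class Entry {
        final String url;
        final String ip;

        Entry(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /** Picks a fetch list so that all URLs on the same IP land in the same list. */
    static int partitionFor(String ip, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /** Keeps at most maxPerIp entries for each IP, preserving input order. */
    static List<Entry> capPerIp(List<Entry> candidates, int maxPerIp) {
        Map<String, Integer> counts = new HashMap<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : candidates) {
            int seen = counts.merge(e.ip, 1, Integer::sum);
            if (seen <= maxPerIp) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Because the cap is keyed on the IP, many hostnames that all point at one server share a single quota, which is the effect wanted against domain spammers.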
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
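As a rough illustration of resolving at creation time, the sketch below looks up the host of a URL once and caches the result per hostname. `HostIpResolverSketch` and `resolve` are hypothetical names, not part of Nutch; only `java.net.InetAddress` and `java.net.URI` from the standard library are used. It assumes a blocking lookup is acceptable and ignores DNS TTLs and hosts with multiple addresses.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

final class HostIpResolverSketch {

    // One lookup per hostname; failed lookups are cached as null.
    private final Map<String, String> cache = new HashMap<>();

    /**
     * Returns the IP address for the URL's host, or null if the URL has no
     * host or the host cannot be resolved. Malformed URLs make URI.create
     * throw IllegalArgumentException, which is left to the caller.
     */
    String resolve(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        if (cache.containsKey(host)) {
            return cache.get(host);
        }
        String ip;
        try {
            ip = InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            ip = null;
        }
        cache.put(host, ip);
        return ip;
    }
}
```

The returned string would then be stored alongside the URL when the entry for a new outlink is created, or filled in later during the CrawlDB update if resolution at creation time proves too slow.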