Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 0.8
- None
- None
Description
Generator uses the host name, as extracted from the URL, to enforce the maximum number of URLs per unique host (when generate.max.per.host is set > 0). This is meant to prevent fetchlists from becoming dominated by URLs from the same hosts, which in turn would clash with "politeness" rules.
However, the http plugins (lib-http HttpBase.blockAddr) don't use the host name; they use its IP address instead (explicitly doing a DNS lookup on the host name extracted from the URL). This leads to the following undesirable behavior:
- if a DNS name resolves to different IPs (round-robin balancing), then technically we are in violation of the "politeness" rules, because lib-http doesn't see this as a conflict and permits concurrent accesses to the same host name.
- if different DNS names resolve to the same IP address (very common: CNAMEs, subdomains, web hosting, etc.), then the purpose of generate.max.per.host is defeated, because lib-http will block more frequently than intended, leading to excessive numbers of "Exceeded http.max.delays" exceptions.
Proposed solution: synchronize Generator and lib-http in their interpretation of "unique host". Introduce a boolean property which instructs both Generator and lib-http to use the same notion of "unique host": either IP addresses in both places, or host names in both places.
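To illustrate the proposal, here is a minimal sketch of a shared helper that both Generator and lib-http could call to obtain the "unique host" key. It is not existing Nutch code: the class name HostKey, the byIp flag (standing in for the proposed boolean property), and the pluggable resolver are all assumptions made for this example. Falling back to the host name when DNS resolution fails is likewise only one possible choice.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): one shared definition of "unique host". */
final class HostKey {
    // Stands in for the proposed boolean property: true = group by IP, false = by host name.
    private final boolean byIp;
    // Maps a host name to an IP string, or null if it cannot be resolved.
    // Pluggable so that the lookup can be cached or replaced in tests.
    private final Function<String, String> resolver;

    HostKey(boolean byIp, Function<String, String> resolver) {
        this.byIp = byIp;
        this.resolver = resolver;
    }

    /** A resolver backed by a real DNS lookup; returns null when the host is unknown. */
    static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** Returns the key that both the per-host limit and the blocking logic would use. */
    String keyFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!byIp) {
            return host;
        }
        String ip = resolver.apply(host);
        // Assumption: fall back to the host name if the lookup fails.
        return ip != null ? ip : host;
    }
}
```

With new HostKey(true, HostKey::dnsLookup), two names that resolve to the same address yield the same key, so the per-host limit and the blocking logic would agree on what counts as one host. With false, both sides would key on the lower-cased host name and no DNS lookup is performed.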