Description
I'd like to consolidate much of the common code between Fetcher and Fetcher2.
As far as I can tell, the two differ in the following ways:
- Fetcher relies on the Protocol to obey robots.txt and crawl delay settings, whereas Fetcher2 implements them itself.
- Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it.
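To make the second point concrete, here is a minimal sketch of what a queue-per-host model with a crawl delay can look like. It is not Fetcher2's actual code: the class name HostQueues, its methods, and the fixed per-host delay are all hypothetical, and the sketch is single-threaded for brevity.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical illustration of per-host queues; names are assumptions, not Nutch APIs. */
public class HostQueues {

    /** Pending URLs for one host plus the earliest time the next one may be fetched. */
    private static final class HostQueue {
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0L;
    }

    // Insertion-ordered so hosts are scanned in a predictable order.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs;

    public HostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Files the URL under its host's queue, creating the queue on first use. */
    public void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    /**
     * Returns the next URL from any host whose crawl delay has elapsed at nowMs,
     * or null if every non-empty queue is still waiting. The limiting happens
     * here, in the fetcher, so the protocol layer never has to block.
     */
    public String poll(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && nowMs >= q.nextFetchTimeMs) {
                q.nextFetchTimeMs = nowMs + crawlDelayMs;
                return q.urls.remove();
            }
        }
        return null;
    }
}
```

A real fetcher would also need thread safety and a per-host delay taken from robots.txt; both are left out here.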
I've begun work on this, but would like to check a few things first:
- Is there any reason to keep Fetcher at all, given that Fetcher2 seems to offer a superset of its functionality?
- Is it on the roadmap to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
- Are there any other improvements wanted for Fetcher while I am working in this code?