[NUTCH-669] Consolidate code for Fetcher and Fetcher2 - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0
Component/s: fetcher
Labels: None

Description

I'd like to consolidate much of the code that Fetcher.java and Fetcher2.java have in common.

As far as I can tell, the two differ in the following ways:

- Fetcher relies on the Protocol to obey robots.txt and crawl-delay settings, whereas Fetcher2 implements them itself.
- Fetcher2 uses a different queueing model (one queue per crawl host) to enforce per-host limits without making the Protocol do it.
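
For illustration, the queue-per-host idea could be sketched roughly as follows. This is not Fetcher2's actual code: the class name `PerHostQueues`, its methods, and the single fixed crawl delay are all assumptions made for the example, and the clock is passed in explicitly to keep the behaviour easy to follow.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one FIFO queue per host, with a fixed delay between fetches to the same host. */
class PerHostQueues {
    /** Pending URLs for a single host plus the earliest time that host may be hit again. */
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0L;
    }

    // Insertion-ordered so hosts are scanned in the order they were first seen.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs; // assumed: one delay shared by all hosts

    PerHostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Adds a URL to the queue of the host it belongs to. */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    /** Returns a URL whose host is ready at nowMs, or null if every non-empty host is still waiting. */
    String poll(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && q.nextFetchTimeMs <= nowMs) {
                q.nextFetchTimeMs = nowMs + crawlDelayMs; // block this host until the delay has passed
                return q.urls.poll();
            }
        }
        return null;
    }
}
```

With this layout the politeness logic lives entirely in the fetcher: a fetch thread asks for the next eligible URL and never needs the Protocol to block or sleep on its behalf.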

I've begun work on this but want to check with people on the following:

Is there any reason to keep Fetcher at all, given that Fetcher2 seems to offer a superset of its functionality?

Is it on the roadmap to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?

Any other improvements wanted for Fetcher while I am in and around the code?

People

Assignee: Sami Siren

Reporter: Todd Lipcon

Votes: 2

Watchers: 1

Dates

Created: 04/Dec/08 21:12

Updated: 10/Apr/09 12:29

Resolved: 02/Mar/09 12:30