NUTCH-669: Consolidate code for Fetcher and Fetcher2

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None

    Description

      I'd like to consolidate much of the code that Fetcher.java and Fetcher2.java have in common.

      As far as I can tell, the two differ in the following ways:

      • Fetcher relies on the Protocol to obey robots.txt and crawl-delay settings, whereas Fetcher2 implements them itself.
      • Fetcher2 uses a different queueing model (one queue per crawl host) to enforce per-host limits without making the Protocol do it.
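
      As a rough illustration of the queue-per-host idea, here is a minimal sketch. It is hypothetical and is not the Fetcher2 code: the class name PerHostQueues, its methods, and the single fixed crawl delay are all assumptions made for the example. Each host gets its own FIFO queue, and a URL is only handed out once that host's delay has elapsed, so the limiting happens in the fetcher rather than in the Protocol.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not taken from Nutch. */
class PerHostQueues {
    private final long crawlDelayMs;
    // One FIFO queue of pending URLs per host, kept in insertion order.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Earliest time (ms) at which each host may be fetched again.
    private final Map<String, Long> nextFetchTime = new LinkedHashMap<>();

    PerHostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Appends the URL to the queue of its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // URLs without a host share one queue in this sketch
        }
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
    }

    /**
     * Returns the next URL whose host is allowed to be fetched at nowMs,
     * or null if every non-empty queue still has to wait (or all are empty).
     */
    String next(long nowMs) {
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            Deque<String> queue = entry.getValue();
            if (queue.isEmpty()) {
                continue;
            }
            String host = entry.getKey();
            if (nowMs < nextFetchTime.getOrDefault(host, 0L)) {
                continue; // this host is still inside its crawl delay
            }
            nextFetchTime.put(host, nowMs + crawlDelayMs);
            return queue.pollFirst();
        }
        return null;
    }
}
```

      With a delay of 1000 ms and two URLs queued for one host, the second URL is withheld until 1000 ms after the first was handed out, while URLs for other hosts remain available in the meantime.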

      I've begun work on this but want to check with people on the following:

      • Why does Fetcher exist at all, given that Fetcher2 seems to offer a superset of its functionality?
      • Is it on the roadmap to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
      • Are any other improvements to Fetcher wanted while I am in and around the code?

          People

            Assignee: Sami Siren
            Reporter: Todd Lipcon
            Votes: 2
            Watchers: 1
