Nutch / NUTCH-669

Consolidate code for Fetcher and Fetcher2

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None

      Description

      I'd like to consolidate much of the code that Fetcher.java and Fetcher2.java have in common.

      As far as I can tell, the two differ in the following ways:

      • Fetcher relies on the Protocol to obey robots.txt and crawl delay settings, whereas Fetcher2 implements them itself.
      • Fetcher2 uses a different queueing model (one queue per crawl host) to enforce per-host limits without making the Protocol do it.
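
      To make the queue-per-host idea concrete, here is a minimal sketch of how a fetcher could enforce a crawl delay itself instead of leaving it to the Protocol. It is an illustration only, not the actual Fetcher2 code: the class name PerHostQueues, its methods, and the single fixed crawlDelayMs are all hypothetical, and robots.txt handling is left out.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-host queueing; not taken from Nutch.
class PerHostQueues {
    // One FIFO queue per host, plus the earliest time the host may be hit again.
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0L;
    }

    // Insertion-ordered so hosts are scanned in a predictable order.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs; // assumed: one fixed delay for every host

    PerHostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    // Route a URL to the queue that belongs to its host.
    synchronized void add(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    // Return the next URL whose host is past its crawl delay,
    // or null if every non-empty queue is still waiting.
    synchronized String poll(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && nowMs >= q.nextFetchTimeMs) {
                q.nextFetchTimeMs = nowMs + crawlDelayMs;
                return q.urls.poll();
            }
        }
        return null;
    }
}
```

      Fetcher threads would call poll(System.currentTimeMillis()) in a loop and hand the returned URL to the Protocol, which then only has to fetch. Because the delay is tracked per queue, a slow or rate-limited host does not hold up URLs from other hosts.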

      I've begun work on this but want to check with people on the following:

      • Is there any reason for Fetcher to exist at all, given that Fetcher2 seems to offer a superset of its functionality?
      • Is it on the roadmap to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
      • Are there any other improvements wanted for Fetcher while I am working in and around this code?

            People

            • Assignee: Sami Siren (siren)
            • Reporter: Todd Lipcon (tlipcon)
            • Votes: 2
            • Watchers: 1

              Dates

              • Created:
                Updated:
                Resolved: