  1. Nutch
  2. NUTCH-339

Refactor nutch to allow fetcher improvements

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      n/a

      Description

      As I (and Stefan?) see it, there are two major areas where the current fetcher
      could be improved in terms of speed:

      1. The politeness code and the way it is implemented is the biggest
      problem of the current fetcher (together with robots.txt handling).
      A simple code change, such as replacing it with a PriorityQueue-based
      solution, showed very promising results in increased I/O.

      2. Changing the fetcher to use non-blocking I/O (this requires a great amount
      of work, as we need to implement the protocols from scratch again).
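
      To make the PriorityQueue idea in item 1 concrete, here is a minimal sketch, not
      the patch attached to this issue. The class name PoliteQueue, its methods, and the
      single fixed crawl delay are all assumptions made for illustration. Hosts with
      pending URLs are ordered by the time they may next be contacted, so a fetcher
      thread never blocks waiting on one busy host while other hosts are ready.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Queue;

/** Hypothetical sketch of queue-based politeness; not actual Nutch code. */
final class PoliteQueue {
    /** Pending URLs of one host plus the earliest time it may be contacted again. */
    private static final class HostQueue {
        final String host;
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTime; // milliseconds; 0 means "fetchable immediately"
        HostQueue(String host) { this.host = host; }
    }

    private final long crawlDelayMs; // assumed: one fixed delay for every host
    private final Map<String, HostQueue> byHost = new HashMap<>();
    // Invariant: a host is in this queue if and only if it has pending URLs.
    private final PriorityQueue<HostQueue> ready =
        new PriorityQueue<>(Comparator.comparingLong((HostQueue h) -> h.nextFetchTime));

    PoliteQueue(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

    /** Queues a URL under its host. */
    void add(String host, String url) {
        HostQueue hq = byHost.computeIfAbsent(host, HostQueue::new);
        boolean wasEmpty = hq.urls.isEmpty();
        hq.urls.add(url);
        if (wasEmpty) {
            ready.add(hq); // host becomes schedulable (again)
        }
    }

    /** Returns a URL that may be fetched at time 'now', or null if every host must wait. */
    String poll(long now) {
        HostQueue hq = ready.peek();
        if (hq == null || hq.nextFetchTime > now) {
            return null;
        }
        ready.poll(); // remove before changing the ordering key
        String url = hq.urls.poll();
        hq.nextFetchTime = now + crawlDelayMs;
        if (!hq.urls.isEmpty()) {
            ready.add(hq); // re-insert with the new fetch time
        }
        return url;
    }
}
```

      The time is passed in explicitly rather than read from the system clock, which
      keeps the sketch deterministic; a real fetcher would also need thread safety and
      per-host delays.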

      I would like to start working towards #1 by first refactoring
      the current code (the plugins, actually) in the following way:

      1. Move robots.txt handling out of the lib-http plugin.
      Even though robots.txt is related only to HTTP, leaving it in lib-http
      does not allow other kinds of scheduling strategies to be implemented
      (it is hardcoded to fetch robots.txt from the same thread when requesting
      a page from a site from which it hasn't yet tried to load robots.txt).

      2. Move the politeness code out of the lib-http plugin.
      It is useful outside HTTP as well, and the current design limits
      changing the implementation (to a queue-based one).
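
      One way the first refactoring could look once robots.txt handling lives outside
      the protocol plugin is sketched below. All type names (RobotRules,
      RobotRulesCache, FetchPlanner) are hypothetical and are not taken from Nutch or
      from the attached patches. The point is only that the scheduler, rather than the
      thread requesting a page, decides when robots.txt gets fetched.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical core-level view of parsed robots.txt rules for one host. */
interface RobotRules {
    boolean isAllowed(String path);
}

/** Hypothetical cache owned by the core rather than by a protocol plugin. */
final class RobotRulesCache {
    private final Map<String, RobotRules> rulesByHost = new ConcurrentHashMap<>();

    /** Empty result means robots.txt for this host has not been loaded yet. */
    Optional<RobotRules> lookup(String host) {
        return Optional.ofNullable(rulesByHost.get(host));
    }

    /** Called by whichever component fetched and parsed robots.txt. */
    void store(String host, RobotRules rules) {
        rulesByHost.put(host, rules);
    }
}

/** Hypothetical scheduler hook: decides what to do next for a host and path. */
final class FetchPlanner {
    enum Decision { FETCH_PAGE, FETCH_ROBOTS_FIRST, SKIP }

    private final RobotRulesCache cache;

    FetchPlanner(RobotRulesCache cache) { this.cache = cache; }

    Decision decide(String host, String path) {
        return cache.lookup(host)
            .map(rules -> rules.isAllowed(path) ? Decision.FETCH_PAGE : Decision.SKIP)
            .orElse(Decision.FETCH_ROBOTS_FIRST); // schedule robots.txt like any other fetch
    }
}
```

      With this split, a FETCH_ROBOTS_FIRST decision can be queued and rate-limited
      by the same politeness mechanism as ordinary pages, instead of being fetched
      inline by the requesting thread.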

      Where to move these? My suggestion is the Nutch core. Does anybody
      see problems with this?

      These refactorings are to be done in such a way that none of the
      current functionality is (at least deliberately) changed, thus leaving
      room to build the next-generation fetcher(s) without destroying the
      old one at the same time.

        Attachments

        1. patch.txt
          39 kB
          Andrzej Bialecki
        2. patch2.txt
          44 kB
          Andrzej Bialecki
        3. patch3.txt
          44 kB
          Doğacan Güney
        4. patch4-trunk.txt
          39 kB
          Andrzej Bialecki
        5. patch4-fixed.txt
          41 kB
          Andrzej Bialecki
        6. Fetcher2 for .81
          30 kB
          chee.wu

            People

            • Assignee: Andrzej Bialecki (ab)
            • Reporter: Sami Siren (siren)
            • Votes: 2
            • Watchers: 4