Nutch / NUTCH-395

Increase fetching speed

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: fetcher
    • Labels: None

    Description

      There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch, not trunk, and has not been tested at large scale. What does it change?

      Metadata - the original metadata implementation uses spell checking; the new version does not. A decorator is provided that can do it, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required.
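
      The decorator idea described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the names MetaStore, PlainMetadata and NormalizingMetadata are hypothetical, and a simple case/punctuation normalization stands in for real spell checking. The point is that the plain store pays no per-call checking cost, while callers that need tolerant name matching (for example when handling HTTP headers) can wrap it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical names throughout; this is not the Nutch API.
interface MetaStore {
    void set(String name, String value);
    String get(String name);
}

// Plain store: names are used exactly as given, one map access per call.
class PlainMetadata implements MetaStore {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) { values.put(name, value); }

    public String get(String name) { return values.get(name); }
}

// Decorator: maps loosely written names onto a canonical form, then delegates.
// The normalization here is an assumed stand-in for spell checking.
class NormalizingMetadata implements MetaStore {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String n : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(key(n), n);
        }
    }

    private final MetaStore delegate;

    NormalizingMetadata(MetaStore delegate) { this.delegate = delegate; }

    // Lookup key: letters and digits only, lower-cased.
    private static String key(String name) {
        return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Returns the canonical spelling if one is known, else the name unchanged.
    static String normalize(String name) {
        String canonical = CANONICAL.get(key(name));
        return canonical != null ? canonical : name;
    }

    public void set(String name, String value) { delegate.set(normalize(name), value); }

    public String get(String name) { return delegate.get(normalize(name)); }
}
```

      With this split, code on the hot path would use PlainMetadata directly, and only the protocol layer would construct new NormalizingMetadata(new PlainMetadata()).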

      Reading/writing various data structures - the patch tries to do I/O more efficiently; see the patch for details.

      Initial benchmark:

      A small benchmark was run to measure the effect of the changes, using a script that basically does the following:
      - inject a list of URLs into a fresh crawldb
      - create a fetchlist (10k URLs pointing to the local filesystem)
      - fetch
      - updatedb

      original code from 0.8-branch:
      real 10m51.907s
      user 10m9.914s
      sys 0m21.285s

      after applying the patch:
      real 4m15.313s
      user 3m42.598s
      sys 0m18.485s
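
      To put the two runs side by side, the timings can be converted to seconds and divided. The helper below is a small sketch for that arithmetic only; the class name TimeOutput is hypothetical and it assumes the "XmY.YYYs" format shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing timings written as "<minutes>m<seconds>s".
class TimeOutput {
    private static final Pattern FORMAT = Pattern.compile("(\\d+)m(\\d+(?:\\.\\d+)?)s");

    // Converts e.g. "4m15.313s" to 255.313 seconds.
    static double toSeconds(String text) {
        Matcher m = FORMAT.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected time format: " + text);
        }
        return Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2));
    }

    // Ratio of the old duration to the new one; values above 1 mean faster.
    static double speedup(String before, String after) {
        return toSeconds(before) / toSeconds(after);
    }
}
```

      For the real (wall-clock) times above, TimeOutput.speedup("10m51.907s", "4m15.313s") comes to about 2.55, i.e. the patched run took roughly 61% less time on this single benchmark.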

      Attachments

        1. NUTCH-395-trunk-metadata-only-2.patch (33 kB, Sami Siren)
        2. NUTCH-395-trunk-metadata-only.patch (32 kB, Sami Siren)
        3. nutch-0.8-performance.txt (77 kB, Sami Siren)

            People

              Assignee: Sami Siren
              Reporter: Sami Siren