Key: NUTCH-395
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Sami Siren
Reporter: Sami Siren
Votes: 0
Watchers: 0

Nutch

Increase fetching speed

Created: 29/Oct/06 08:40 PM   Updated: 18/Apr/07 03:44 PM
Component/s: fetcher
Affects Version/s: 0.8.1, 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments (text files, licensed for inclusion in ASF works):
  nutch-0.8-performance.txt                 2006-10-29 08:42 PM   Sami Siren   77 kB
  NUTCH-395-trunk-metadata-only-2.patch     2006-11-12 08:31 PM   Sami Siren   33 kB
  NUTCH-395-trunk-metadata-only.patch       2006-11-11 08:56 AM   Sami Siren   32 kB

Resolution Date: 13/Nov/06 07:48 PM


Description
There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch, not trunk, and it has not been tested at large scale. What does it change?

Metadata - the original metadata implementation uses spellchecking; the new version does not. A decorator is provided that can do the spellchecking, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required.

Reading/writing various data structures - the patch tries to do I/O more efficiently; see the patch for details.
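
To illustrate the metadata change described above, here is a minimal sketch of the decorator idea: a plain store that does exact-match lookups, plus an optional wrapper that corrects sloppy header names before delegating. All names (MetaStore, PlainMetaStore, NormalizingMetaStore) are hypothetical and are not taken from the attached patch; the name normalization is a simplified stand-in for real spellchecking.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface; not the API from the attached patch.
interface MetaStore {
    void set(String name, String value);
    String get(String name);
}

// Plain store: exact-match lookups, no name correction on the hot path.
class PlainMetaStore implements MetaStore {
    private final Map<String, String> values = new HashMap<>();
    public void set(String name, String value) { values.put(name, value); }
    public String get(String name) { return values.get(name); }
}

// Decorator: maps sloppy header names onto canonical ones, then delegates.
// Assumption: normalization stands in for the original spellchecking.
class NormalizingMetaStore implements MetaStore {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String n : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(key(n), n);
        }
    }

    private final MetaStore delegate;

    NormalizingMetaStore(MetaStore delegate) { this.delegate = delegate; }

    // Reduce a name to lowercase letters and digits for a tolerant comparison.
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Return the canonical spelling if one is known, else the name unchanged.
    static String normalize(String name) {
        String canonical = CANONICAL.get(key(name));
        return canonical != null ? canonical : name;
    }

    public void set(String name, String value) { delegate.set(normalize(name), value); }
    public String get(String name) { return delegate.get(normalize(name)); }
}
```

With this split, only code that handles untrusted header names would wrap its store, for example new NormalizingMetaStore(new PlainMetaStore()), while everything else pays for a plain map lookup only.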

Initial benchmark:

A small benchmark was done to measure the performance effect of the changes, using a script that basically does the following:
- inject a list of URLs into a fresh crawldb
- create a fetchlist (10k URLs pointing to the local filesystem)
- fetch
- updatedb

original code from 0.8-branch:
real 10m51.907s
user 10m9.914s
sys 0m21.285s

after applying the patch:
real 4m15.313s
user 3m42.598s
sys 0m18.485s
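
The two sets of timings above can be compared with a small helper. This is only a sketch: the class and method names (TimingComparison, toSeconds, speedup) are hypothetical, and it assumes the values keep the "<minutes>m<seconds>s" form shown in the results.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing the timings quoted in this issue.
class TimingComparison {
    // Matches values such as "10m51.907s".
    private static final Pattern TIME = Pattern.compile("(\\d+)m(\\d+(?:\\.\\d+)?)s");

    // Convert a "<minutes>m<seconds>s" value to seconds.
    static double toSeconds(String value) {
        Matcher m = TIME.matcher(value.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected time format: " + value);
        }
        return Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2));
    }

    // Ratio of the time before the patch to the time after it.
    static double speedup(String before, String after) {
        return toSeconds(before) / toSeconds(after);
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "real: %.2fx%n", speedup("10m51.907s", "4m15.313s"));
        System.out.printf(Locale.ROOT, "user: %.2fx%n", speedup("10m9.914s", "3m42.598s"));
        System.out.printf(Locale.ROOT, "sys:  %.2fx%n", speedup("0m21.285s", "0m18.485s"));
    }
}
```

For the numbers above, wall-clock time drops from 651.907 s to 255.313 s, a speedup of roughly 2.5x, and most of the gain is in user time; sys time changes only slightly.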


