Key: NUTCH-395
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Sami Siren
Reporter: Sami Siren
Votes: 0
Watchers: 0
Nutch

Increase fetching speed

Created: 29/Oct/06 08:40 PM   Updated: 18/Apr/07 03:44 PM
Component/s: fetcher
Affects Version/s: 0.8.1, 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
nutch-0.8-performance.txt (77 kB), 2006-10-29 08:42 PM, Sami Siren, licensed for inclusion in ASF works
NUTCH-395-trunk-metadata-only-2.patch (33 kB), 2006-11-12 08:31 PM, Sami Siren, licensed for inclusion in ASF works
NUTCH-395-trunk-metadata-only.patch (32 kB), 2006-11-11 08:56 AM, Sami Siren, licensed for inclusion in ASF works
Issue Links:
Reference
 

Resolution Date: 13/Nov/06 07:48 PM


Description
There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch and not trunk, and it has not been tested at scale. What does it change?

Metadata - the original Metadata spell-checks keys; the new version does not. A decorator that can do the spell-checking is provided, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required.

Reading/writing various data structures - the patch tries to do IO more efficiently; see the patch for details.
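
As a rough illustration of the Metadata split described above, here is a minimal sketch, not the code from the attached patch. The class names SimpleMetadata and KeyCorrectingMetadata, the list of canonical header names, and the key-squashing rule are all assumptions made up for this example; the real implementation may use a different matching strategy. The point is only that exact-match lookups stay cheap, and callers that need tolerant keys opt into the extra work.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plain metadata: multi-valued map, exact-match keys, no correction. */
class SimpleMetadata {
  private final Map<String, List<String>> values = new HashMap<>();

  public void add(String name, String value) {
    values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Returns the first value stored under the exact key, or null if absent. */
  public String get(String name) {
    List<String> v = values.get(name);
    return (v == null || v.isEmpty()) ? null : v.get(0);
  }
}

/** Hypothetical variant that maps loosely spelled keys to a canonical name first. */
class KeyCorrectingMetadata extends SimpleMetadata {
  // Assumed set of canonical names, for illustration only.
  private static final Map<String, String> CANONICAL = new HashMap<>();
  static {
    for (String n : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
      CANONICAL.put(squash(n), n);
    }
  }

  /** Drops everything except letters and digits, then lower-cases. */
  private static String squash(String name) {
    return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
  }

  /** Returns the canonical spelling if one is known, else the name unchanged. */
  static String normalize(String name) {
    String canonical = CANONICAL.get(squash(name));
    return canonical != null ? canonical : name;
  }

  @Override
  public void add(String name, String value) {
    super.add(normalize(name), value);
  }

  @Override
  public String get(String name) {
    return super.get(normalize(name));
  }
}
```

With this sketch, a value added under "content_type" to a KeyCorrectingMetadata can be read back as "Content-Type", while SimpleMetadata treats the two spellings as different keys and pays no normalization cost per call.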

Initial benchmark:

A small benchmark was done to measure the performance of the changes, using a script that basically does the following:
- inject a list of URLs into a fresh crawldb
- create a fetchlist (10k URLs pointing to the local filesystem)
- fetch
- updatedb

original code from 0.8-branch:
real 10m51.907s
user 10m9.914s
sys 0m21.285s

after applying the patch
real 4m15.313s
user 3m42.598s
sys 0m18.485s



Sami Siren added a comment - 29/Oct/06 08:42 PM
a rough patch for testing purposes

Sami Siren made changes - 29/Oct/06 08:42 PM
Field Original Value New Value
Attachment nutch-0.8-performance.txt [ 12343848 ]
Andrzej Bialecki added a comment - 30/Oct/06 09:46 AM
I have several comments to this patch:
  • have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput?
  • I think it's a good idea to separate two concerns with PlainMetadata / MetadataSpellChecker. Since the latter is a subclass I think it would be more appropriate to name it SpellCheckedMetadata.
  • I'd also argue for keeping the name Metadata and just replace the body of the class with PlainMetadata implementation - this way we could avoid changing the API in so many places; for compatibility we could just bump the version number in Metadata. We could then avoid also changes to version id-s of other classes that rely on Metadata, such as Content, ParseData et al.
  • new Metadata / SpellCheckedMetadata need JUnit tests - this is important, because many other classes rely on proper working of these classes.
  • Fetcher.VoidReducer is not needed - I'm guessing you wanted to use it just for logging.
  • please observe formatting rules, especially whitespace rules - this patch doesn't follow them.

Sami Siren added a comment - 31/Oct/06 04:52 PM
>have you measured what made the biggest impact on performance - changes to Metadata, or
>changes to IO in FetcherOutput?
I have not had time yet; I would guess that the IO changes account for the most significant part.

>I'd also argue for keeping the name Metadata and just replace the body of the class with PlainMetadata
>implementation - this way we could avoid changing the API in so many places; for compatibility we could
>just bump the version number in Metadata. We could then avoid also changes to version id-s of other
>classes that rely on Metadata, such as Content, ParseData et al.

The API of the new metadata class is exactly the same, but the functionality changed, so I decided to make an entirely new class. But yes, I agree: it is much cleaner to replace the guts of the Metadata class.

>new Metadata / SpellCheckedMetadata need JUnit tests - this is important, because many other classes rely
>on proper working of these classes.
Sure, there were supposed to be some already in the patch, but I forgot to svn add them.

Now that I remember, there was one more odd thing in the current implementation: the max number of links is not enforced when writing outlinks, only when reading them. I am planning to change this as well, so that the number of links is enforced on write.

>Fetcher.VoidReducer is not needed - I'm guessing you wanted to use it just for logging.
true

>please observe formatting rules, especially whitespace rules - this patch doesn't follow them.

Will do. As I said, this was not meant to be a demonstration of nice formatting or Java coding; I just wanted to throw out the
findings for people to try. I'll start working on a new version against trunk and will do it with a more focused mindset.


Andrzej Bialecki added a comment - 31/Oct/06 06:46 PM
> Now that I remember, there was one more odd thing in current implementation: the max number
> of links was not enforced when writing outlinks only when reading them, I am planning to change
> this also so the number of links is enforced on write.

AFAIK this was done on purpose, to facilitate processing of existing data created with different settings. I.e. if someone created a segment with high max # of outlinks, you should still be able to read it and process all outlinks. If you enforce the max # during reading you won't be able to process this data.


Sami Siren added a comment - 31/Oct/06 07:04 PM
> settings. I.e. if someone created a segment with high max # of outlinks, you should still be able
> to read it and process all outlinks. If you enforce the max # during reading you won't be able
> to process this data.

Yes, I agree, but IMO we should also not store more than the configured max # of links. Right now it seems we
store them all (or am I just not seeing it?).
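
To make the two positions concrete, here is a minimal sketch of "cap on write, accept anything on read". It is not Nutch's actual serialization code: the class name OutlinkIo, the use of plain strings for outlinks, and the count-prefixed layout are assumptions for illustration only.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; outlinks are modeled as plain URL strings. */
class OutlinkIo {

  /** Writes at most maxOutlinks entries; the stored count matches what was written. */
  static void write(DataOutput out, List<String> outlinks, int maxOutlinks)
      throws IOException {
    int n = Math.max(0, Math.min(outlinks.size(), maxOutlinks));
    out.writeInt(n);
    for (int i = 0; i < n; i++) {
      out.writeUTF(outlinks.get(i));
    }
  }

  /** Reads every stored entry, whatever maximum is configured today. */
  static List<String> read(DataInput in) throws IOException {
    int n = in.readInt();
    List<String> result = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      result.add(in.readUTF());
    }
    return result;
  }
}
```

Under this arrangement, a segment written earlier with a higher limit can still be read in full, while newly written data never exceeds the limit that was configured when it was written.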


Sami Siren added a comment - 10/Nov/06 04:44 PM
>>have you measured what made the biggest impact on performance - changes to Metadata, or
>>changes to IO in FetcherOutput?
>I have not had time yet; I would guess that the IO changes account for the most significant part.

After more digging, it looks like my initial guess was not correct. Without touching IO at all,
I get the same improvement on trunk (compared to the nightly builds) as
I reported earlier on the 0.8 branch.

This is good, because we don't need to change file formats at all.


Sami Siren added a comment - 11/Nov/06 08:56 AM
Here's a first stab at an svn trunk version of the patch. It only optimizes the use of metadata and splits it into two functionally distinct pieces: one for plain metadata and one for spellchecking over the metadata keys.

There's probably still room for optimization on both the metadata and the IO side.

The same local filesystem fetching benchmark was run as earlier, this time on the trunk version. Even though the benchmark was run with file:// URLs, the change should benefit other protocols as well, specifically because it seems to cut down the time needed for the reduce phase quite aggressively.

I would also recommend adding some kind of base benchmark for crawling operations to nutch so we don't kill the performance (again and again) at some point.

from svn trunk
----------------------
real 10m43.527s
user 10m11.210s
sys 0m21.837s

fetch breakdown:
5 min 19 sec effective fetching
7 sec sort
4 min 30 sec reduce > reduce

patched version
----------------------
real 4m53.742s
user 4m21.340s
sys 0m19.045s

fetch breakdown:
3 min 36 sec effective fetching
8 sec sort
27 sec reduce > reduce
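
The per-phase gain can be worked out directly from the figures quoted above. The following is a small arithmetic sketch only; the class and method names are made up for this example, and the inputs are the timings reported in this comment.

```java
import java.util.Locale;

/** Hypothetical helper for comparing the trunk and patched timings quoted above. */
class PhaseSpeedup {

  /** Converts a "XmY.Zs" style timing into seconds. */
  static double seconds(int minutes, double secs) {
    return minutes * 60 + secs;
  }

  /** Ratio of time before the patch to time after it. */
  static double speedup(double before, double after) {
    return before / after;
  }

  public static void main(String[] args) {
    double wall = speedup(seconds(10, 43.527), seconds(4, 53.742));
    double fetch = speedup(seconds(5, 19), seconds(3, 36));
    double reduce = speedup(seconds(4, 30), seconds(0, 27));
    System.out.println(String.format(Locale.ROOT,
        "wall clock %.2fx, fetching %.2fx, reduce %.2fx", wall, fetch, reduce));
  }
}
```

The reduce phase drops from 270 seconds to 27 seconds, a factor of 10, which is where most of the overall gain comes from.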


Sami Siren made changes - 11/Nov/06 08:56 AM
Attachment NUTCH-395-trunk-metadata-only.patch [ 12344791 ]
Sami Siren made changes - 11/Nov/06 08:57 AM
Affects Version/s 0.9.0 [ 12312013 ]
Sami Siren made changes - 11/Nov/06 09:01 AM
Link This issue relates to NUTCH-398 [ NUTCH-398 ]
Sami Siren added a comment - 12/Nov/06 08:31 PM
Additional change to Content cuts down time needed in effective fetching. Now seeing speeds like 45 pages/sec also on http.

real 4m24.126s
user 3m53.835s
sys 0m18.681s

3 min 10 sec effective fetching
6 sec sorting
27 sec reduce > reduce


Sami Siren made changes - 12/Nov/06 08:31 PM
Attachment NUTCH-395-trunk-metadata-only-2.patch [ 12344839 ]
Andrzej Bialecki added a comment - 13/Nov/06 09:59 AM
+1 - this patch looks good to me - if you could just fix the whitespace issues prior to committing, so that it conforms to the coding style ...

Repository Revision Date User Message
ASF #474464 Mon Nov 13 19:46:56 UTC 2006 siren NUTCH-395 Increase fetching speed
Files Changed
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestMetadata.java
ADD /lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/Metadata.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/protocol/TestContent.java
MODIFY /lucene/nutch/trunk/CHANGES.txt

Sami Siren added a comment - 13/Nov/06 07:48 PM
applied to trunk with some additional whitespace changes.

Sami Siren made changes - 13/Nov/06 07:48 PM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Fix Version/s 0.9.0 [ 12312013 ]
Sami Siren made changes - 18/Apr/07 03:44 PM
Status Resolved [ 5 ] Closed [ 6 ]