Sami Siren added a comment - 29/Oct/06 08:42 PM
a rough patch for testing purposes
Sami Siren made changes - 29/Oct/06 08:42 PM
I have several comments on this patch:
>have you measured what made the biggest impact on performance - changes to Metadata, or
>changes to IO in FetcherOutput?
Did not have time yet; I would guess that the IO changes make up the most significant part.
>I'd also argue for keeping the name Metadata and just replace the body of the class with PlainMetadata
The API of the new metadata is exactly the same, but the functionality changed, so I decided to make a totally new class. But yes, I agree here: it's much cleaner to replace the guts of the Metadata class.
>new Metadata / SpellCheckedMetadata need JUnit tests - this is important, because many other classes rely
Now that I remember, there was one more odd thing in the current implementation: the max number of links was not enforced when writing outlinks, only when reading them. I am planning to change this as well, so that the number of links is enforced on write.
>Fetcher.VoidReducer is not needed - I'm guessing you wanted to use it just for logging.
>please observe formatting rules, especially whitespace rules - this patch doesn't follow them.
Will do. As I said, this was not meant to be a demonstration of nice formatting or Java coding, just wanted to throw out the
> Now that I remember, there was one more odd thing in current implementation: the max number
> of links was not enforced when writing outlinks only when reading them, I am planning to change
> this also so the number of links is enforced on write.
AFAIK this was done on purpose, to facilitate processing of existing data created with different settings. I.e. if someone created a segment with a high max # of outlinks, you should still be able to read it and process all outlinks. If you enforce the max # during reading you won't be able to process this data.
> settings. I.e. if someone created a segment with high max # of outlinks, you should still be able
> to read it and process all outlinks. If you enforce the max # during reading you won't be able
> to process this data.
Yes, I agree, but IMO we should also not store more than the configured max # of links; now it seems we
>>have you measured what made the biggest impact on performance - changes to Metadata, or
>>changes to IO in FetcherOutput?
>did not have time yet, I would guess that IO changes make most significant part.
After more digging, my initial guess might not have been correct. By not touching IO at all
This is good, because we don't need to change file formats at all.
Here's a first stab at an svn trunk version of Nutch that just optimizes the use of metadata and splits it into two functionally distinct pieces: one for plain metadata and one for spellchecking over the keys of metadata.
There's probably still room for optimization on both the metadata and the IO side. The same local filesystem fetching bench was run as earlier, this time on the trunk version. Even if the benchmark was run with file://
I would also recommend adding some kind of base benchmark for crawling operations to Nutch so we don't kill the performance (again and again) at some point.
from svn trunk fetch breakdown:
patched version fetch breakdown:
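To illustrate the split described above, here is a minimal sketch of a plain metadata container with a subclass that spell-checks keys. It is not the code from the attached patch: the method names, the list of known header names, and the normalization rule (lowercase the key and drop non-alphanumeric characters, then look it up in a table) are all assumptions made for illustration. A real implementation might use an edit-distance comparison instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a plain multi-valued metadata map with no key correction.
class PlainMetadata {
  private final Map<String, String[]> values = new HashMap<>();

  public void add(String name, String value) {
    String[] old = values.get(name);
    if (old == null) {
      values.put(name, new String[] { value });
    } else {
      // Append the new value to the existing array for this key.
      String[] grown = Arrays.copyOf(old, old.length + 1);
      grown[old.length] = value;
      values.put(name, grown);
    }
  }

  public String get(String name) {
    String[] v = values.get(name);
    return v == null ? null : v[0];
  }

  public String[] getValues(String name) {
    String[] v = values.get(name);
    return v == null ? new String[0] : v.clone();
  }
}

// Hypothetical sketch: same API, but keys are mapped to a canonical spelling first.
class SpellCheckedMetadata extends PlainMetadata {
  // Assumed set of well-known names; the real list is not given in the thread.
  private static final Map<String, String> CANONICAL = new HashMap<>();
  static {
    for (String n : new String[] {
        "Content-Type", "Content-Length", "Last-Modified", "Location" }) {
      CANONICAL.put(simplify(n), n);
    }
  }

  // Lowercase the name and keep only letters and digits.
  static String simplify(String name) {
    StringBuilder sb = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (Character.isLetterOrDigit(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  // Return the canonical spelling if the simplified name is known, else the name unchanged.
  static String normalize(String name) {
    String canonical = CANONICAL.get(simplify(name));
    return canonical != null ? canonical : name;
  }

  @Override public void add(String name, String value) { super.add(normalize(name), value); }
  @Override public String get(String name) { return super.get(normalize(name)); }
  @Override public String[] getValues(String name) { return super.getValues(normalize(name)); }
}
```

With this layout, callers that never see misspelled keys can use the plain class and skip the per-call normalization cost, while code that handles raw protocol headers can opt into the spell-checked subclass.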
Sami Siren made changes - 11/Nov/06 08:56 AM
Sami Siren made changes - 11/Nov/06 08:57 AM
Sami Siren made changes - 11/Nov/06 09:01 AM
Additional change to Content cuts down time needed in effective fetching. Now seeing speeds like 45 pages/sec also on http.
real 4m24.126s
3 min 10 sec effective fetching
Sami Siren made changes - 12/Nov/06 08:31 PM
+1 - this patch looks good to me - if you could just fix the whitespace issues prior to committing, so that it conforms to the coding style ...
applied to trunk with some additional whitespace changes.
Sami Siren made changes - 13/Nov/06 07:48 PM
Sami Siren made changes - 18/Apr/07 03:44 PM