Patch updated for latest trunk.
btw, for http://www.variety.com/
http:/ Since we will not distribute score to these, this patch may also slightly improve scoring.
New patch. This is sort of a release candidate; if there are no objections, I think this patch can go in as it is.
The biggest change is that ParseData is no longer a Configurable. In the current implementation, when a parse data comes to ParseOutputFormat, it contains at most db.max.outlinks.per.page outlinks; after filtering, ParseOutputFormat outputs whatever remains. For example, in a situation where ignoreExternalLinks is true and the first hundred links (assuming db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted, even if there are internal urls past the 100th outlink. So now parse data reads all outlinks, and ParseOutputFormat processes them and outputs at most db.max.outlinks.per.page outlinks (the resulting parse data also contains at most db.max.outlinks.per.page outlinks). I think this is a better approach, but it may be a bit slower. Besides this change, the UrlValidator code is cleaned up and moved into the org.apache.nutch.net package. Also, outlinks are not normalized in ParseOutputFormat, since they are already normalized in Outlink.Outlink; there is no point in normalizing them twice.
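To make the difference between the two orderings concrete, here is a minimal sketch. It is not the Nutch code: the class name, method names, and example URLs are hypothetical, and a plain Predicate stands in for the ignoreExternalLinks check and the url filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Nutch.
public class OutlinkLimitSketch {

    // Old ordering: keep only the first `max` links, then filter what is left.
    static List<String> limitThenFilter(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        int end = Math.min(max, links.size());
        for (int i = 0; i < end; i++) {
            String link = links.get(i);
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    // New ordering: look at every link, stop once `max` links have been accepted.
    static List<String> filterThenLimit(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String link : links) {
            if (out.size() >= max) {
                break;
            }
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://other.example/a",   // external
            "http://other.example/b",   // external
            "http://site.example/c");   // internal, past the limit of 2
        // Stand-in for "ignore external links".
        Predicate<String> internalOnly = s -> s.startsWith("http://site.example/");

        System.out.println(limitThenFilter(links, 2, internalOnly)); // prints []
        System.out.println(filterThenLimit(links, 2, internalOnly)); // prints [http://site.example/c]
    }
}
```

With a limit of 2, the old ordering discards the internal link before the filter ever sees it, which is the situation described above. The new ordering has to examine more links, which is where the possible slowdown comes from.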
Other than that, the patch looks great, +1 for committing it after fixing these issues.
New version of the patch. As Andrzej has pointed out, db.max.outlinks.per.page is read once per getRecordWriter now.
> * you should increase the version number of ParseData, and add a code to read the current version
This patch doesn't change how parse data reads outlinks. Before this patch, parse data used to read db.max.outlinks.per.page outlinks, then skip over the rest (as in, read each outlink and then ignore it). After this patch, parse data reads all outlinks. So, I/O behaviour is the same.
Integrated in Nutch-Nightly #147 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/)
After my last commit, I read that Sun's java.util.regex implementation is actually faster than jakarta-oro. So, I changed UrlValidator to use java.util.regex instead of jakarta-oro. I made some simple tests and java.util.regex really seems to be faster. I also added some basic optimizations to ParseOutputFormat (added initialCapacity arguments to ArrayLists to reduce the number of allocations).
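As a rough illustration of the two optimizations mentioned here, the sketch below precompiles a java.util.regex Pattern once and sizes an ArrayList up front. The class name, method names, and the scheme pattern are assumptions made for this example; they are not taken from UrlValidator or ParseOutputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual UrlValidator code.
public class RegexValidatorSketch {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // The expression itself is an assumed example of a scheme check.
    private static final Pattern SCHEME = Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*$");

    static boolean isValidScheme(String scheme) {
        return scheme != null && SCHEME.matcher(scheme).matches();
    }

    static List<String> validSchemes(List<String> candidates) {
        // The initial capacity avoids repeated growth of the backing array
        // when the upper bound on the result size is already known.
        List<String> out = new ArrayList<>(candidates.size());
        for (String candidate : candidates) {
            if (isValidScheme(candidate)) {
                out.add(candidate);
            }
        }
        return out;
    }
}
```

Whether this is actually faster than jakarta-oro depends on the patterns and the JVM, so it is worth measuring on real urls rather than assuming.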
Is it necessary to reopen this issue or open another issue for this? I think this one is simple enough to commit without opening a separate issue, but feel free to disagree. Also, I realized that UrlValidator considers http://www.iiit.net/images/CCCCCC_line_br[1].gif
Automaton (http://www.brics.dk/automaton/
It doesn't support all regex, but most.
Thanks for the suggestion. Automaton really looks good, but using automaton in UrlValidator would mean bringing the automaton jar into nutch core (it currently resides in the lib directory of the urlfilter-automaton plugin). I am not sure if that's OK with everyone.
New and final version. I shuffled some code around in ParseOutputFormat for better performance, and updated some regex patterns in UrlValidator.
I am also attaching a file showing which urls are filtered from a sample 2000 url parse.
Please test Java 1.5 and Java 1.6 - IIRC there are some differences in performance of java.util.regex between these two versions.
Andrzej, in my tests, java.util.regex is faster on both Java 1.5 and Java 1.6.
And btw, I added ( and ) as valid path characters to the relevant regex pattern because nutch was able to fetch a url containing them.
Latest patch (for optimization) is committed in rev. 555969.
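For readers who want to see what allowing ( and ) in the path means at the regex level, here is a small sketch. Both patterns are simplified assumptions written for this example; the real path pattern in UrlValidator is not reproduced here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; these are not the patterns used by UrlValidator.
public class PathPatternSketch {

    // Assumed path pattern without parentheses in the allowed character set.
    static final Pattern PATH_STRICT = Pattern.compile("^(/[-\\w.~%]*)*$");

    // Same pattern with ( and ) added as valid path characters.
    static final Pattern PATH_WITH_PARENS = Pattern.compile("^(/[-\\w.~%()]*)*$");

    static boolean matches(Pattern pattern, String path) {
        return pattern.matcher(path).matches();
    }

    public static void main(String[] args) {
        String path = "/wiki/Foo_(bar)";
        System.out.println(matches(PATH_STRICT, path));      // prints false
        System.out.println(matches(PATH_WITH_PARENS, path)); // prints true
    }
}
```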
This patch is tested very lightly, so it probably doesn't work great yet. Comments, reviews, suggestions are welcome.