Issue Details

Key: NUTCH-505
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Doğacan Güney
Reporter: Doğacan Güney
Votes: 0
Watchers: 0
Nutch

Outlink urls should be validated

Created: 23/Jun/07 08:14 PM   Updated: 10/Apr/09 12:29 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
  File                       Date                 Author         Size
  filtered.txt               2007-07-12 03:07 PM  Doğacan Güney  80 kB
  NUTCH-505-v2.patch         2007-07-12 12:16 PM  Doğacan Güney  8 kB
  NUTCH-505-v3.patch         2007-07-12 03:07 PM  Doğacan Güney  10 kB
  NUTCH-505.patch            2007-07-10 07:10 PM  Doğacan Güney  30 kB
  NUTCH-505.patch            2007-07-10 12:40 PM  Doğacan Güney  30 kB
  NUTCH-505_draft.patch      2007-06-23 08:19 PM  Doğacan Güney  22 kB
  NUTCH-505_draft_v2.patch   2007-06-24 01:39 PM  Doğacan Güney  21 kB

Resolution Date: 11/Jul/07 10:54 AM


Description
See discussion here:
http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html

Parse plugins may extract garbage urls from pages. We need a url validation system that tests these urls and filters out garbage.



Doğacan Güney added a comment - 23/Jun/07 08:19 PM
Initial draft patch.
  • Uses UrlValidator class from apache commons validator.
  • ParseOutputFormat first checks if an outlink is valid. If it is, then it runs normalizers and urlfilters on the url.

This patch is tested very lightly, so it probably doesn't work great yet. Comments, reviews, suggestions are welcome.
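As a rough illustration of the ordering described above (validate first, then normalize and filter), here is a minimal sketch. It is not the code from the patch: the class name OutlinkPipelineSketch, the method process, and the toy validator, normalizer and filter lambdas are all hypothetical stand-ins for the commons-validator check and the Nutch normalizer/urlfilter plugins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the ordering only; not the code from the patch.
class OutlinkPipelineSketch {

    // Validate first, so garbage never reaches the normalizer or the filter.
    static List<String> process(List<String> outlinks,
                                Predicate<String> validator,
                                UnaryOperator<String> normalizer,
                                Predicate<String> urlFilter) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (!validator.test(url)) {
                continue; // invalid: drop before doing any further work
            }
            String normalized = normalizer.apply(url);
            if (normalized != null && urlFilter.test(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> kept = process(
            Arrays.asList("http://Example.com/a", "not a url", "http://example.com/skip"),
            u -> u.startsWith("http://") && !u.contains(" "), // toy validator (assumption)
            u -> u.toLowerCase(),                              // toy normalizer (assumption)
            u -> !u.endsWith("/skip"));                        // toy url filter (assumption)
        System.out.println(kept);
    }
}
```

Passing the three stages in as functions keeps the sketch independent of any particular validator implementation.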


Doğacan Güney added a comment - 24/Jun/07 01:39 PM
Patch updated for latest trunk.

Doğacan Güney added a comment - 25/Jun/07 08:08 AM
btw, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com/</div>
http://www.variety.com/</div></a>
varietycomments@reedbusiness.com
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly improve scoring.
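To make the list above concrete, here is a small sketch that rejects the same six strings using only java.net.URI from the JDK. This is a stand-in, not the UrlValidator used by the patch: the class StrictUrlCheckSketch and the method looksFetchable are hypothetical, and the rule (parseable, http or https scheme, non-null host) is an assumption chosen for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for the validator; uses only java.net.URI.
class StrictUrlCheckSketch {

    // Accept only strings that parse as a URI with an http(s) scheme and a host.
    static boolean looksFetchable(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // illegal characters such as '<' or a space
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.variety.com/",
            "http:/",
            "http://www.variety.com/</div>",
            "http://www.variety.com/</div></a>",
            "varietycomments@reedbusiness.com",
            "http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?",
            "http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?"
        };
        for (String s : samples) {
            System.out.println(looksFetchable(s) + "  " + s);
        }
    }
}
```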


Doğacan Güney added a comment - 10/Jul/07 12:40 PM
New patch. This is sort of a release candidate; if there are no objections, I think this patch can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the current implementation, when a parse data comes to ParseOutputFormat, it contains at most db.max.outlinks.per.page outlinks; after filtering, ParseOutputFormat outputs whatever remains.

For example, in a situation where ignoreExternalLinks is true and the first hundred links (assuming db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted, even if there are internal urls past the 100th outlink.

So, now parse data reads all outlinks, and ParseOutputFormat processes them and outputs at most db.max.outlinks.per.page outlinks (the resulting parse data also contains at most db.max.outlinks.per.page outlinks). I think this is a better approach but it may be a bit slower.
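The difference between the two orderings can be sketched as follows. This is an illustration under assumptions, not the patch itself: OutlinkCapSketch, capThenFilter and filterThenCap are hypothetical names, outlinks are plain strings, maxOutlinks is assumed non-negative, and the "internal link" test is a simple prefix check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch comparing the old and new ordering of cap and filter.
class OutlinkCapSketch {

    // Old behaviour: truncate to maxOutlinks first, then filter what is left.
    static List<String> capThenFilter(List<String> outlinks,
                                      Predicate<String> keep, int maxOutlinks) {
        List<String> stored = new ArrayList<>();
        int limit = Math.min(maxOutlinks, outlinks.size());
        for (int i = 0; i < limit; i++) {
            String url = outlinks.get(i);
            if (keep.test(url)) {
                stored.add(url);
            }
        }
        return stored;
    }

    // New behaviour: look at every outlink, stop once maxOutlinks were kept.
    static List<String> filterThenCap(List<String> outlinks,
                                      Predicate<String> keep, int maxOutlinks) {
        List<String> stored = new ArrayList<>(Math.min(maxOutlinks, outlinks.size()));
        for (String url : outlinks) {
            if (stored.size() >= maxOutlinks) {
                break;
            }
            if (keep.test(url)) {
                stored.add(url);
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 100; i++) links.add("http://other.example/" + i); // external
        for (int i = 0; i < 5; i++) links.add("http://site.example/" + i);    // internal
        Predicate<String> internalOnly = u -> u.startsWith("http://site.example/");
        System.out.println("cap then filter: " + capThenFilter(links, internalOnly, 100).size());
        System.out.println("filter then cap: " + filterThenCap(links, internalOnly, 100).size());
    }
}
```

With 100 external links followed by 5 internal ones, the first method keeps nothing while the second keeps the 5 internal links, which is the case described above.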

Besides this change, UrlValidator code is cleaned up and moved into org.apache.nutch.net package. Also, outlinks are not normalized in ParseOutputFormat since they are already normalized in Outlink.Outlink. There is no point in normalizing them twice.


Andrzej Bialecki added a comment - 10/Jul/07 01:50 PM
  • In ParseOutputFormat, the calculation of outlinksToStore should not make repeated calls to job.getInt() - the value of db.max.outlinks.per.page should be retrieved once per invocation of getRecordWriter().
  • You should increase the version number of ParseData, and add code to read the current version of ParseData. Otherwise the updated code won't be able to read older segments.

Other than that, the patch looks great, +1 for committing it after fixing these issues.


Doğacan Güney added a comment - 10/Jul/07 07:10 PM - edited
New version of the patch. As Andrzej has pointed out, db.max.outlinks.per.page is read once per getRecordWriter now.
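A minimal sketch of the "read once per getRecordWriter" idea, assuming a stand-in configuration class rather than the real Hadoop job configuration. CountingConf, OutlinkWriter and the default value of 100 are hypothetical and exist only to make the number of lookups visible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; CountingConf stands in for the real job configuration.
class ReadConfigOnceSketch {

    // Counts lookups so the effect of hoisting the read is observable.
    static class CountingConf {
        private final Map<String, Integer> values = new HashMap<>();
        int lookups = 0;

        void setInt(String key, int value) {
            values.put(key, value);
        }

        int getInt(String key, int defaultValue) {
            lookups++;
            Integer v = values.get(key);
            return v == null ? defaultValue : v;
        }
    }

    interface OutlinkWriter {
        int outlinksToStore(int available);
    }

    // The setting is read once here and captured by the returned writer,
    // instead of being looked up again for every record written.
    static OutlinkWriter getRecordWriter(CountingConf job) {
        final int maxOutlinks = job.getInt("db.max.outlinks.per.page", 100);
        return available -> Math.min(maxOutlinks, available);
    }

    public static void main(String[] args) {
        CountingConf conf = new CountingConf();
        conf.setInt("db.max.outlinks.per.page", 100);
        OutlinkWriter writer = getRecordWriter(conf);
        for (int i = 0; i < 1000; i++) {
            writer.outlinksToStore(250);
        }
        System.out.println("config lookups: " + conf.lookups);
    }
}
```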

> * you should increase the version number of ParseData, and add code to read the current version
> of ParseData. Otherwise the updated code won't be able to read older segments.

This patch doesn't change how parse data reads outlinks. Before this patch, parse data used to read db.max.outlinks.per.page outlinks then skip over (as in read the outlink then ignore it) the rest. After this patch, parse data reads all outlinks. So, I/O behaviour is the same.


Doğacan Güney added a comment - 11/Jul/07 10:54 AM
Committed in rev. 555237.

Hudson added a comment - 12/Jul/07 06:48 AM

Doğacan Güney added a comment - 12/Jul/07 12:16 PM
After my last commit, I read that Sun's java.util.regex implementation is actually faster than jakarta-oro. So, I changed UrlValidator to use java.util.regex instead of jakarta-oro. I made some simple tests and java.util.regex really seems to be faster. I also added some basic optimizations to ParseOutputFormat (added initialCapacity arguments to ArrayLists to reduce the number of allocations).
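For readers unfamiliar with the two optimizations mentioned here, a small sketch follows: a java.util.regex Pattern compiled once and reused, and an ArrayList created with an initial capacity. The pattern below is a simplified, hypothetical one written for this example, not the pattern used in UrlValidator, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; the regex is a simplified assumption, not Nutch's.
class RegexValidatorSketch {

    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern HTTP_URL = Pattern.compile(
        "^https?://[A-Za-z0-9.-]+(:\\d+)?(/[A-Za-z0-9._~!$&'()*+,;=:@%/-]*)?(\\?[^\\s<>]*)?$");

    static boolean isValid(String url) {
        return HTTP_URL.matcher(url).matches();
    }

    static List<String> keepValid(List<String> candidates) {
        // Initial capacity avoids repeated array growth while filling the list.
        List<String> valid = new ArrayList<>(candidates.size());
        for (String url : candidates) {
            if (isValid(url)) {
                valid.add(url);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        System.out.println(keepValid(Arrays.asList(
            "http://www.variety.com/",
            "http://www.variety.com/</div>",
            "http:/")));
    }
}
```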

Is it necessary to reopen this issue or open another issue for this? I think this one is simple enough to commit without opening a separate issue, but feel free to disagree.

Also, I realized that UrlValidator considers http://www.iiit.net/images/CCCCCC_line_br[1].gif invalid, even though Firefox will display the gif (Firefox escapes the path, then fetches the escaped url). This doesn't seem to be a problem right now since nutch can't fetch these urls anyway, but we may consider adding some sort of smart escaping later.
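One possible shape for such escaping, sketched with the multi-argument java.net.URI constructor, which quotes characters that are illegal in a path. This is only an assumption about what "smart escaping" could look like: PathEscapeSketch and escapePath are hypothetical, the split into scheme, host and path is deliberately naive (no port, query or fragment handling), and an existing '%' in the input would be escaped again.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of path escaping; handles scheme://host/path only.
class PathEscapeSketch {

    static String escapePath(String rawUrl) throws URISyntaxException {
        int schemeEnd = rawUrl.indexOf("://");
        if (schemeEnd < 0) {
            throw new URISyntaxException(rawUrl, "no scheme separator found");
        }
        int pathStart = rawUrl.indexOf('/', schemeEnd + 3);
        String scheme = rawUrl.substring(0, schemeEnd);
        String host = pathStart < 0
                ? rawUrl.substring(schemeEnd + 3)
                : rawUrl.substring(schemeEnd + 3, pathStart);
        String path = pathStart < 0 ? "/" : rawUrl.substring(pathStart);
        // The multi-argument constructor percent-encodes illegal path characters.
        return new URI(scheme, host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(
            escapePath("http://www.iiit.net/images/CCCCCC_line_br[1].gif"));
    }
}
```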


Espen Amble Kolstad added a comment - 12/Jul/07 12:32 PM
Automaton (http://www.brics.dk/automaton/), used in AutomatonURLFilter, is even faster if you preparse the regexes.
It doesn't support all regex syntax, but it supports most of it.

Doğacan Güney added a comment - 12/Jul/07 12:39 PM
Thanks for the suggestion. Automaton really looks good, but using automaton in UrlValidator will mean bringing automaton jar inside nutch core (it currently resides in plugin urlfilter-automaton's lib). I am not sure if that's OK with everyone.

Doğacan Güney added a comment - 12/Jul/07 03:07 PM
New and final version. I shuffled some code around in ParseOutputFormat for better performance, and updated some regex patterns in UrlValidator.

I am also attaching a file showing which urls are filtered from a sample 2000 url parse.


Andrzej Bialecki added a comment - 12/Jul/07 03:17 PM
Please test on both Java 1.5 and Java 1.6 - IIRC there are some differences in performance of java.util.regex between these two versions.

Doğacan Güney added a comment - 12/Jul/07 06:20 PM
Andrzej, in my tests, java.util.regex is faster on both Java 1.5 and Java 1.6.

And btw, I added ( and ) as valid path characters to the relevant regex pattern because nutch was able to fetch a url containing them.


Doğacan Güney added a comment - 13/Jul/07 12:26 PM
Latest patch (for optimization) is committed in rev. 555969.