Issue Details

Key: NUTCH-505
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Doğacan Güney
Reporter: Doğacan Güney
Votes: 0
Watchers: 0

Nutch

Outlink urls should be validated

Created: 23/Jun/07 08:14 PM   Updated: 10/Apr/09 12:29 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
Name                        Date                 Author         Size
filtered.txt                2007-07-12 03:07 PM  Doğacan Güney  80 kB
NUTCH-505-v2.patch          2007-07-12 12:16 PM  Doğacan Güney  8 kB
NUTCH-505-v3.patch          2007-07-12 03:07 PM  Doğacan Güney  10 kB
NUTCH-505.patch             2007-07-10 07:10 PM  Doğacan Güney  30 kB
NUTCH-505.patch             2007-07-10 12:40 PM  Doğacan Güney  30 kB
NUTCH-505_draft.patch       2007-06-23 08:19 PM  Doğacan Güney  22 kB
NUTCH-505_draft_v2.patch    2007-06-24 01:39 PM  Doğacan Güney  21 kB

Resolution Date: 11/Jul/07 10:54 AM


Description:
See discussion here:
http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html

Parse plugins may extract garbage URLs from pages. We need a URL validation system that tests the extracted URLs and filters out the garbage ones.
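
To make the idea concrete, here is a minimal sketch of what such a check could look like. It is not the UrlValidator class added by this issue; the class name OutlinkUrlCheck, the allowed schemes, and the "markup characters" rule are all assumptions chosen for illustration. The garbage example is modeled on the URL in the thread title linked above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch only: not the UrlValidator committed for NUTCH-505.
final class OutlinkUrlCheck {

    // Assumed policy: only these schemes count as crawlable outlinks.
    private static final Pattern SCHEME =
        Pattern.compile("^(?:https?|ftp)$", Pattern.CASE_INSENSITIVE);

    // Characters that usually mean markup leaked into an extracted URL.
    private static final Pattern MARKUP = Pattern.compile("[<>\"\\s]");

    // Returns true if the string parses as a URI with an allowed scheme and a host.
    static boolean isValid(String url) {
        if (url == null || url.isEmpty() || MARKUP.matcher(url).find()) {
            return false;
        }
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            return scheme != null
                && SCHEME.matcher(scheme).matches()
                && host != null
                && !host.isEmpty();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Keeps only the candidates that pass isValid, preserving their order.
    static List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>(candidates.size());
        for (String candidate : candidates) {
            if (isValid(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList(
            "http://www.variety.com/",
            "http://www.variety.com/</div></a>",
            "javascript:void(0)");
        System.out.println(filter(extracted)); // prints [http://www.variety.com/]
    }
}
```

A real validator would probably need stricter host and path rules than this; the sketch only shows the shape of a "test, then filter" step applied to parser output.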



Subversion Commits
Repository: ASF
Revision: #555237
Date: Wed Jul 11 10:54:37 UTC 2007
User: dogacan
Message: NUTCH-505 - Outlink urls should be validated.
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseImpl.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java
MODIFY /lucene/nutch/trunk/CHANGES.txt
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/net/UrlValidator.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/FetcherOutput.java
MODIFY /lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms/MSBaseParser.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/parse/TestParseData.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java
MODIFY /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java

Repository: ASF
Revision: #555969
Date: Fri Jul 13 12:25:45 UTC 2007
User: dogacan
Message: NUTCH-505 - Second part. Optimize UrlValidator by using java.util.regex instead of jakarta-oro. Use initialCapacity for ArrayList-s in ParseOutputFormat. Run url validation and filtering after other tests for better performance.
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/net/UrlValidator.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
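
The three optimizations named in the second commit message can be sketched as follows. This is an illustration under assumptions, not the committed ParseOutputFormat code: the class OutlinkSelection, the method select, the maxOutlinks parameter, and the choice of "self-link" as the cheap test are all hypothetical, and the regex is a simplified stand-in for a full validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch only: names and checks are assumptions, not Nutch code.
final class OutlinkSelection {

    // Compiled once with java.util.regex and reused for every outlink.
    private static final Pattern LOOKS_LIKE_URL =
        Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://[^\\s<>\"]+$");

    static List<String> select(String fromUrl, String[] outlinks, int maxOutlinks) {
        int limit = Math.max(0, Math.min(maxOutlinks, outlinks.length));
        // Initial capacity avoids repeated resizing of the backing array.
        List<String> kept = new ArrayList<>(limit);
        for (int i = 0; i < outlinks.length && kept.size() < limit; i++) {
            String to = outlinks[i];
            // Cheap tests first: null and self-link checks cost almost nothing.
            if (to == null || to.equals(fromUrl)) {
                continue;
            }
            // The more expensive regex validation runs only on the survivors.
            if (!LOOKS_LIKE_URL.matcher(to).matches()) {
                continue;
            }
            kept.add(to);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] outlinks = {
            "http://example.com/",
            "http://example.com/a",
            "not a url",
            "http://example.com/b",
            "http://example.com/c"
        };
        // prints [http://example.com/a, http://example.com/b]
        System.out.println(select("http://example.com/", outlinks, 2));
    }
}
```

Ordering the checks this way means the regex never runs on links that would be dropped anyway, and the loop stops as soon as the limit is reached.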